2008-05-14 04:43:55

by Hidehiro Kawai

Subject: [PATCH 0/4] jbd: possible filesystem corruption fixes (rebased)

This is the rebased version against 2.6.26-rc2, so there is no
essential difference from the previous post.
The previous post can be found at: http://lkml.org/lkml/2008/4/18/154
(The previous post may have been filtered out as spam due to a
problem with the mail submission.)


This patch set fixes several error handling problems. As a
result, we can save the filesystem from file data and structural
corruption, especially corruption caused by transient I/O errors.
Do transient I/O errors occur that often? They are at least not
uncommon with iSCSI storage.
These fixes cover only the ext3/JBD parts. The ext4/JBD2
version has not been prepared yet, but merging this patch set
is still worthwhile because it removes possible filesystem
corruption.

[PATCH 1/4] jbd: strictly check for write errors on data buffers
Without this patch, some file data in ordered mode isn't
checked for errors. This means user processes can continue to
update the filesystem without noticing the write failure.
Furthermore, the page cache we failed to write out becomes
reclaimable. If that page cache is then reclaimed and we later
succeed in reading its data back from the disk, data corruption
occurs because the data read back is old.
Jan's ordered mode rewrite patch also fixes this problem, but
this patch will be needed at least for the current kernel.

[PATCH 2/4] jbd: ordered data integrity fix
This patch fixes the ordered mode violation caused by a
write error. Jan's ordered mode rewrite patch will also
fix this problem.

[PATCH 3/4] jbd: abort when failed to log metadata buffers
Without this patch, the filesystem can become corrupted via
the following scenario:

1. a metadata buffer fails to be written to block B in the journal
2. the commit record is written successfully
3. the system crashes, reboots, and mounts the filesystem
4. in the recovery phase, block B is read successfully
5. the read data is written back to the filesystem, but it is
stale metadata
6. some files and directories are lost!

This problem wouldn't occur if we had JBD2's journal
checksumming feature and it were always turned on.

[PATCH 4/4] ext3/jbd: fix error handling for checkpoint io
Without this patch, the filesystem can lose metadata
updates even though their transactions have been committed.

Regards,

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center




2008-05-14 04:47:38

by Hidehiro Kawai

Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased)

Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers

In ordered mode, we should abort journaling when an I/O error has
occurred on a file data buffer in the committing transaction. But
there can be data buffers which are not checked for errors:

(a) a buffer which has already been written out by pdflush
(b) a buffer which has been unlocked before being scanned in the
t_locked_list loop

This patch adds the missing error checks and aborts journaling
appropriately.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/commit.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc2/fs/jbd/commit.c
@@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
/*
* Submit all the data buffers to disk
*/
-static void journal_submit_data_buffers(journal_t *journal,
+static int journal_submit_data_buffers(journal_t *journal,
transaction_t *commit_transaction)
{
struct journal_head *jh;
@@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
int locked;
int bufs = 0;
struct buffer_head **wbuf = journal->j_wbuf;
+ int err = 0;

/*
* Whenever we unlock the journal and sleep, things can get added
@@ -253,6 +254,8 @@ write_out_data:
put_bh(bh);
} else {
BUFFER_TRACE(bh, "writeout complete: unfile");
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
__journal_unfile_buffer(jh);
jbd_unlock_bh_state(bh);
if (locked)
@@ -271,6 +274,8 @@ write_out_data:
}
spin_unlock(&journal->j_list_lock);
journal_do_submit_data(wbuf, bufs);
+
+ return err;
}

/*
@@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
- err = 0;
- journal_submit_data_buffers(journal, commit_transaction);
+ err = journal_submit_data_buffers(journal, commit_transaction);

/*
* Wait for all previously submitted IO to complete.
@@ -426,10 +430,10 @@ void journal_commit_transaction(journal_
if (buffer_locked(bh)) {
spin_unlock(&journal->j_list_lock);
wait_on_buffer(bh);
- if (unlikely(!buffer_uptodate(bh)))
- err = -EIO;
spin_lock(&journal->j_list_lock);
}
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
if (!inverted_lock(journal, bh)) {
put_bh(bh);
spin_lock(&journal->j_list_lock);



2008-05-14 04:48:51

by Hidehiro Kawai

Subject: [PATCH 2/4] jbd: ordered data integrity fix (rebased)

Subject: [PATCH 2/4] jbd: ordered data integrity fix

In ordered mode, if a buffer being dirtied exists in the committing
transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it.
But we must not remove the buffer from the committing
transaction when the buffer couldn't be written out; otherwise
we break the ordered mode rule.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/transaction.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/transaction.c
+++ linux-2.6.26-rc2/fs/jbd/transaction.c
@@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
journal_t *journal = handle->h_transaction->t_journal;
int need_brelse = 0;
struct journal_head *jh;
+ int ret = 0;

if (is_handle_aborted(handle))
- return 0;
+ return ret;

jh = journal_add_journal_head(bh);
JBUFFER_TRACE(jh, "entry");
@@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
time if it is redirtied */
}

- /* journal_clean_data_list() may have got there first */
+ /*
+ * We shouldn't remove the buffer from the committing
+ * transaction if it has failed to be written.
+ * Otherwise, it breaks the ordered mode rule.
+ */
+ if (unlikely(!buffer_uptodate(bh))) {
+ ret = -EIO;
+ goto no_journal;
+ }
+
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
__journal_temp_unlink_buffer(jh);
@@ -1108,7 +1118,7 @@ no_journal:
}
JBUFFER_TRACE(jh, "exit");
journal_put_journal_head(jh);
- return 0;
+ return ret;
}

/**



2008-05-14 04:49:56

by Hidehiro Kawai

Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)

Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers

If we fail to write metadata buffers to the journal space but
succeed in writing the commit record, stale data can be written
back to the filesystem as metadata in the recovery phase.

To avoid this, when we fail to write out metadata buffers,
abort the journal before writing the commit record.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/commit.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux-2.6.26-rc2/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc2/fs/jbd/commit.c
@@ -703,6 +703,9 @@ wait_for_iobuf:
__brelse(bh);
}

+ if (err)
+ journal_abort(journal, err);
+
J_ASSERT (commit_transaction->t_shadow_list == NULL);

jbd_debug(3, "JBD: commit phase 5\n");



2008-05-14 04:50:43

by Hidehiro Kawai

Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io

When a checkpointing IO fails, the current JBD code doesn't check
the error and continues journaling. This means the latest metadata
can be lost from both the journal and the filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to:

1. not remove the failed buffer from the checkpoint list in the
case of __try_to_free_cp_buf(), because it may be released or
overwritten by a later transaction
2. treat log_do_checkpoint() as the last chance: remove the failed
buffer from the checkpoint list and abort journaling
3. when checkpointing fails, not update the journal super block, to
prevent the journalled contents from being cleaned
4. when checkpointing fails, not clear the needs_recovery flag,
otherwise the journalled contents are ignored and cleaned in the
recovery phase
5. if the recovery fails, keep the needs_recovery flag

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/ext3/ioctl.c | 12 ++++----
fs/ext3/super.c | 13 +++++++--
fs/jbd/checkpoint.c | 59 ++++++++++++++++++++++++++++++++++--------
fs/jbd/journal.c | 21 +++++++++-----
fs/jbd/recovery.c | 7 +++-
include/linux/jbd.h | 7 ++++
6 files changed, 91 insertions(+), 28 deletions(-)

Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
int ret = 0;
struct buffer_head *bh = jh2bh(jh);

- if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+ if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+ !buffer_dirty(bh) && buffer_uptodate(bh)) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
ret = __journal_remove_checkpoint(jh) + 1;
jbd_unlock_bh_state(bh);
@@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
}

/*
+ * We couldn't write back some metadata buffers to the filesystem.
+ * To avoid unwritten-back metadata buffers from losing, set
+ * JFS_CP_ABORT flag and make sure that the journal space isn't
+ * cleaned. This function also aborts journaling.
+ */
+static void __journal_abort_checkpoint(journal_t *journal, int errno)
+{
+ if (is_checkpoint_aborted(journal))
+ return;
+
+ spin_lock(&journal->j_state_lock);
+ journal->j_flags |= JFS_CP_ABORT;
+ spin_unlock(&journal->j_state_lock);
+ printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
+ "are still old.\n");
+ journal_abort(journal, errno);
+}
+
+/*
* We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
* The caller must restart a list walk. Wait for someone else to run
* jbd_unlock_bh_state().
@@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
* buffers. Note that we take the buffers in the opposite ordering
* from the one in which they were submitted for IO.
*
+ * Return 0 on success, and return <0 if some buffers have failed
+ * to be written out.
+ *
* Called with j_list_lock held.
*/
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
{
struct journal_head *jh;
struct buffer_head *bh;
tid_t this_tid;
int released = 0;
+ int ret = 0;

this_tid = transaction->t_tid;
restart:
/* Did somebody clean up the transaction in the meanwhile? */
if (journal->j_checkpoint_transactions != transaction ||
transaction->t_tid != this_tid)
- return;
+ return ret;
while (!released && transaction->t_checkpoint_io_list) {
jh = transaction->t_checkpoint_io_list;
bh = jh2bh(jh);
@@ -194,6 +218,9 @@ restart:
spin_lock(&journal->j_list_lock);
goto restart;
}
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+
/*
* Now in whatever state the buffer currently is, we know that
* it has been written out and so we can drop it from the list
@@ -203,6 +230,8 @@ restart:
journal_remove_journal_head(bh);
__brelse(bh);
}
+
+ return ret;
}

#define NR_BATCH 64
@@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
* Try to flush one buffer from the checkpoint list to disk.
*
* Return 1 if something happened which requires us to abort the current
- * scan of the checkpoint list.
+ * scan of the checkpoint list. Return <0 if the buffer has failed to
+ * be written out.
*
* Called with j_list_lock held and drops it if 1 is returned
* Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
@@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
log_wait_commit(journal, tid);
ret = 1;
} else if (!buffer_dirty(bh)) {
+ ret = 1;
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
J_ASSERT_JH(jh, !buffer_jbddirty(bh));
BUFFER_TRACE(bh, "remove from checkpoint");
__journal_remove_checkpoint(jh);
@@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
__brelse(bh);
- ret = 1;
} else {
/*
* Important: we are about to write the buffer, and
@@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
* OK, we need to start writing disk blocks. Take one transaction
* and write it.
*/
+ result = 0;
spin_lock(&journal->j_list_lock);
if (!journal->j_checkpoint_transactions)
goto out;
@@ -334,7 +367,7 @@ restart:
int batch_count = 0;
struct buffer_head *bhs[NR_BATCH];
struct journal_head *jh;
- int retry = 0;
+ int retry = 0, err;

while (!retry && transaction->t_checkpoint_list) {
struct buffer_head *bh;
@@ -347,6 +380,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs,&batch_count);
+ if (retry < 0)
+ result = retry;
if (!retry && (need_resched() ||
spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
@@ -371,14 +406,18 @@ restart:
* Now we have cleaned up the first transaction's checkpoint
* list. Let's clean up the second one
*/
- __wait_cp_io(journal, transaction);
+ err = __wait_cp_io(journal, transaction);
+ if (!result)
+ result = err;
}
out:
spin_unlock(&journal->j_list_lock);
- result = cleanup_journal_tail(journal);
if (result < 0)
- return result;
- return 0;
+ __journal_abort_checkpoint(journal, result);
+ else
+ result = cleanup_journal_tail(journal);
+
+ return (result < 0) ? result : 0;
}

/*
Index: linux-2.6.26-rc2/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc2.orig/include/linux/jbd.h
+++ linux-2.6.26-rc2/include/linux/jbd.h
@@ -816,6 +816,8 @@ struct journal_s
#define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
#define JFS_LOADED 0x010 /* The journal superblock has been loaded */
#define JFS_BARRIER 0x020 /* Use IDE barriers */
+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
+ * clean the journal space. */

/*
* Function declarations for the journaling transaction and buffer
@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
return journal->j_flags & JFS_ABORT;
}

+static inline int is_checkpoint_aborted(journal_t *journal)
+{
+ return journal->j_flags & JFS_CP_ABORT;
+}
+
static inline int is_handle_aborted(handle_t *handle)
{
if (handle->h_aborted)
Index: linux-2.6.26-rc2/fs/ext3/ioctl.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c
+++ linux-2.6.26-rc2/fs/ext3/ioctl.c
@@ -239,7 +239,7 @@ setrsvsz_out:
case EXT3_IOC_GROUP_EXTEND: {
ext3_fsblk_t n_blocks_count;
struct super_block *sb = inode->i_sb;
- int err;
+ int err, err2;

if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -254,16 +254,16 @@ setrsvsz_out:
}
err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
journal_lock_updates(EXT3_SB(sb)->s_journal);
- journal_flush(EXT3_SB(sb)->s_journal);
+ err2 = journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_extend_out:
mnt_drop_write(filp->f_path.mnt);
- return err;
+ return (err) ? err : err2;
}
case EXT3_IOC_GROUP_ADD: {
struct ext3_new_group_data input;
struct super_block *sb = inode->i_sb;
- int err;
+ int err, err2;

if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -280,11 +280,11 @@ group_extend_out:

err = ext3_group_add(sb, &input);
journal_lock_updates(EXT3_SB(sb)->s_journal);
- journal_flush(EXT3_SB(sb)->s_journal);
+ err2 = journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_add_out:
mnt_drop_write(filp->f_path.mnt);
- return err;
+ return (err) ? err : err2;
}


Index: linux-2.6.26-rc2/fs/ext3/super.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/ext3/super.c
+++ linux-2.6.26-rc2/fs/ext3/super.c
@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
ext3_xattr_put_super(sb);
journal_destroy(sbi->s_journal);
if (!(sb->s_flags & MS_RDONLY)) {
- EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
+ if (!is_checkpoint_aborted(sbi->s_journal)) {
+ EXT3_CLEAR_INCOMPAT_FEATURE(sb,
+ EXT3_FEATURE_INCOMPAT_RECOVER);
+ }
es->s_state = cpu_to_le16(sbi->s_mount_state);
BUFFER_TRACE(sbi->s_sbh, "marking dirty");
mark_buffer_dirty(sbi->s_sbh);
@@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru

/* Now we set up the journal barrier. */
journal_lock_updates(journal);
- journal_flush(journal);
+
+ /*
+ * We don't want to clear needs_recovery flag when we failed
+ * to flush the journal.
+ */
+ if (journal_flush(journal) < 0)
+ return;

/* Journal blocked and flushed, clear needs_recovery flag. */
EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
Index: linux-2.6.26-rc2/fs/jbd/journal.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/journal.c
+++ linux-2.6.26-rc2/fs/jbd/journal.c
@@ -674,7 +674,7 @@ static journal_t * journal_init_common (
journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);

/* The journal is marked for error until we succeed with recovery! */
- journal->j_flags = JFS_ABORT;
+ journal->j_flags = JFS_ABORT | JFS_CP_ABORT;

/* Set up a default-sized revoke table for the new mount. */
err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
@@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
if (journal_reset(journal))
goto recovery_error;

- journal->j_flags &= ~JFS_ABORT;
+ journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
journal->j_flags |= JFS_LOADED;
return 0;

@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
journal->j_tail = 0;
journal->j_tail_sequence = ++journal->j_transaction_sequence;
if (journal->j_sb_buffer) {
- journal_update_superblock(journal, 1);
+ if (!is_checkpoint_aborted(journal))
+ journal_update_superblock(journal, 1);
brelse(journal->j_sb_buffer);
}

@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1

int journal_flush(journal_t *journal)
{
- int err = 0;
transaction_t *transaction = NULL;
unsigned long old_tail;

@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
spin_unlock(&journal->j_state_lock);
}

- /* ...and flush everything in the log out to disk. */
+ /* ...and flush everything in the log out to disk.
+ * Even if an error occurs, we don't stop this loop.
+ * We do checkpoint as much as possible. */
spin_lock(&journal->j_list_lock);
- while (!err && journal->j_checkpoint_transactions != NULL) {
+ while (journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
- err = log_do_checkpoint(journal);
+ log_do_checkpoint(journal);
spin_lock(&journal->j_list_lock);
}
spin_unlock(&journal->j_list_lock);
+ if (is_checkpoint_aborted(journal))
+ return -EIO;
+
cleanup_journal_tail(journal);

/* Finally, mark the journal as really needing no recovery.
@@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
J_ASSERT(journal->j_head == journal->j_tail);
J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
spin_unlock(&journal->j_state_lock);
- return err;
+ return 0;
}

/**
Index: linux-2.6.26-rc2/fs/jbd/recovery.c
===================================================================
--- linux-2.6.26-rc2.orig/fs/jbd/recovery.c
+++ linux-2.6.26-rc2/fs/jbd/recovery.c
@@ -223,7 +223,7 @@ do { \
*/
int journal_recover(journal_t *journal)
{
- int err;
+ int err, err2;
journal_superblock_t * sb;

struct recovery_info info;
@@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
journal->j_transaction_sequence = ++info.end_transaction;

journal_clear_revoke(journal);
- sync_blockdev(journal->j_fs_dev);
+ err2 = sync_blockdev(journal->j_fs_dev);
+ if (!err)
+ err = err2;
+
return err;
}


2008-05-14 12:56:15

by Jan Kara

Subject: Re: [PATCH 1/4] jbd: strictly check for write errors on data buffers (rebased)

Hi,

On Wed 14-05-08 13:47:31, Hidehiro Kawai wrote:
> Subject: [PATCH 1/4] jbd: strictly check for write errors on data buffers
>
> In ordered mode, we should abort journaling when an I/O error has
> occurred on a file data buffer in the committing transaction. But
> there can be data buffers which are not checked for error:
>
> (a) the buffer which has already been written out by pdflush
> (b) the buffer which has been unlocked before scanned in the
> t_locked_list loop
>
> This patch adds missing error checks and aborts journaling
> appropriately.
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Jan Kara <[email protected]>

Honza
> ---
> fs/jbd/commit.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> Index: linux-2.6.26-rc2/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/commit.c
> +++ linux-2.6.26-rc2/fs/jbd/commit.c
> @@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
> /*
> * Submit all the data buffers to disk
> */
> -static void journal_submit_data_buffers(journal_t *journal,
> +static int journal_submit_data_buffers(journal_t *journal,
> transaction_t *commit_transaction)
> {
> struct journal_head *jh;
> @@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
> int locked;
> int bufs = 0;
> struct buffer_head **wbuf = journal->j_wbuf;
> + int err = 0;
>
> /*
> * Whenever we unlock the journal and sleep, things can get added
> @@ -253,6 +254,8 @@ write_out_data:
> put_bh(bh);
> } else {
> BUFFER_TRACE(bh, "writeout complete: unfile");
> + if (unlikely(!buffer_uptodate(bh)))
> + err = -EIO;
> __journal_unfile_buffer(jh);
> jbd_unlock_bh_state(bh);
> if (locked)
> @@ -271,6 +274,8 @@ write_out_data:
> }
> spin_unlock(&journal->j_list_lock);
> journal_do_submit_data(wbuf, bufs);
> +
> + return err;
> }
>
> /*
> @@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
> * Now start flushing things to disk, in the order they appear
> * on the transaction lists. Data blocks go first.
> */
> - err = 0;
> - journal_submit_data_buffers(journal, commit_transaction);
> + err = journal_submit_data_buffers(journal, commit_transaction);
>
> /*
> * Wait for all previously submitted IO to complete.
> @@ -426,10 +430,10 @@ void journal_commit_transaction(journal_
> if (buffer_locked(bh)) {
> spin_unlock(&journal->j_list_lock);
> wait_on_buffer(bh);
> - if (unlikely(!buffer_uptodate(bh)))
> - err = -EIO;
> spin_lock(&journal->j_list_lock);
> }
> + if (unlikely(!buffer_uptodate(bh)))
> + err = -EIO;
> if (!inverted_lock(journal, bh)) {
> put_bh(bh);
> spin_lock(&journal->j_list_lock);
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-14 13:10:09

by Jan Kara

Subject: Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased)

On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote:
> Subject: [PATCH 2/4] jbd: ordered data integrity fix
>
> In ordered mode, if a buffer being dirtied exists in the committing
> transaction, we write the buffer to the disk, move it from the
> committing transaction to the running transaction, then dirty it.
> But we don't have to remove the buffer from the committing
> transaction when the buffer couldn't be written out, otherwise it
> breaks the ordered mode rule.
Hmm, could you elaborate a bit more what exactly is broken and how does
this help to fix it? Because even if we find EIO happened on data buffer,
we currently don't do anything else than just remove the buffer from the
transaction and abort the journal. And even if we later managed to write
the data buffer from other process before the journal is aborted, ordered
mode guarantees are satisfied - we only guarantee that too old data cannot
be seen, newer can be seen easily... Thanks.

Honza

>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> ---
> fs/jbd/transaction.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.26-rc2/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/transaction.c
> +++ linux-2.6.26-rc2/fs/jbd/transaction.c
> @@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
> journal_t *journal = handle->h_transaction->t_journal;
> int need_brelse = 0;
> struct journal_head *jh;
> + int ret = 0;
>
> if (is_handle_aborted(handle))
> - return 0;
> + return ret;
>
> jh = journal_add_journal_head(bh);
> JBUFFER_TRACE(jh, "entry");
> @@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
> time if it is redirtied */
> }
>
> - /* journal_clean_data_list() may have got there first */
> + /*
> + * We shouldn't remove the buffer from the committing
> + * transaction if it has failed to be written.
> + * Otherwise, it breaks the ordered mode rule.
> + */
> + if (unlikely(!buffer_uptodate(bh))) {
> + ret = -EIO;
> + goto no_journal;
> + }
> +
> if (jh->b_transaction != NULL) {
> JBUFFER_TRACE(jh, "unfile from commit");
> __journal_temp_unlink_buffer(jh);
> @@ -1108,7 +1118,7 @@ no_journal:
> }
> JBUFFER_TRACE(jh, "exit");
> journal_put_journal_head(jh);
> - return 0;
> + return ret;
> }
>
> /**
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-14 13:15:08

by Jan Kara

Subject: Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)

On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
> Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
>
> If we failed to write metadata buffers to the journal space and
> succeeded to write the commit record, stale data can be written
> back to the filesystem as metadata in the recovery phase.
>
> To avoid this, when we failed to write out metadata buffers,
> abort the journal before writing the commit record.
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> ---
> fs/jbd/commit.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux-2.6.26-rc2/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/commit.c
> +++ linux-2.6.26-rc2/fs/jbd/commit.c
> @@ -703,6 +703,9 @@ wait_for_iobuf:
> __brelse(bh);
> }
>
> + if (err)
> + journal_abort(journal, err);
> +
> J_ASSERT (commit_transaction->t_shadow_list == NULL);
Shouldn't this rather be further just before
journal_write_commit_record()? We should abort also if writing revoke
records etc. failed, shouldn't we?

> jbd_debug(3, "JBD: commit phase 5\n");
>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-14 13:32:39

by Josef Bacik

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

On Wed, May 14, 2008 at 01:50:43PM +0900, Hidehiro Kawai wrote:
> Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io
>
> When a checkpointing IO fails, current JBD code doesn't check the
> error and continue journaling. This means latest metadata can be
> lost from both the journal and filesystem.
>
> This patch leaves the failed metadata blocks in the journal space
> and aborts journaling in the case of log_do_checkpoint().
> To achieve this, we need to do:
>
> 1. don't remove the failed buffer from the checkpoint list where in
> the case of __try_to_free_cp_buf() because it may be released or
> overwritten by a later transaction
> 2. log_do_checkpoint() is the last chance, remove the failed buffer
> from the checkpoint list and abort journaling
> 3. when checkpointing fails, don't update the journal super block to
> prevent the journalled contents from being cleaned
> 4. when checkpointing fails, don't clear the needs_recovery flag,
> otherwise the journalled contents are ignored and cleaned in the
> recovery phase
> 5. if the recovery fails, keep the needs_recovery flag
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> ---
> fs/ext3/ioctl.c | 12 ++++----
> fs/ext3/super.c | 13 +++++++--
> fs/jbd/checkpoint.c | 59 ++++++++++++++++++++++++++++++++++--------
> fs/jbd/journal.c | 21 +++++++++-----
> fs/jbd/recovery.c | 7 +++-
> include/linux/jbd.h | 7 ++++
> 6 files changed, 91 insertions(+), 28 deletions(-)
>
> Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
> +++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
> @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
> int ret = 0;
> struct buffer_head *bh = jh2bh(jh);
>
> - if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
> + if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
> + !buffer_dirty(bh) && buffer_uptodate(bh)) {
> JBUFFER_TRACE(jh, "remove from checkpoint list");
> ret = __journal_remove_checkpoint(jh) + 1;
> jbd_unlock_bh_state(bh);
> @@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
> }
>
> /*
> + * We couldn't write back some metadata buffers to the filesystem.
> + * To avoid unwritten-back metadata buffers from losing, set
> + * JFS_CP_ABORT flag and make sure that the journal space isn't
> + * cleaned. This function also aborts journaling.
> + */
> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
> +{
> + if (is_checkpoint_aborted(journal))
> + return;
> +
> + spin_lock(&journal->j_state_lock);
> + journal->j_flags |= JFS_CP_ABORT;
> + spin_unlock(&journal->j_state_lock);
> + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
> + "are still old.\n");
> + journal_abort(journal, errno);
> +}
> +
> +/*
> * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
> * The caller must restart a list walk. Wait for someone else to run
> * jbd_unlock_bh_state().
> @@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
> * buffers. Note that we take the buffers in the opposite ordering
> * from the one in which they were submitted for IO.
> *
> + * Return 0 on success, and return <0 if some buffers have failed
> + * to be written out.
> + *
> * Called with j_list_lock held.
> */
> -static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
> +static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
> {
> struct journal_head *jh;
> struct buffer_head *bh;
> tid_t this_tid;
> int released = 0;
> + int ret = 0;
>
> this_tid = transaction->t_tid;
> restart:
> /* Did somebody clean up the transaction in the meanwhile? */
> if (journal->j_checkpoint_transactions != transaction ||
> transaction->t_tid != this_tid)
> - return;
> + return ret;
> while (!released && transaction->t_checkpoint_io_list) {
> jh = transaction->t_checkpoint_io_list;
> bh = jh2bh(jh);
> @@ -194,6 +218,9 @@ restart:
> spin_lock(&journal->j_list_lock);
> goto restart;
> }
> + if (unlikely(!buffer_uptodate(bh)))
> + ret = -EIO;
> +
> /*
> * Now in whatever state the buffer currently is, we know that
> * it has been written out and so we can drop it from the list
> @@ -203,6 +230,8 @@ restart:
> journal_remove_journal_head(bh);
> __brelse(bh);
> }
> +
> + return ret;
> }
>
> #define NR_BATCH 64
> @@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
> * Try to flush one buffer from the checkpoint list to disk.
> *
> * Return 1 if something happened which requires us to abort the current
> - * scan of the checkpoint list.
> + * scan of the checkpoint list. Return <0 if the buffer has failed to
> + * be written out.
> *
> * Called with j_list_lock held and drops it if 1 is returned
> * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
> @@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
> log_wait_commit(journal, tid);
> ret = 1;
> } else if (!buffer_dirty(bh)) {
> + ret = 1;
> + if (unlikely(!buffer_uptodate(bh)))
> + ret = -EIO;
> J_ASSERT_JH(jh, !buffer_jbddirty(bh));
> BUFFER_TRACE(bh, "remove from checkpoint");
> __journal_remove_checkpoint(jh);
> @@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
> jbd_unlock_bh_state(bh);
> journal_remove_journal_head(bh);
> __brelse(bh);
> - ret = 1;
> } else {
> /*
> * Important: we are about to write the buffer, and
> @@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
> * OK, we need to start writing disk blocks. Take one transaction
> * and write it.
> */
> + result = 0;
> spin_lock(&journal->j_list_lock);
> if (!journal->j_checkpoint_transactions)
> goto out;
> @@ -334,7 +367,7 @@ restart:
> int batch_count = 0;
> struct buffer_head *bhs[NR_BATCH];
> struct journal_head *jh;
> - int retry = 0;
> + int retry = 0, err;
>
> while (!retry && transaction->t_checkpoint_list) {
> struct buffer_head *bh;
> @@ -347,6 +380,8 @@ restart:
> break;
> }
> retry = __process_buffer(journal, jh, bhs,&batch_count);
> + if (retry < 0)
> + result = retry;
> if (!retry && (need_resched() ||
> spin_needbreak(&journal->j_list_lock))) {
> spin_unlock(&journal->j_list_lock);
> @@ -371,14 +406,18 @@ restart:
> * Now we have cleaned up the first transaction's checkpoint
> * list. Let's clean up the second one
> */
> - __wait_cp_io(journal, transaction);
> + err = __wait_cp_io(journal, transaction);
> + if (!result)
> + result = err;
> }
> out:
> spin_unlock(&journal->j_list_lock);
> - result = cleanup_journal_tail(journal);
> if (result < 0)
> - return result;
> - return 0;
> + __journal_abort_checkpoint(journal, result);
> + else
> + result = cleanup_journal_tail(journal);
> +
> + return (result < 0) ? result : 0;
> }
>
> /*
> Index: linux-2.6.26-rc2/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.26-rc2.orig/include/linux/jbd.h
> +++ linux-2.6.26-rc2/include/linux/jbd.h
> @@ -816,6 +816,8 @@ struct journal_s
> #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
> #define JFS_LOADED 0x010 /* The journal superblock has been loaded */
> #define JFS_BARRIER 0x020 /* Use IDE barriers */
> +#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
> + * clean the journal space. */
>
> /*
> * Function declarations for the journaling transaction and buffer
> @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
> return journal->j_flags & JFS_ABORT;
> }
>
> +static inline int is_checkpoint_aborted(journal_t *journal)
> +{
> + return journal->j_flags & JFS_CP_ABORT;
> +}
> +
> static inline int is_handle_aborted(handle_t *handle)
> {
> if (handle->h_aborted)
> Index: linux-2.6.26-rc2/fs/ext3/ioctl.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c
> +++ linux-2.6.26-rc2/fs/ext3/ioctl.c
> @@ -239,7 +239,7 @@ setrsvsz_out:
> case EXT3_IOC_GROUP_EXTEND: {
> ext3_fsblk_t n_blocks_count;
> struct super_block *sb = inode->i_sb;
> - int err;
> + int err, err2;
>
> if (!capable(CAP_SYS_RESOURCE))
> return -EPERM;
> @@ -254,16 +254,16 @@ setrsvsz_out:
> }
> err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
> journal_lock_updates(EXT3_SB(sb)->s_journal);
> - journal_flush(EXT3_SB(sb)->s_journal);
> + err2 = journal_flush(EXT3_SB(sb)->s_journal);
> journal_unlock_updates(EXT3_SB(sb)->s_journal);
> group_extend_out:
> mnt_drop_write(filp->f_path.mnt);
> - return err;
> + return (err) ? err : err2;
> }
> case EXT3_IOC_GROUP_ADD: {
> struct ext3_new_group_data input;
> struct super_block *sb = inode->i_sb;
> - int err;
> + int err, err2;
>
> if (!capable(CAP_SYS_RESOURCE))
> return -EPERM;
> @@ -280,11 +280,11 @@ group_extend_out:
>
> err = ext3_group_add(sb, &input);
> journal_lock_updates(EXT3_SB(sb)->s_journal);
> - journal_flush(EXT3_SB(sb)->s_journal);
> + err2 = journal_flush(EXT3_SB(sb)->s_journal);
> journal_unlock_updates(EXT3_SB(sb)->s_journal);
> group_add_out:
> mnt_drop_write(filp->f_path.mnt);
> - return err;
> + return (err) ? err : err2;
> }
>
>
> Index: linux-2.6.26-rc2/fs/ext3/super.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> +++ linux-2.6.26-rc2/fs/ext3/super.c
> @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> ext3_xattr_put_super(sb);
> journal_destroy(sbi->s_journal);
> if (!(sb->s_flags & MS_RDONLY)) {
> - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> + if (!is_checkpoint_aborted(sbi->s_journal)) {
> + EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> + EXT3_FEATURE_INCOMPAT_RECOVER);
> + }
> es->s_state = cpu_to_le16(sbi->s_mount_state);
> BUFFER_TRACE(sbi->s_sbh, "marking dirty");
> mark_buffer_dirty(sbi->s_sbh);

Is this bit here really needed? If we abort the journal the fs will be mounted
read only and we should never get in here. Is there a case where we could abort
the journal and not be flipped to being read-only? If there is such a case I
would think that we should fix that by making the fs read-only, and not have
this check.

> @@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru
>
> /* Now we set up the journal barrier. */
> journal_lock_updates(journal);
> - journal_flush(journal);
> +
> + /*
> + * We don't want to clear needs_recovery flag when we failed
> + * to flush the journal.
> + */
> + if (journal_flush(journal) < 0)
> + return;
>
> /* Journal blocked and flushed, clear needs_recovery flag. */
> EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> Index: linux-2.6.26-rc2/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/journal.c
> +++ linux-2.6.26-rc2/fs/jbd/journal.c
> @@ -674,7 +674,7 @@ static journal_t * journal_init_common (
> journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
>
> /* The journal is marked for error until we succeed with recovery! */
> - journal->j_flags = JFS_ABORT;
> + journal->j_flags = JFS_ABORT | JFS_CP_ABORT;
>
> /* Set up a default-sized revoke table for the new mount. */
> err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> @@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
> if (journal_reset(journal))
> goto recovery_error;
>
> - journal->j_flags &= ~JFS_ABORT;
> + journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
> journal->j_flags |= JFS_LOADED;
> return 0;
>
> @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
> journal->j_tail = 0;
> journal->j_tail_sequence = ++journal->j_transaction_sequence;
> if (journal->j_sb_buffer) {
> - journal_update_superblock(journal, 1);
> + if (!is_checkpoint_aborted(journal))
> + journal_update_superblock(journal, 1);
> brelse(journal->j_sb_buffer);
> }
>
> @@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
>
> int journal_flush(journal_t *journal)
> {
> - int err = 0;
> transaction_t *transaction = NULL;
> unsigned long old_tail;
>
> @@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
> spin_unlock(&journal->j_state_lock);
> }
>
> - /* ...and flush everything in the log out to disk. */
> + /* ...and flush everything in the log out to disk.
> + * Even if an error occurs, we don't stop this loop.
> + * We do checkpoint as much as possible. */
> spin_lock(&journal->j_list_lock);
> - while (!err && journal->j_checkpoint_transactions != NULL) {
> + while (journal->j_checkpoint_transactions != NULL) {
> spin_unlock(&journal->j_list_lock);
> - err = log_do_checkpoint(journal);
> + log_do_checkpoint(journal);
> spin_lock(&journal->j_list_lock);
> }
> spin_unlock(&journal->j_list_lock);
> + if (is_checkpoint_aborted(journal))
> + return -EIO;
> +
> cleanup_journal_tail(journal);
>
> /* Finally, mark the journal as really needing no recovery.
> @@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
> J_ASSERT(journal->j_head == journal->j_tail);
> J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
> spin_unlock(&journal->j_state_lock);
> - return err;
> + return 0;
> }
>
> /**
> Index: linux-2.6.26-rc2/fs/jbd/recovery.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/recovery.c
> +++ linux-2.6.26-rc2/fs/jbd/recovery.c
> @@ -223,7 +223,7 @@ do { \
> */
> int journal_recover(journal_t *journal)
> {
> - int err;
> + int err, err2;
> journal_superblock_t * sb;
>
> struct recovery_info info;
> @@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
> journal->j_transaction_sequence = ++info.end_transaction;
>
> journal_clear_revoke(journal);
> - sync_blockdev(journal->j_fs_dev);
> + err2 = sync_blockdev(journal->j_fs_dev);
> + if (!err)
> + err = err2;
> +
> return err;
> }
>
>
>

Other than that one question it looks good to me. Thanks much,

Josef

2008-05-14 14:32:19

by Jan Kara

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Hello,

On Wed 14-05-08 13:50:43, Hidehiro Kawai wrote:
> Subject: [PATCH 4/4] jbd: fix error handling for checkpoint io
>
> When a checkpointing IO fails, current JBD code doesn't check the
> error and continue journaling. This means latest metadata can be
> lost from both the journal and filesystem.
>
> This patch leaves the failed metadata blocks in the journal space
> and aborts journaling in the case of log_do_checkpoint().
> To achieve this, we need to do:
>
> 1. don't remove the failed buffer from the checkpoint list where in
> the case of __try_to_free_cp_buf() because it may be released or
> overwritten by a later transaction
> 2. log_do_checkpoint() is the last chance, remove the failed buffer
> from the checkpoint list and abort journaling
> 3. when checkpointing fails, don't update the journal super block to
> prevent the journalled contents from being cleaned
> 4. when checkpointing fails, don't clear the needs_recovery flag,
> otherwise the journalled contents are ignored and cleaned in the
> recovery phase
> 5. if the recovery fails, keep the needs_recovery flag
I have a few comments below..

> Signed-off-by: Hidehiro Kawai <[email protected]>
> ---
> fs/ext3/ioctl.c | 12 ++++----
> fs/ext3/super.c | 13 +++++++--
> fs/jbd/checkpoint.c | 59 ++++++++++++++++++++++++++++++++++--------
> fs/jbd/journal.c | 21 +++++++++-----
> fs/jbd/recovery.c | 7 +++-
> include/linux/jbd.h | 7 ++++
> 6 files changed, 91 insertions(+), 28 deletions(-)
>
> Index: linux-2.6.26-rc2/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/checkpoint.c
> +++ linux-2.6.26-rc2/fs/jbd/checkpoint.c
> @@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
> int ret = 0;
> struct buffer_head *bh = jh2bh(jh);
>
> - if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
> + if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
> + !buffer_dirty(bh) && buffer_uptodate(bh)) {
> JBUFFER_TRACE(jh, "remove from checkpoint list");
> ret = __journal_remove_checkpoint(jh) + 1;
> jbd_unlock_bh_state(bh);
> @@ -140,6 +141,25 @@ void __log_wait_for_space(journal_t *jou
> }
>
> /*
> + * We couldn't write back some metadata buffers to the filesystem.
> + * To avoid unwritten-back metadata buffers from losing, set
> + * JFS_CP_ABORT flag and make sure that the journal space isn't
> + * cleaned. This function also aborts journaling.
> + */
> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
> +{
> + if (is_checkpoint_aborted(journal))
> + return;
> +
> + spin_lock(&journal->j_state_lock);
> + journal->j_flags |= JFS_CP_ABORT;
> + spin_unlock(&journal->j_state_lock);
> + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
> + "are still old.\n");
> + journal_abort(journal, errno);
> +}
> +
> +/*
> * We were unable to perform jbd_trylock_bh_state() inside j_list_lock.
> * The caller must restart a list walk. Wait for someone else to run
> * jbd_unlock_bh_state().
> @@ -160,21 +180,25 @@ static void jbd_sync_bh(journal_t *journ
> * buffers. Note that we take the buffers in the opposite ordering
> * from the one in which they were submitted for IO.
> *
> + * Return 0 on success, and return <0 if some buffers have failed
> + * to be written out.
> + *
> * Called with j_list_lock held.
> */
> -static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
> +static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
> {
> struct journal_head *jh;
> struct buffer_head *bh;
> tid_t this_tid;
> int released = 0;
> + int ret = 0;
>
> this_tid = transaction->t_tid;
> restart:
> /* Did somebody clean up the transaction in the meanwhile? */
> if (journal->j_checkpoint_transactions != transaction ||
> transaction->t_tid != this_tid)
> - return;
> + return ret;
> while (!released && transaction->t_checkpoint_io_list) {
> jh = transaction->t_checkpoint_io_list;
> bh = jh2bh(jh);
> @@ -194,6 +218,9 @@ restart:
> spin_lock(&journal->j_list_lock);
> goto restart;
> }
> + if (unlikely(!buffer_uptodate(bh)))
> + ret = -EIO;
> +
> /*
> * Now in whatever state the buffer currently is, we know that
> * it has been written out and so we can drop it from the list
> @@ -203,6 +230,8 @@ restart:
> journal_remove_journal_head(bh);
> __brelse(bh);
> }
> +
> + return ret;
> }
>
> #define NR_BATCH 64
> @@ -226,7 +255,8 @@ __flush_batch(journal_t *journal, struct
> * Try to flush one buffer from the checkpoint list to disk.
> *
> * Return 1 if something happened which requires us to abort the current
> - * scan of the checkpoint list.
> + * scan of the checkpoint list. Return <0 if the buffer has failed to
> + * be written out.
> *
> * Called with j_list_lock held and drops it if 1 is returned
> * Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
> @@ -256,6 +286,9 @@ static int __process_buffer(journal_t *j
> log_wait_commit(journal, tid);
> ret = 1;
> } else if (!buffer_dirty(bh)) {
> + ret = 1;
> + if (unlikely(!buffer_uptodate(bh)))
> + ret = -EIO;
> J_ASSERT_JH(jh, !buffer_jbddirty(bh));
> BUFFER_TRACE(bh, "remove from checkpoint");
> __journal_remove_checkpoint(jh);
> @@ -263,7 +296,6 @@ static int __process_buffer(journal_t *j
> jbd_unlock_bh_state(bh);
> journal_remove_journal_head(bh);
> __brelse(bh);
> - ret = 1;
> } else {
> /*
> * Important: we are about to write the buffer, and
> @@ -318,6 +350,7 @@ int log_do_checkpoint(journal_t *journal
> * OK, we need to start writing disk blocks. Take one transaction
> * and write it.
> */
> + result = 0;
> spin_lock(&journal->j_list_lock);
> if (!journal->j_checkpoint_transactions)
> goto out;
> @@ -334,7 +367,7 @@ restart:
> int batch_count = 0;
> struct buffer_head *bhs[NR_BATCH];
> struct journal_head *jh;
> - int retry = 0;
> + int retry = 0, err;
>
> while (!retry && transaction->t_checkpoint_list) {
> struct buffer_head *bh;
> @@ -347,6 +380,8 @@ restart:
> break;
> }
> retry = __process_buffer(journal, jh, bhs,&batch_count);
> + if (retry < 0)
> + result = retry;
> if (!retry && (need_resched() ||
> spin_needbreak(&journal->j_list_lock))) {
> spin_unlock(&journal->j_list_lock);
> @@ -371,14 +406,18 @@ restart:
> * Now we have cleaned up the first transaction's checkpoint
> * list. Let's clean up the second one
> */
> - __wait_cp_io(journal, transaction);
> + err = __wait_cp_io(journal, transaction);
> + if (!result)
> + result = err;
> }
> out:
> spin_unlock(&journal->j_list_lock);
> - result = cleanup_journal_tail(journal);
> if (result < 0)
> - return result;
> - return 0;
> + __journal_abort_checkpoint(journal, result);
> + else
> + result = cleanup_journal_tail(journal);
> +
> + return (result < 0) ? result : 0;
> }
>
> /*
> Index: linux-2.6.26-rc2/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.26-rc2.orig/include/linux/jbd.h
> +++ linux-2.6.26-rc2/include/linux/jbd.h
> @@ -816,6 +816,8 @@ struct journal_s
> #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
> #define JFS_LOADED 0x010 /* The journal superblock has been loaded */
> #define JFS_BARRIER 0x020 /* Use IDE barriers */
> +#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
> + * clean the journal space. */
>
> /*
> * Function declarations for the journaling transaction and buffer
> @@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
> return journal->j_flags & JFS_ABORT;
> }
>
> +static inline int is_checkpoint_aborted(journal_t *journal)
> +{
> + return journal->j_flags & JFS_CP_ABORT;
> +}
> +
Actually, I don't think this new flag is needed (not that it would really
harm anything). At least in the places where you check it, you could just
as well check whether the journal is aborted (except perhaps in
journal_destroy(), but see my comment there).

> static inline int is_handle_aborted(handle_t *handle)
> {
> if (handle->h_aborted)
> Index: linux-2.6.26-rc2/fs/ext3/ioctl.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/ioctl.c
> +++ linux-2.6.26-rc2/fs/ext3/ioctl.c
> @@ -239,7 +239,7 @@ setrsvsz_out:
> case EXT3_IOC_GROUP_EXTEND: {
> ext3_fsblk_t n_blocks_count;
> struct super_block *sb = inode->i_sb;
> - int err;
> + int err, err2;
>
> if (!capable(CAP_SYS_RESOURCE))
> return -EPERM;
> @@ -254,16 +254,16 @@ setrsvsz_out:
> }
> err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
> journal_lock_updates(EXT3_SB(sb)->s_journal);
> - journal_flush(EXT3_SB(sb)->s_journal);
> + err2 = journal_flush(EXT3_SB(sb)->s_journal);
> journal_unlock_updates(EXT3_SB(sb)->s_journal);
> group_extend_out:
> mnt_drop_write(filp->f_path.mnt);
> - return err;
> + return (err) ? err : err2;
> }
> case EXT3_IOC_GROUP_ADD: {
> struct ext3_new_group_data input;
> struct super_block *sb = inode->i_sb;
> - int err;
> + int err, err2;
>
> if (!capable(CAP_SYS_RESOURCE))
> return -EPERM;
> @@ -280,11 +280,11 @@ group_extend_out:
>
> err = ext3_group_add(sb, &input);
> journal_lock_updates(EXT3_SB(sb)->s_journal);
> - journal_flush(EXT3_SB(sb)->s_journal);
> + err2 = journal_flush(EXT3_SB(sb)->s_journal);
> journal_unlock_updates(EXT3_SB(sb)->s_journal);
> group_add_out:
> mnt_drop_write(filp->f_path.mnt);
> - return err;
> + return (err) ? err : err2;
> }
Please split the ext3 changes into a separate patch. There's no reason
for them to go in together with the JBD fixes (i.e., the JBD fixes don't
depend on the ext3 fixes). Thanks.

> Index: linux-2.6.26-rc2/fs/ext3/super.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> +++ linux-2.6.26-rc2/fs/ext3/super.c
> @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> ext3_xattr_put_super(sb);
> journal_destroy(sbi->s_journal);
> if (!(sb->s_flags & MS_RDONLY)) {
> - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> + if (!is_checkpoint_aborted(sbi->s_journal)) {
> + EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> + EXT3_FEATURE_INCOMPAT_RECOVER);
> + }
I think you should test here whether the journal is aborted (and
EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just
completely avoid updating super block...

> es->s_state = cpu_to_le16(sbi->s_mount_state);
> BUFFER_TRACE(sbi->s_sbh, "marking dirty");
> mark_buffer_dirty(sbi->s_sbh);
> @@ -2373,7 +2376,13 @@ static void ext3_write_super_lockfs(stru
>
> /* Now we set up the journal barrier. */
> journal_lock_updates(journal);
> - journal_flush(journal);
> +
> + /*
> + * We don't want to clear needs_recovery flag when we failed
> + * to flush the journal.
> + */
> + if (journal_flush(journal) < 0)
> + return;
>
> /* Journal blocked and flushed, clear needs_recovery flag. */
> EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> Index: linux-2.6.26-rc2/fs/jbd/journal.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/journal.c
> +++ linux-2.6.26-rc2/fs/jbd/journal.c
> @@ -674,7 +674,7 @@ static journal_t * journal_init_common (
> journal->j_commit_interval = (HZ * JBD_DEFAULT_MAX_COMMIT_AGE);
>
> /* The journal is marked for error until we succeed with recovery! */
> - journal->j_flags = JFS_ABORT;
> + journal->j_flags = JFS_ABORT | JFS_CP_ABORT;
>
> /* Set up a default-sized revoke table for the new mount. */
> err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
> @@ -1107,7 +1107,7 @@ int journal_load(journal_t *journal)
> if (journal_reset(journal))
> goto recovery_error;
>
> - journal->j_flags &= ~JFS_ABORT;
> + journal->j_flags &= ~(JFS_ABORT | JFS_CP_ABORT);
> journal->j_flags |= JFS_LOADED;
> return 0;
>
> @@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
> journal->j_tail = 0;
> journal->j_tail_sequence = ++journal->j_transaction_sequence;
> if (journal->j_sb_buffer) {
> - journal_update_superblock(journal, 1);
> + if (!is_checkpoint_aborted(journal))
> + journal_update_superblock(journal, 1);
> brelse(journal->j_sb_buffer);
> }
I don't like this much. I'd rather avoid updating j_tail and
j_tail_sequence above entirely when the journal is aborted, but still
write the journal superblock so that the information about the abort
gets recorded...

> @@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
>
> int journal_flush(journal_t *journal)
> {
> - int err = 0;
> transaction_t *transaction = NULL;
> unsigned long old_tail;
>
> @@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
> spin_unlock(&journal->j_state_lock);
> }
>
> - /* ...and flush everything in the log out to disk. */
> + /* ...and flush everything in the log out to disk.
> + * Even if an error occurs, we don't stop this loop.
> + * We do checkpoint as much as possible. */
> spin_lock(&journal->j_list_lock);
> - while (!err && journal->j_checkpoint_transactions != NULL) {
> + while (journal->j_checkpoint_transactions != NULL) {
> spin_unlock(&journal->j_list_lock);
> - err = log_do_checkpoint(journal);
> + log_do_checkpoint(journal);
> spin_lock(&journal->j_list_lock);
> }
> spin_unlock(&journal->j_list_lock);
> + if (is_checkpoint_aborted(journal))
> + return -EIO;
> +
Why do you change the loop? If we fail to checkpoint some buffer, we stop
journaling anyway, and the journal will be replayed when fsck is run...

> cleanup_journal_tail(journal);
>
> /* Finally, mark the journal as really needing no recovery.
> @@ -1385,7 +1390,7 @@ int journal_flush(journal_t *journal)
> J_ASSERT(journal->j_head == journal->j_tail);
> J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
> spin_unlock(&journal->j_state_lock);
> - return err;
> + return 0;
> }
>
> /**
> Index: linux-2.6.26-rc2/fs/jbd/recovery.c
> ===================================================================
> --- linux-2.6.26-rc2.orig/fs/jbd/recovery.c
> +++ linux-2.6.26-rc2/fs/jbd/recovery.c
> @@ -223,7 +223,7 @@ do { \
> */
> int journal_recover(journal_t *journal)
> {
> - int err;
> + int err, err2;
> journal_superblock_t * sb;
>
> struct recovery_info info;
> @@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
> journal->j_transaction_sequence = ++info.end_transaction;
>
> journal_clear_revoke(journal);
> - sync_blockdev(journal->j_fs_dev);
> + err2 = sync_blockdev(journal->j_fs_dev);
> + if (!err)
> + err = err2;
> +
> return err;
> }
>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-14 14:44:11

by Jan Kara

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

> >
> > Index: linux-2.6.26-rc2/fs/ext3/super.c
> > ===================================================================
> > --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> > +++ linux-2.6.26-rc2/fs/ext3/super.c
> > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> > ext3_xattr_put_super(sb);
> > journal_destroy(sbi->s_journal);
> > if (!(sb->s_flags & MS_RDONLY)) {
> > - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> > + if (!is_checkpoint_aborted(sbi->s_journal)) {
> > + EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> > + EXT3_FEATURE_INCOMPAT_RECOVER);
> > + }
> > es->s_state = cpu_to_le16(sbi->s_mount_state);
> > BUFFER_TRACE(sbi->s_sbh, "marking dirty");
> > mark_buffer_dirty(sbi->s_sbh);
>
> Is this bit here really needed? If we abort the journal the fs will be mounted
> read only and we should never get in here. Is there a case where we could abort
> the journal and not be flipped to being read-only? If there is such a case I
> would think that we should fix that by making the fs read-only, and not have
> this check.
Actually, journal_abort() (which could be called from journal_destroy())
does nothing to the filesystem as such. So at this moment, ext3 can still
happily think everything is OK. We only detect aborted journal in
ext3_journal_start_sb() and call ext3_abort() in that case, which does all
that is needed...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-14 14:54:03

by Josef Bacik

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

On Wed, May 14, 2008 at 04:44:10PM +0200, Jan Kara wrote:
> > >
> > > Index: linux-2.6.26-rc2/fs/ext3/super.c
> > > ===================================================================
> > > --- linux-2.6.26-rc2.orig/fs/ext3/super.c
> > > +++ linux-2.6.26-rc2/fs/ext3/super.c
> > > @@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> > > ext3_xattr_put_super(sb);
> > > journal_destroy(sbi->s_journal);
> > > if (!(sb->s_flags & MS_RDONLY)) {
> > > - EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> > > + if (!is_checkpoint_aborted(sbi->s_journal)) {
> > > + EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> > > + EXT3_FEATURE_INCOMPAT_RECOVER);
> > > + }
> > > es->s_state = cpu_to_le16(sbi->s_mount_state);
> > > BUFFER_TRACE(sbi->s_sbh, "marking dirty");
> > > mark_buffer_dirty(sbi->s_sbh);
> >
> > Is this bit here really needed? If we abort the journal the fs will be mounted
> > read only and we should never get in here. Is there a case where we could abort
> > the journal and not be flipped to being read-only? If there is such a case I
> > would think that we should fix that by making the fs read-only, and not have
> > this check.
> Actually, journal_abort() (which could be called from journal_destroy())
> does nothing to the filesystem as such. So at this moment, ext3 can still
> happily think everything is OK. We only detect aborted journal in
> ext3_journal_start_sb() and call ext3_abort() in that case, which does all
> that is needed...
>

Hmm, you're right, I was thinking we did some other stuff before
put_super that would have caught a journal abort, but it looks like
that's not the case. Still, we shouldn't call
is_checkpoint_aborted(sbi->s_journal) there, since journal_destroy()
kfrees the journal. We should probably move the is_journal_aborted()
check above journal_destroy(), or something. Thanks,

Josef

2008-05-16 10:25:56

by Hidehiro Kawai

Subject: Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased)

Hi,

Thanks for your comment.

Jan Kara wrote:

> On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote:
>
>>Subject: [PATCH 2/4] jbd: ordered data integrity fix
>>
>>In ordered mode, if a buffer being dirtied exists in the committing
>>transaction, we write the buffer to the disk, move it from the
>>committing transaction to the running transaction, then dirty it.
>>But we don't have to remove the buffer from the committing
>>transaction when the buffer couldn't be written out, otherwise it
>>breaks the ordered mode rule.
>
> Hmm, could you elaborate a bit more what exactly is broken and how does
> this help to fix it? Because even if we find EIO happened on data buffer,
> we currently don't do anything else than just remove the buffer from the
> transaction and abort the journal. And even if we later managed to write
> the data buffer from other process before the journal is aborted, ordered
> mode guarantees are satisfied - we only guarantee that too old data cannot
> be seen, newer can be seen easily... Thanks.

In the case I described above, error checking is postponed to the next
(currently running) transaction, because the buffer is removed from the
committing transaction before it is checked for an error. This can
happen repeatedly, so the error may go undetected for a long time.
Eventually, though, the error is detected (for example, by
journal_commit_transaction()) and we can abort the journal, so this
problem is less serious than the ones addressed by the other patches
I sent.

Thanks,

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


2008-05-16 10:27:05

by Hidehiro Kawai

Subject: Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)

Hi,

Thank you for review.

Jan Kara wrote:

> On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
>
>>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
>>
>>If we failed to write metadata buffers to the journal space and
>>succeeded to write the commit record, stale data can be written
>>back to the filesystem as metadata in the recovery phase.
>>
>>To avoid this, when we failed to write out metadata buffers,
>>abort the journal before writing the commit record.
>>
>>Signed-off-by: Hidehiro Kawai <[email protected]>
>>---
>> fs/jbd/commit.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>>Index: linux-2.6.26-rc2/fs/jbd/commit.c
>>===================================================================
>>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
>>+++ linux-2.6.26-rc2/fs/jbd/commit.c
>>@@ -703,6 +703,9 @@ wait_for_iobuf:
>> __brelse(bh);
>> }
>>
>>+ if (err)
>>+ journal_abort(journal, err);
>>+
>> J_ASSERT (commit_transaction->t_shadow_list == NULL);
>
> Shouldn't this rather be further just before
> journal_write_commit_record()? We should abort also if writing revoke
> records etc. failed, shouldn't we?

Unlike metadata blocks, each revoke block has a descriptor carrying the
sequence number of the committing transaction. If we fail to write a
revoke block, the location where we tried to write it will still hold
an old control block, a metadata block, or a zero-filled block. During
recovery, this stale block is detected by checking its magic number and
sequence number, and the transaction is then ignored even if we
succeeded in writing the commit record. So I think we don't need to
check for errors just after writing the revoke records.

Thanks,

>> jbd_debug(3, "JBD: commit phase 5\n");
>>

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


2008-05-16 10:28:20

by Hidehiro Kawai

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Hi,

Thank you for review.

Josef Bacik wrote:

> On Wed, May 14, 2008 at 04:44:10PM +0200, Jan Kara wrote:
>
>>>>
>>>>Index: linux-2.6.26-rc2/fs/ext3/super.c
>>>>===================================================================
>>>>--- linux-2.6.26-rc2.orig/fs/ext3/super.c
>>>>+++ linux-2.6.26-rc2/fs/ext3/super.c
>>>>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
>>>> ext3_xattr_put_super(sb);
>>>> journal_destroy(sbi->s_journal);
>>>> if (!(sb->s_flags & MS_RDONLY)) {
>>>>- EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
>>>>+ if (!is_checkpoint_aborted(sbi->s_journal)) {
>>>>+ EXT3_CLEAR_INCOMPAT_FEATURE(sb,
>>>>+ EXT3_FEATURE_INCOMPAT_RECOVER);
>>>>+ }
>>>> es->s_state = cpu_to_le16(sbi->s_mount_state);
>>>> BUFFER_TRACE(sbi->s_sbh, "marking dirty");
>>>> mark_buffer_dirty(sbi->s_sbh);
>>>
>>>Is this bit here really needed? If we abort the journal the fs will be mounted
>>>read only and we should never get in here. Is there a case where we could abort
>>>the journal and not be flipped to being read-only? If there is such a case I
>>>would think that we should fix that by making the fs read-only, and not have
>>>this check.
>>
>> Actually, journal_abort() (which could be called from journal_destroy())
>>does nothing to the filesystem as such. So at this moment, ext3 can still
>>happily think everything is OK. We only detect aborted journal in
>>ext3_journal_start_sb() and call ext3_abort() in that case, which does all
>>that is needed...

Yes, that is why I added this check.


> Hmm you're right, I was thinking we did some other stuff before put_super which
> would have caught a journal abort but it looks like thats not the case. Still
> shouldn't do is_checkpoint_aborted(sbi->s_journal) since journal_destroy()
> kfree's the journal. Should probably move the is_journal_aborted() check above
> that or something. Thanks,

Good catch, I will fix it.
Thanks!

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


2008-05-16 10:29:27

by Hidehiro Kawai

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Hi,
Thank you for the review.

Jan Kara wrote:

>> /*
>> + * We couldn't write back some metadata buffers to the filesystem.
>> + * To avoid unwritten-back metadata buffers from losing, set
>> + * JFS_CP_ABORT flag and make sure that the journal space isn't
>> + * cleaned. This function also aborts journaling.
>> + */
>> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
>> +{
>> + if (is_checkpoint_aborted(journal))
>> + return;
>> +
>> + spin_lock(&journal->j_state_lock);
>> + journal->j_flags |= JFS_CP_ABORT;
>> + spin_unlock(&journal->j_state_lock);
>> + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
>> + "are still old.\n");
>> + journal_abort(journal, errno);
>> +}

[snip]

>>Index: linux-2.6.26-rc2/include/linux/jbd.h
>>===================================================================
>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
>>+++ linux-2.6.26-rc2/include/linux/jbd.h
>>@@ -816,6 +816,8 @@ struct journal_s
>> #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
>> #define JFS_LOADED 0x010 /* The journal superblock has been loaded */
>> #define JFS_BARRIER 0x020 /* Use IDE barriers */
>>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
>>+ * clean the journal space. */
>>
>> /*
>> * Function declarations for the journaling transaction and buffer
>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>> return journal->j_flags & JFS_ABORT;
>> }
>>
>>+static inline int is_checkpoint_aborted(journal_t *journal)
>>+{
>>+ return journal->j_flags & JFS_CP_ABORT;
>>+}
>>+
>
> Actually, I don't think this new flag is needed (not that it would really
> harm anything). At least at the places where you check it you can as well
> check whether the journal is aborted (maybe except for journal_destroy()
> but see my comment there).

As you say, JFS_CP_ABORT isn't strictly needed in most cases, but it
is needed for __journal_abort_checkpoint(), which reports the
checkpoint-related error via printk() only once. If we used JFS_ABORT
to check whether this is the second call, the error message might
never be printed, because the journal may already have been aborted
for another reason. If we don't check for a second call at all, the
error message will be printed several times.

Instead of using __journal_abort_checkpoint(), we could achieve much
the same by adding the printk() directly to journal_destroy(),
journal_flush(), and __log_wait_for_space() (though the message might
then appear more than once).

I agree that we should avoid adding another flag where possible,
so I'll revise the patch to remove the flag if you agree to adding
the 3 printk()s.
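The print-once behaviour being argued for here can be modelled in a few lines of userspace C. The flag values mirror the patch, but the counter stands in for the printk() and the spin_lock-protected flag update is omitted from this sketch:

```c
#include <assert.h>

/* Userspace model of the JFS_CP_ABORT print-once logic: the dedicated
 * flag lets the checkpoint-specific warning fire exactly once, even
 * when JFS_ABORT was already set for an unrelated reason. */
#define JFS_ABORT    0x002
#define JFS_CP_ABORT 0x040

int cp_warnings;  /* number of checkpoint warnings emitted */

void journal_abort_checkpoint(unsigned int *j_flags)
{
    if (*j_flags & JFS_CP_ABORT)   /* repeated call: stay silent */
        return;
    *j_flags |= JFS_CP_ABORT;
    cp_warnings++;                 /* stands in for the printk() */
    *j_flags |= JFS_ABORT;         /* stands in for journal_abort() */
}
```

Note that if the guard tested JFS_ABORT instead, a journal already aborted for some other reason would suppress the checkpoint warning entirely, which is exactly the concern raised above.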

> Please split ext3 changes into a separate patch. There's no reason
> why they should be together with JBD fixes (i.e. JBD fixes don't depend on
> ext3 being fixed). Thanks.

OK, I'll do so.


>>Index: linux-2.6.26-rc2/fs/ext3/super.c
>>===================================================================
>>--- linux-2.6.26-rc2.orig/fs/ext3/super.c
>>+++ linux-2.6.26-rc2/fs/ext3/super.c
>>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
>> ext3_xattr_put_super(sb);
>> journal_destroy(sbi->s_journal);
>> if (!(sb->s_flags & MS_RDONLY)) {
>>- EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
>>+ if (!is_checkpoint_aborted(sbi->s_journal)) {
>>+ EXT3_CLEAR_INCOMPAT_FEATURE(sb,
>>+ EXT3_FEATURE_INCOMPAT_RECOVER);
>>+ }
>
> I think you should test here whether the journal is aborted (and
> EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just
> completely avoid updating super block...

Your idea looks good. But as Josef pointed out, journal_destroy()
frees sbi->s_journal, so we need to make journal_destroy() return the
error status.


>>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
>> journal->j_tail = 0;
>> journal->j_tail_sequence = ++journal->j_transaction_sequence;
>> if (journal->j_sb_buffer) {
>>- journal_update_superblock(journal, 1);
>>+ if (!is_checkpoint_aborted(journal))
>>+ journal_update_superblock(journal, 1);
>> brelse(journal->j_sb_buffer);
>> }
>
> I don't like this much. I'd rather completely avoid updating j_tail and
> j_tail_sequence above in case the journal is aborted but I'd write the
> journal superblock so that information about abortion gets written...

log_do_checkpoint() calls cleanup_journal_tail(), which advances
j_tail and j_tail_sequence. So if we adopt the policy of not
modifying j_tail and j_tail_sequence when checkpointing fails, we
should also fix cleanup_journal_tail().

I adopted the policy of not updating the journal superblock when
checkpointing fails. When checkpointing fails, journal_abort() is
called before cleanup_journal_tail(), so the information we want to
write out should already be in the journal superblock.


>>@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
>>
>> int journal_flush(journal_t *journal)
>> {
>>- int err = 0;
>> transaction_t *transaction = NULL;
>> unsigned long old_tail;
>>
>>@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
>> spin_unlock(&journal->j_state_lock);
>> }
>>
>>- /* ...and flush everything in the log out to disk. */
>>+ /* ...and flush everything in the log out to disk.
>>+ * Even if an error occurs, we don't stop this loop.
>>+ * We do checkpoint as much as possible. */
>> spin_lock(&journal->j_list_lock);
>>- while (!err && journal->j_checkpoint_transactions != NULL) {
>>+ while (journal->j_checkpoint_transactions != NULL) {
>> spin_unlock(&journal->j_list_lock);
>>- err = log_do_checkpoint(journal);
>>+ log_do_checkpoint(journal);
>> spin_lock(&journal->j_list_lock);
>> }
>> spin_unlock(&journal->j_list_lock);
>>+ if (is_checkpoint_aborted(journal))
>>+ return -EIO;
>>+
>
> Why do you change the loop? If we fail to checkpoint some buffer, we stop
> journaling anyway and so journal will be replayed when fsck is run...

Certainly, fsck or journal_recover() replays the unwritten-back
metadata blocks, and journal_destroy() also calls log_do_checkpoint()
against all un-checkpointed transactions. I just thought `flush'
meant writing out all the data that we should write to disk. There
is no problem if I don't change the loop.

Thanks,

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


2008-05-19 03:11:53

by Jan Kara

Subject: Re: [PATCH 2/4] jbd: ordered data integrity fix (rebased)

Hello,

On Fri 16-05-08 19:25:44, Hidehiro Kawai wrote:
> Jan Kara wrote:
>
> > On Wed 14-05-08 13:48:43, Hidehiro Kawai wrote:
> >
> >>Subject: [PATCH 2/4] jbd: ordered data integrity fix
> >>
> >>In ordered mode, if a buffer being dirtied exists in the committing
> >>transaction, we write the buffer to the disk, move it from the
> >>committing transaction to the running transaction, then dirty it.
> >>But we don't have to remove the buffer from the committing
> >>transaction when the buffer couldn't be written out, otherwise it
> >>breaks the ordered mode rule.
> >
> > Hmm, could you elaborate a bit more what exactly is broken and how does
> > this help to fix it? Because even if we find EIO happened on data buffer,
> > we currently don't do anything else than just remove the buffer from the
> > transaction and abort the journal. And even if we later managed to write
> > the data buffer from other process before the journal is aborted, ordered
> > mode guarantees are satisfied - we only guarantee that too old data cannot
> > be seen, newer can be seen easily... Thanks.
>
> In the case where I stated the above, error checking is postponed to
> the next (currently running) transaction because the buffer is removed
> from the committing transaction before checked for an error. This can
> happen repeatedly, then the error won't be detected "for a long time".
> However, finally the error is detected by, for example,
> journal_commit_transaction(), we can abort the journal. So this
> problem is not so serious than the other patches which I sent.
OK, I see. So I agree with the change, but please add this explanation
(something like: we cannot remove a buffer with an I/O error from the
committing transaction, because otherwise we would miss the error and
the commit would not abort) to the comment in journal_dirty_data().
Thanks.
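The rule under discussion can be shown with a toy model in C. The two int fields stand in for buffer_uptodate() and the buffer's transaction pointer; this is a sketch of the decision only, not the real journal_dirty_data() with its locking and refiling:

```c
#include <assert.h>

/* Toy model of the journal_dirty_data() decision: a data buffer that
 * hit a write error must stay on the committing transaction so that
 * journal_commit_transaction() sees the error and aborts the journal.
 * Field names are illustrative, not the real buffer-head API. */
struct buf {
    int write_error;    /* models !buffer_uptodate() after writeback */
    int on_committing;  /* models jh->b_transaction == committing tx */
};

void dirty_data_buffer(struct buf *b)
{
    if (b->on_committing && b->write_error)
        return;           /* keep it there: commit must notice the EIO */
    b->on_committing = 0; /* otherwise file it under the running tx */
}
```

A buffer with a write error stays on the committing transaction; a clean one is moved to the running transaction as before.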

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-19 03:14:33

by Jan Kara

Subject: Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)

Hello,

On Fri 16-05-08 19:26:57, Hidehiro Kawai wrote:
> Jan Kara wrote:
>
> > On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
> >
> >>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
> >>
> >>If we failed to write metadata buffers to the journal space and
> >>succeeded to write the commit record, stale data can be written
> >>back to the filesystem as metadata in the recovery phase.
> >>
> >>To avoid this, when we failed to write out metadata buffers,
> >>abort the journal before writing the commit record.
> >>
> >>Signed-off-by: Hidehiro Kawai <[email protected]>
> >>---
> >> fs/jbd/commit.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >>Index: linux-2.6.26-rc2/fs/jbd/commit.c
> >>===================================================================
> >>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
> >>+++ linux-2.6.26-rc2/fs/jbd/commit.c
> >>@@ -703,6 +703,9 @@ wait_for_iobuf:
> >> __brelse(bh);
> >> }
> >>
> >>+ if (err)
> >>+ journal_abort(journal, err);
> >>+
> >> J_ASSERT (commit_transaction->t_shadow_list == NULL);
> >
> > Shouldn't this rather be further just before
> > journal_write_commit_record()? We should abort also if writing revoke
> > records etc. failed, shouldn't we?
>
> Unlike metadata blocks, each revoke block has a descriptor with the
> sequence number of the commiting transaction. If we failed to write
> a revoke block, there should be an old control block, metadata block,
> or zero-filled block where we tried to write the revoke block.
> In the recovery process, this old invalid block is detected by
> checking its magic number and sequence number, then the transaction
> is ignored even if we have succeeded to write the commit record.
> So I think we don't need to check for errors just after writing
> revoke records.
Yes, I agree that omitting such a check will not cause data corruption,
but I still think that in case we fail to properly commit a transaction,
we should detect the error and abort the journal...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-19 03:38:07

by Jan Kara

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Hello,

On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
> Jan Kara wrote:
> >> /*
> >> + * We couldn't write back some metadata buffers to the filesystem.
> >> + * To avoid unwritten-back metadata buffers from losing, set
> >> + * JFS_CP_ABORT flag and make sure that the journal space isn't
> >> + * cleaned. This function also aborts journaling.
> >> + */
> >> +static void __journal_abort_checkpoint(journal_t *journal, int errno)
> >> +{
> >> + if (is_checkpoint_aborted(journal))
> >> + return;
> >> +
> >> + spin_lock(&journal->j_state_lock);
> >> + journal->j_flags |= JFS_CP_ABORT;
> >> + spin_unlock(&journal->j_state_lock);
> >> + printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
> >> + "are still old.\n");
> >> + journal_abort(journal, errno);
> >> +}
>
> [snip]
>
> >>Index: linux-2.6.26-rc2/include/linux/jbd.h
> >>===================================================================
> >>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
> >>+++ linux-2.6.26-rc2/include/linux/jbd.h
> >>@@ -816,6 +816,8 @@ struct journal_s
> >> #define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
> >> #define JFS_LOADED 0x010 /* The journal superblock has been loaded */
> >> #define JFS_BARRIER 0x020 /* Use IDE barriers */
> >>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
> >>+ * clean the journal space. */
> >>
> >> /*
> >> * Function declarations for the journaling transaction and buffer
> >>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
> >> return journal->j_flags & JFS_ABORT;
> >> }
> >>
> >>+static inline int is_checkpoint_aborted(journal_t *journal)
> >>+{
> >>+ return journal->j_flags & JFS_CP_ABORT;
> >>+}
> >>+
> >
> > Actually, I don't think this new flag is needed (not that it would really
> > harm anything). At least at the places where you check it you can as well
> > check whether the journal is aborted (maybe except for journal_destroy()
> > but see my comment there).
>
> As you say, JFS_CP_ABORT isn't necessarily needed in the most cases,
> but it is needed for __journal_abort_checkpoint() which report the
> checkpoint related error by printk only once. If we use JFS_ABORT
> to check whether this call is second time, the error message may be
> never printed out because the journal has been aborted by another
> reason. If we don't check for the second call, the error message
> will be printed out several times.
Yes, I think that checking for JFS_ABORT is the right thing to do.
Once the journal has been aborted for some reason, it is enough that we
print some error message (and that is the responsibility of the first
caller of journal_abort()). Printing in addition that checkpointing has
not succeeded is IMO not needed.

> Instead of using __journal_abort_checkpoint(), we'll be able to do
> the similar thing by adding the printk() directly in journal_destroy(),
> journal_flush(), and __log_wait_for_space() (but it can be more
> than one time).
>
> I agree that we should not add another flag as much as possible.
> So I'll try to revise to remove the flag if you agree to add
> 3 printk()s.
Well, as I said, one more flag in the journal is not a big deal, but I
found it a bit confusing that there's a special flag for checkpoint
abort when the standard abort flag would do fine as well.

> >>Index: linux-2.6.26-rc2/fs/ext3/super.c
> >>===================================================================
> >>--- linux-2.6.26-rc2.orig/fs/ext3/super.c
> >>+++ linux-2.6.26-rc2/fs/ext3/super.c
> >>@@ -395,7 +395,10 @@ static void ext3_put_super (struct super
> >> ext3_xattr_put_super(sb);
> >> journal_destroy(sbi->s_journal);
> >> if (!(sb->s_flags & MS_RDONLY)) {
> >>- EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
> >>+ if (!is_checkpoint_aborted(sbi->s_journal)) {
> >>+ EXT3_CLEAR_INCOMPAT_FEATURE(sb,
> >>+ EXT3_FEATURE_INCOMPAT_RECOVER);
> >>+ }
> >
> > I think you should test here whether the journal is aborted (and
> > EXT3_MOUNT_ABORT isn't set) and if so, call ext3_abort() and just
> > completely avoid updating super block...
>
> Your idea looks good. But as Josef pointed out, journal_destroy()
> frees sbi->s_journal, so we need to make journal_destroy() return the
> error status.
Yes.

> >>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
> >> journal->j_tail = 0;
> >> journal->j_tail_sequence = ++journal->j_transaction_sequence;
> >> if (journal->j_sb_buffer) {
> >>- journal_update_superblock(journal, 1);
> >>+ if (!is_checkpoint_aborted(journal))
> >>+ journal_update_superblock(journal, 1);
> >> brelse(journal->j_sb_buffer);
> >> }
> >
> > I don't like this much. I'd rather completely avoid updating j_tail and
> > j_tail_sequence above in case the journal is aborted but I'd write the
> > journal superblock so that information about abortion gets written...
>
> log_do_checkpoint() calls cleanup_journal_tail(), which advances
> j_tail and j_tail_sequence. So if we adopt the policy that we don't
> modify j_tail and j_tail_sequence when checkpointing failed, we should
> also fix cleanup_journal_tail().
>
> I adopted the policy that we don't update the journal super block
> when checkpointing failed. When checkpointing failed, journal_abort()
> is called before cleanup_journal_tail(). So what we want to write
> should be in the journal super block.
I see. The thing I'm afraid of with this policy is that if someone
later adds a journal_update_superblock() call somewhere and forgets to
check whether the journal is aborted, we will magically get filesystem
corruption when an I/O error happens, and that would be really hard to
debug. So I'd rather refrain from updating j_tail and j_tail_sequence.

> >>@@ -1333,7 +1334,6 @@ static int journal_convert_superblock_v1
> >>
> >> int journal_flush(journal_t *journal)
> >> {
> >>- int err = 0;
> >> transaction_t *transaction = NULL;
> >> unsigned long old_tail;
> >>
> >>@@ -1356,14 +1356,19 @@ int journal_flush(journal_t *journal)
> >> spin_unlock(&journal->j_state_lock);
> >> }
> >>
> >>- /* ...and flush everything in the log out to disk. */
> >>+ /* ...and flush everything in the log out to disk.
> >>+ * Even if an error occurs, we don't stop this loop.
> >>+ * We do checkpoint as much as possible. */
> >> spin_lock(&journal->j_list_lock);
> >>- while (!err && journal->j_checkpoint_transactions != NULL) {
> >>+ while (journal->j_checkpoint_transactions != NULL) {
> >> spin_unlock(&journal->j_list_lock);
> >>- err = log_do_checkpoint(journal);
> >>+ log_do_checkpoint(journal);
> >> spin_lock(&journal->j_list_lock);
> >> }
> >> spin_unlock(&journal->j_list_lock);
> >>+ if (is_checkpoint_aborted(journal))
> >>+ return -EIO;
> >>+
> >
> > Why do you change the loop? If we fail to checkpoint some buffer, we stop
> > journaling anyway and so journal will be replayed when fsck is run...
>
> Certainly, fsck or journal_recover() replay the unwritten-back metadata
> block. journal_destroy() also call log_do_checkpoint() against all
> un-checkpointed transactions. I just thought `flush' means writing out
> all data which we should write to disk. There is no problem if I
> don't change the loop.
"flush" means "make journal empty". If we are not able to do it all, it
does not really matter how much do we manage to write out. So I wouldn't
change the loop.
Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-05-21 01:33:26

by Hidehiro Kawai

Subject: Re: [PATCH 3/4] jbd: abort when failed to log metadata buffers (rebased)

Hi,

Jan Kara wrote:
>
> On Fri 16-05-08 19:26:57, Hidehiro Kawai wrote:
>
>>Jan Kara wrote:
>>
>>
>>>On Wed 14-05-08 13:49:51, Hidehiro Kawai wrote:
>>>
>>>
>>>>Subject: [PATCH 3/4] jbd: abort when failed to log metadata buffers
>>>>
>>>>If we failed to write metadata buffers to the journal space and
>>>>succeeded to write the commit record, stale data can be written
>>>>back to the filesystem as metadata in the recovery phase.
>>>>
>>>>To avoid this, when we failed to write out metadata buffers,
>>>>abort the journal before writing the commit record.
>>>>
>>>>Signed-off-by: Hidehiro Kawai <[email protected]>
>>>>---
>>>>fs/jbd/commit.c | 3 +++
>>>>1 file changed, 3 insertions(+)
>>>>
>>>>Index: linux-2.6.26-rc2/fs/jbd/commit.c
>>>>===================================================================
>>>>--- linux-2.6.26-rc2.orig/fs/jbd/commit.c
>>>>+++ linux-2.6.26-rc2/fs/jbd/commit.c
>>>>@@ -703,6 +703,9 @@ wait_for_iobuf:
>>>> __brelse(bh);
>>>> }
>>>>
>>>>+ if (err)
>>>>+ journal_abort(journal, err);
>>>>+
>>>> J_ASSERT (commit_transaction->t_shadow_list == NULL);
>>>
>>> Shouldn't this rather be further just before
>>>journal_write_commit_record()? We should abort also if writing revoke
>>>records etc. failed, shouldn't we?
>>
>>Unlike metadata blocks, each revoke block has a descriptor with the
>>sequence number of the commiting transaction. If we failed to write
>>a revoke block, there should be an old control block, metadata block,
>>or zero-filled block where we tried to write the revoke block.
>>In the recovery process, this old invalid block is detected by
>>checking its magic number and sequence number, then the transaction
>>is ignored even if we have succeeded to write the commit record.
>>So I think we don't need to check for errors just after writing
>>revoke records.
>
> Yes, I agree that not doing such check will not cause data corruption but
> still I think that in case we fail to properly commit a transaction, we
> should detect the error and abort the journal...

I see. I'll move the aborting point to just before
journal_write_commit_record() in the next version.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


2008-05-21 01:35:04

by Hidehiro Kawai

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Hi,

Jan Kara wrote:
>
> On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
>
>>Jan Kara wrote:
>>
>>>>/*
>>>>+ * We couldn't write back some metadata buffers to the filesystem.
>>>>+ * To avoid unwritten-back metadata buffers from losing, set
>>>>+ * JFS_CP_ABORT flag and make sure that the journal space isn't
>>>>+ * cleaned. This function also aborts journaling.
>>>>+ */
>>>>+static void __journal_abort_checkpoint(journal_t *journal, int errno)
>>>>+{
>>>>+ if (is_checkpoint_aborted(journal))
>>>>+ return;
>>>>+
>>>>+ spin_lock(&journal->j_state_lock);
>>>>+ journal->j_flags |= JFS_CP_ABORT;
>>>>+ spin_unlock(&journal->j_state_lock);
>>>>+ printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
>>>>+ "are still old.\n");
>>>>+ journal_abort(journal, errno);
>>>>+}
>>
>>[snip]
>>
>>
>>>>Index: linux-2.6.26-rc2/include/linux/jbd.h
>>>>===================================================================
>>>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
>>>>+++ linux-2.6.26-rc2/include/linux/jbd.h
>>>>@@ -816,6 +816,8 @@ struct journal_s
>>>>#define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
>>>>#define JFS_LOADED 0x010 /* The journal superblock has been loaded */
>>>>#define JFS_BARRIER 0x020 /* Use IDE barriers */
>>>>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
>>>>+ * clean the journal space. */
>>>>
>>>>/*
>>>> * Function declarations for the journaling transaction and buffer
>>>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>>>> return journal->j_flags & JFS_ABORT;
>>>>}
>>>>
>>>>+static inline int is_checkpoint_aborted(journal_t *journal)
>>>>+{
>>>>+ return journal->j_flags & JFS_CP_ABORT;
>>>>+}
>>>>+
>>>
>>> Actually, I don't think this new flag is needed (not that it would really
>>>harm anything). At least at the places where you check it you can as well
>>>check whether the journal is aborted (maybe except for journal_destroy()
>>>but see my comment there).
>>
>>As you say, JFS_CP_ABORT isn't necessarily needed in the most cases,
>>but it is needed for __journal_abort_checkpoint() which report the
>>checkpoint related error by printk only once. If we use JFS_ABORT
>>to check whether this call is second time, the error message may be
>>never printed out because the journal has been aborted by another
>>reason. If we don't check for the second call, the error message
>>will be printed out several times.
>
> Yes, I think that checking out for JFS_ABORT is the right thing to do.
> Once the journal has aborted for some reason, it is enough that we print
> some error message (and that is responsibility of the first caller of
> journal_abort()). Printing that checkpointing has not succeeded as well is
> IMO not needed.

A checkpointing failure is a bit special. If the journal is aborted
by journal_commit_transaction(), the integrity of the filesystem is
preserved, although the latest changes will be lost. However, if the
journal is aborted by log_do_checkpoint() and the replay also fails,
the filesystem may be corrupted because some of the metadata blocks
are still in their old state. In this case, the user will have to
recover the filesystem manually and carefully. So I think printing a
special message is needed to inform users about that.


>>>>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
>>>> journal->j_tail = 0;
>>>> journal->j_tail_sequence = ++journal->j_transaction_sequence;
>>>> if (journal->j_sb_buffer) {
>>>>- journal_update_superblock(journal, 1);
>>>>+ if (!is_checkpoint_aborted(journal))
>>>>+ journal_update_superblock(journal, 1);
>>>> brelse(journal->j_sb_buffer);
>>>> }
>>>
>>> I don't like this much. I'd rather completely avoid updating j_tail and
>>>j_tail_sequence above in case the journal is aborted but I'd write the
>>>journal superblock so that information about abortion gets written...
>>
>>log_do_checkpoint() calls cleanup_journal_tail(), which advances
>>j_tail and j_tail_sequence. So if we adopt the policy that we don't
>>modify j_tail and j_tail_sequence when checkpointing failed, we should
>>also fix cleanup_journal_tail().
>>
>>I adopted the policy that we don't update the journal super block
>>when checkpointing failed. When checkpointing failed, journal_abort()
>>is called before cleanup_journal_tail(). So what we want to write
>>should be in the journal super block.
>
> I see. The thing I'm afraid of with this policy is, that when sometime later
> we add somewhere journal_update_superblock() and forget about checking
> whether journal isn't aborted, we will magically get filesystem corruption
> when IO error happens and that would be really hard to debug. So I'd
> rather refrain from updating j_tail and j_tail_sequence.

I can understand your concern, since the current JBD updates the
journal superblock even if a checkpoint has failed. I think it can be
improved by adding a comment to journal_update_superblock(), for
example: don't invoke this function if checkpointing has failed,
otherwise unwritten-back journaled data can be lost.

Or we may stop both modifying j_tail and updating the superblock when
the journal has been aborted (especially for a checkpoint failure).
Of course, __journal_abort_soft() would still update the superblock
once.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


2008-05-23 22:51:09

by Jan Kara

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Hello,

> Jan Kara wrote:
> >
> > On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
> >
> >>Jan Kara wrote:
> >>
> >>>>/*
> >>>>+ * We couldn't write back some metadata buffers to the filesystem.
> >>>>+ * To avoid unwritten-back metadata buffers from losing, set
> >>>>+ * JFS_CP_ABORT flag and make sure that the journal space isn't
> >>>>+ * cleaned. This function also aborts journaling.
> >>>>+ */
> >>>>+static void __journal_abort_checkpoint(journal_t *journal, int errno)
> >>>>+{
> >>>>+ if (is_checkpoint_aborted(journal))
> >>>>+ return;
> >>>>+
> >>>>+ spin_lock(&journal->j_state_lock);
> >>>>+ journal->j_flags |= JFS_CP_ABORT;
> >>>>+ spin_unlock(&journal->j_state_lock);
> >>>>+ printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
> >>>>+ "are still old.\n");
> >>>>+ journal_abort(journal, errno);
> >>>>+}
> >>
> >>[snip]
> >>
> >>
> >>>>Index: linux-2.6.26-rc2/include/linux/jbd.h
> >>>>===================================================================
> >>>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
> >>>>+++ linux-2.6.26-rc2/include/linux/jbd.h
> >>>>@@ -816,6 +816,8 @@ struct journal_s
> >>>>#define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
> >>>>#define JFS_LOADED 0x010 /* The journal superblock has been loaded */
> >>>>#define JFS_BARRIER 0x020 /* Use IDE barriers */
> >>>>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
> >>>>+ * clean the journal space. */
> >>>>
> >>>>/*
> >>>> * Function declarations for the journaling transaction and buffer
> >>>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
> >>>> return journal->j_flags & JFS_ABORT;
> >>>>}
> >>>>
> >>>>+static inline int is_checkpoint_aborted(journal_t *journal)
> >>>>+{
> >>>>+ return journal->j_flags & JFS_CP_ABORT;
> >>>>+}
> >>>>+
> >>>
> >>> Actually, I don't think this new flag is needed (not that it would really
> >>>harm anything). At least at the places where you check it you can as well
> >>>check whether the journal is aborted (maybe except for journal_destroy()
> >>>but see my comment there).
> >>
> >>As you say, JFS_CP_ABORT isn't necessarily needed in the most cases,
> >>but it is needed for __journal_abort_checkpoint() which report the
> >>checkpoint related error by printk only once. If we use JFS_ABORT
> >>to check whether this call is second time, the error message may be
> >>never printed out because the journal has been aborted by another
> >>reason. If we don't check for the second call, the error message
> >>will be printed out several times.
> >
> > Yes, I think that checking out for JFS_ABORT is the right thing to do.
> > Once the journal has aborted for some reason, it is enough that we print
> > some error message (and that is responsibility of the first caller of
> > journal_abort()). Printing that checkpointing has not succeeded as well is
> > IMO not needed.
>
> A checkpointing failure is a bit special. If the journal is aborted
> by journal_commit_transaction(), the integrity of the filesystem is
> ensured although the latest changes will be lost. However, if the
> journal is aborted by log_do_checkpoint() and the replay also failed,
> the filesystem may be corrupted because some of the metadata blocks
> are in old states. In this case, the user will have to recover the
> filesystem manually and carefully. So I think printing the special
> message is needed to inform users about that.
Is this really different? If we spot an error during checkpointing, we
abort the journal (so the data still remains in the journal, because
checkpointing has failed). On the next mount, either the replay
succeeds and everything is fine, or we spot an error during replay and
we should just refuse to mount the filesystem. So we are correct again
- the user then has to run fsck to solve the problem, and it will try
its best to get most data readable. So I don't see why a checkpointing
failure would need special attention.

> >>>>@@ -1151,7 +1151,8 @@ void journal_destroy(journal_t *journal)
> >>>> journal->j_tail = 0;
> >>>> journal->j_tail_sequence = ++journal->j_transaction_sequence;
> >>>> if (journal->j_sb_buffer) {
> >>>>- journal_update_superblock(journal, 1);
> >>>>+ if (!is_checkpoint_aborted(journal))
> >>>>+ journal_update_superblock(journal, 1);
> >>>> brelse(journal->j_sb_buffer);
> >>>> }
> >>>
> >>> I don't like this much. I'd rather completely avoid updating j_tail and
> >>>j_tail_sequence above in case the journal is aborted but I'd write the
> >>>journal superblock so that information about abortion gets written...
> >>
> >>log_do_checkpoint() calls cleanup_journal_tail(), which advances
> >>j_tail and j_tail_sequence. So if we adopt the policy that we don't
> >>modify j_tail and j_tail_sequence when checkpointing failed, we should
> >>also fix cleanup_journal_tail().
> >>
> >>I adopted the policy that we don't update the journal super block
> >>when checkpointing failed. When checkpointing failed, journal_abort()
> >>is called before cleanup_journal_tail(). So what we want to write
> >>should be in the journal super block.
> >
> > I see. The thing I'm afraid of with this policy is that if we later
> > add a journal_update_superblock() call somewhere and forget to check
> > whether the journal is aborted, we will magically get filesystem
> > corruption when an IO error happens, and that would be really hard to
> > debug. So I'd rather refrain from updating j_tail and j_tail_sequence.
>
> I can understand your concern, as the current JBD updates the journal
> super block even if a checkpoint has failed. I think this can be
> mitigated by adding a comment to journal_update_superblock().
> For example: don't invoke this function if checkpointing has
> failed, otherwise journaled data which hasn't been written back
> can be lost.
>
> Or we may stop both modifying j_tail and updating the super block
> when the journal has aborted (especially due to a checkpoint
> failure). Of course, __journal_abort_soft() still updates the
> super block once.
OK, the last option is also fine with me. Thanks for your work.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2008-05-26 04:57:33

by Hidehiro Kawai

Subject: Re: [PATCH 4/4] jbd: fix error handling for checkpoint io (rebased)

Hello,

>>Jan Kara wrote:
>>
>>>On Fri 16-05-08 19:29:15, Hidehiro Kawai wrote:
>>>
>>>
>>>>Jan Kara wrote:
>>>>
>>>>
>>>>>>/*
>>>>>>+ * We couldn't write back some metadata buffers to the filesystem.
>>>>>>+ * To avoid losing metadata buffers that haven't been written back,
>>>>>>+ * set the JFS_CP_ABORT flag and make sure that the journal space
>>>>>>+ * isn't cleaned. This function also aborts journaling.
>>>>>>+ */
>>>>>>+static void __journal_abort_checkpoint(journal_t *journal, int errno)
>>>>>>+{
>>>>>>+ if (is_checkpoint_aborted(journal))
>>>>>>+ return;
>>>>>>+
>>>>>>+ spin_lock(&journal->j_state_lock);
>>>>>>+ journal->j_flags |= JFS_CP_ABORT;
>>>>>>+ spin_unlock(&journal->j_state_lock);
>>>>>>+ printk(KERN_WARNING "JBD: Checkpointing failed. Some metadata blocks "
>>>>>>+ "are still old.\n");
>>>>>>+ journal_abort(journal, errno);
>>>>>>+}
>>>>
>>>>[snip]
>>>>
>>>>
>>>>
>>>>>>Index: linux-2.6.26-rc2/include/linux/jbd.h
>>>>>>===================================================================
>>>>>>--- linux-2.6.26-rc2.orig/include/linux/jbd.h
>>>>>>+++ linux-2.6.26-rc2/include/linux/jbd.h
>>>>>>@@ -816,6 +816,8 @@ struct journal_s
>>>>>>#define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
>>>>>>#define JFS_LOADED 0x010 /* The journal superblock has been loaded */
>>>>>>#define JFS_BARRIER 0x020 /* Use IDE barriers */
>>>>>>+#define JFS_CP_ABORT 0x040 /* Checkpointing has failed. We don't have to
>>>>>>+ * clean the journal space. */
>>>>>>
>>>>>>/*
>>>>>>* Function declarations for the journaling transaction and buffer
>>>>>>@@ -1004,6 +1006,11 @@ static inline int is_journal_aborted(jou
>>>>>> return journal->j_flags & JFS_ABORT;
>>>>>>}
>>>>>>
>>>>>>+static inline int is_checkpoint_aborted(journal_t *journal)
>>>>>>+{
>>>>>>+ return journal->j_flags & JFS_CP_ABORT;
>>>>>>+}
>>>>>>+
>>>>>
>>>>> Actually, I don't think this new flag is needed (not that it would really
>>>>>harm anything). At least in the places where you check it, you could just
>>>>>as well check whether the journal is aborted (maybe except for
>>>>>journal_destroy(), but see my comment there).
>>>>
>>>>As you say, JFS_CP_ABORT isn't strictly needed in most cases,
>>>>but it is needed for __journal_abort_checkpoint(), which reports the
>>>>checkpoint-related error via printk only once. If we used JFS_ABORT
>>>>to check whether this is the second call, the error message might
>>>>never be printed, because the journal may have been aborted for
>>>>another reason. If we don't check for a second call, the error
>>>>message will be printed several times.
>>>
>>> Yes, I think that checking for JFS_ABORT is the right thing to do.
>>>Once the journal has aborted for some reason, it is enough that we print
>>>some error message (and that is the responsibility of the first caller of
>>>journal_abort()). Also printing that checkpointing has not succeeded is
>>>IMO not needed.
>>
>>A checkpointing failure is a bit special. If the journal is aborted
>>by journal_commit_transaction(), the integrity of the filesystem is
>>ensured although the latest changes will be lost. However, if the
>>journal is aborted by log_do_checkpoint() and the replay also failed,
>>the filesystem may be corrupted because some of the metadata blocks
>>are in old states. In this case, the user will have to recover the
>>filesystem manually and carefully. So I think printing the special
>>message is needed to inform users about that.
>
> Is this really different? If we spot an error during checkpointing, we
> abort the journal (so the data still remain in the journal because
> checkpointing has failed). On the next mount, either the replay succeeds
> and everything is fine, or we spot an error during the replay and should
> just refuse to mount the filesystem. So we are correct again: the user
> then has to run fsck to solve the problem, and it will try its best to
> make most of the data readable. So I don't see why a checkpointing
> failure would need special attention.

I understand.

I had been thinking of the following scenario: when both checkpointing
and replay fail, copy all readable blocks from the bad disk to a good
disk, then replay all the journaled data on the good disk. But that
would be overkill. Normally we run fsck when the mount fails, and that
is enough for us.

I'm going to just use JFS_ABORT instead of introducing JFS_CP_ABORT
in the next revision.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center