2008-06-02 10:40:53

by Hidehiro Kawai

Subject: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2)

Subject: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2)

This patch set is take 2 of the fixes for the error handling problems in
ext3/JBD. The previous discussion can be found here:
http://lkml.org/lkml/2008/5/14/10

The same problems should also exist in ext4/JBD, but I haven't prepared
patches for it yet.


Problem
=======
Currently some error checks are missing, so the journal cannot abort
correctly. This breaks the ordered-mode guarantee and can cause filesystem
corruption. The missing error checks are listed below (a rough sketch of
the common fix pattern follows the list):

(1) error check for dirty buffers flushed before the commit
(addressed by PATCH 1/5 and 2/5)
(2) error check for the metadata writes to the journal before the
commit (addressed by PATCH 3/5)
(3) error check for checkpointing and replay (addressed by PATCH 4/5
and 5/5)
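
(Illustrative sketch only, not taken from the patches; the helper name
below is made up. It just shows the common shape of the missing checks:
once the submitted I/O has completed, look at the buffer state and
propagate any failure so that the journal gets aborted instead of
committing or checkpointing silently.)

#include <linux/buffer_head.h>
#include <linux/jbd.h>

/* Illustrative sketch, not actual patch code. */
static int wait_and_check_buffer(journal_t *journal, struct buffer_head *bh)
{
	int err = 0;

	/* Wait for the previously submitted write-out to complete. */
	wait_on_buffer(bh);

	/* A buffer that is not uptodate after write-back hit an I/O error. */
	if (unlikely(!buffer_uptodate(bh)))
		err = -EIO;

	/* Record the failure instead of silently continuing. */
	if (err)
		journal_abort(journal, err);

	return err;
}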


Changes from take 1
===================
[PATCH 1/5]
o not changed

[PATCH 2/5]
o rewrite the comment in journal_dirty_data() so that it is easier to
understand

[PATCH 3/5]
o check for errors and abort the journal just before
journal_write_commit_record() instead of after writing metadata
buffers

[PATCH 4/5 and 5/5]
o separate the ext3 part from the jbd part into its own patch
o use JFS_ABORT for checkpointing failures instead of introducing a
JFS_CP_ABORT flag
o when the journal has aborted, don't update the journal super block,
and don't update j_tail or j_tail_sequence either (strictly, we only
have to avoid updating the super block, but keeping the j_tail*
values unchanged is a good thing because it may protect someone from
introducing bugs in the future)
o journal_destroy() returns -EIO when the journal has aborted so that
ext3_put_super() can detect the abort
o journal_flush() uses j_checkpoint_mutex to avoid a race with
__log_wait_for_space()

The last item addresses a newly found problem. journal_flush() can be
called while __log_wait_for_space() is being processed. In this case,
cleanup_journal_tail() can be called between
__journal_drop_transaction() and journal_abort(), and the transaction
with the checkpointing failure is then lost from the journal.
By using j_checkpoint_mutex, which __log_wait_for_space() already uses,
we should avoid this race condition. But the testing is not really
sufficient because it is very difficult to reproduce this race, so I
hope that this locking is reviewed carefully (including the
possibility of deadlock).
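
To make the race easier to see, here is a rough, simplified sketch of the
problematic interleaving (illustrative only, not code from the patches):

/*
 * Simplified, illustrative interleaving:
 *
 *   __log_wait_for_space()                journal_flush()
 *     log_do_checkpoint()
 *       write-out of a checkpoint
 *       buffer fails
 *       __journal_drop_transaction()
 *                                           cleanup_journal_tail()
 *                                             (the journal tail moves past
 *                                              the dropped transaction)
 *       journal_abort()
 *
 * Once the tail has moved, the transaction whose checkpoint failed can no
 * longer be replayed from the journal.  Taking j_checkpoint_mutex around
 * log_do_checkpoint() in journal_flush() serializes it against
 * __log_wait_for_space(), which already takes the same mutex.
 */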

Regards,

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


2008-06-02 10:44:22

by Hidehiro Kawai

Subject: [PATCH 1/5] jbd: strictly check for write errors on data buffers

Subject: [PATCH 1/5] jbd: strictly check for write errors on data buffers

In ordered mode, we should abort journaling when an I/O error has
occurred on a file data buffer in the committing transaction. But
there can be data buffers which are not checked for error:

(a) the buffer which has already been written out by pdflush
(b) the buffer which has been unlocked before being scanned in the
t_locked_list loop

This patch adds missing error checks and aborts journaling
appropriately.

Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Jan Kara <[email protected]>
---
fs/jbd/commit.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -172,7 +172,7 @@ static void journal_do_submit_data(struc
/*
* Submit all the data buffers to disk
*/
-static void journal_submit_data_buffers(journal_t *journal,
+static int journal_submit_data_buffers(journal_t *journal,
transaction_t *commit_transaction)
{
struct journal_head *jh;
@@ -180,6 +180,7 @@ static void journal_submit_data_buffers(
int locked;
int bufs = 0;
struct buffer_head **wbuf = journal->j_wbuf;
+ int err = 0;

/*
* Whenever we unlock the journal and sleep, things can get added
@@ -253,6 +254,8 @@ write_out_data:
put_bh(bh);
} else {
BUFFER_TRACE(bh, "writeout complete: unfile");
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
__journal_unfile_buffer(jh);
jbd_unlock_bh_state(bh);
if (locked)
@@ -271,6 +274,8 @@ write_out_data:
}
spin_unlock(&journal->j_list_lock);
journal_do_submit_data(wbuf, bufs);
+
+ return err;
}

/*
@@ -410,8 +415,7 @@ void journal_commit_transaction(journal_
* Now start flushing things to disk, in the order they appear
* on the transaction lists. Data blocks go first.
*/
- err = 0;
- journal_submit_data_buffers(journal, commit_transaction);
+ err = journal_submit_data_buffers(journal, commit_transaction);

/*
* Wait for all previously submitted IO to complete.
@@ -426,10 +430,10 @@ void journal_commit_transaction(journal_
if (buffer_locked(bh)) {
spin_unlock(&journal->j_list_lock);
wait_on_buffer(bh);
- if (unlikely(!buffer_uptodate(bh)))
- err = -EIO;
spin_lock(&journal->j_list_lock);
}
+ if (unlikely(!buffer_uptodate(bh)))
+ err = -EIO;
if (!inverted_lock(journal, bh)) {
put_bh(bh);
spin_lock(&journal->j_list_lock);

2008-06-02 10:45:35

by Hidehiro Kawai

Subject: [PATCH 2/5] jbd: ordered data integrity fix

Subject: [PATCH 2/5] jbd: ordered data integrity fix

In ordered mode, if a buffer being dirtied exists in the committing
transaction, we write the buffer to the disk, move it from the
committing transaction to the running transaction, then dirty it.
But we must not remove the buffer from the committing
transaction when the buffer couldn't be written out; otherwise the
error would be missed and the committing transaction would not abort.

This patch adds an error check before removing the buffer from the
committing transaction.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/transaction.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/transaction.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/transaction.c
+++ linux-2.6.26-rc4/fs/jbd/transaction.c
@@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
journal_t *journal = handle->h_transaction->t_journal;
int need_brelse = 0;
struct journal_head *jh;
+ int ret = 0;

if (is_handle_aborted(handle))
- return 0;
+ return ret;

jh = journal_add_journal_head(bh);
JBUFFER_TRACE(jh, "entry");
@@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
time if it is redirtied */
}

- /* journal_clean_data_list() may have got there first */
+ /*
+ * We cannot remove the buffer with io error from the
+ * committing transaction, because otherwise it would
+ * miss the error and the commit would not abort.
+ */
+ if (unlikely(!buffer_uptodate(bh))) {
+ ret = -EIO;
+ goto no_journal;
+ }
+
if (jh->b_transaction != NULL) {
JBUFFER_TRACE(jh, "unfile from commit");
__journal_temp_unlink_buffer(jh);
@@ -1108,7 +1118,7 @@ no_journal:
}
JBUFFER_TRACE(jh, "exit");
journal_put_journal_head(jh);
- return 0;
+ return ret;
}

/**

2008-06-02 10:46:20

by Hidehiro Kawai

Subject: [PATCH 3/5] jbd: abort when failed to log metadata buffers

Subject: [PATCH 3/5] jbd: abort when failed to log metadata buffers

If we failed to write metadata buffers to the journal space and
succeeded to write the commit record, stale data can be written
back to the filesystem as metadata in the recovery phase.

To avoid this, when we failed to write out metadata buffers,
abort the journal before writing the commit record.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/commit.c | 3 +++
1 file changed, 3 insertions(+)

Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -734,6 +734,9 @@ wait_for_iobuf:
/* AKPM: bforget here */
}

+ if (err)
+ journal_abort(journal, err);
+
jbd_debug(3, "JBD: commit phase 6\n");

if (journal_write_commit_record(journal, commit_transaction))

2008-06-02 10:47:47

by Hidehiro Kawai

Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io

Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io

When a checkpointing IO fails, the current JBD code doesn't check the
error and continues journaling. This means the latest metadata can be
lost from both the journal and the filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to:

1. not remove the failed buffer from the checkpoint list in the case
of __try_to_free_cp_buf(), because it may be released or overwritten
by a later transaction
2. in log_do_checkpoint(), which is the last chance, remove the failed
buffer from the checkpoint list and abort the journal
3. when checkpointing fails, not update the journal super block, so
that the journaled contents are not cleaned; for safety, don't update
j_tail and j_tail_sequence either
4. when checkpointing fails, notify the ext3 layer of this error so
that ext3 doesn't clear the needs_recovery flag, otherwise the
journaled contents would be ignored and cleaned in the recovery phase
5. if the recovery fails, keep the needs_recovery flag
6. prevent cleanup_journal_tail() from being called between
__journal_drop_transaction() and journal_abort() (a race between
journal_flush() and __log_wait_for_space())

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/checkpoint.c | 48 +++++++++++++++++++++++++++++++-----------
fs/jbd/journal.c | 28 +++++++++++++++++++-----
fs/jbd/recovery.c | 7 ++++--
include/linux/jbd.h | 2 -
4 files changed, 64 insertions(+), 21 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc4/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
int ret = 0;
struct buffer_head *bh = jh2bh(jh);

- if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+ if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+ !buffer_dirty(bh) && buffer_uptodate(bh)) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
ret = __journal_remove_checkpoint(jh) + 1;
jbd_unlock_bh_state(bh);
@@ -160,21 +161,25 @@ static void jbd_sync_bh(journal_t *journ
* buffers. Note that we take the buffers in the opposite ordering
* from the one in which they were submitted for IO.
*
+ * Return 0 on success, and return <0 if some buffers have failed
+ * to be written out.
+ *
* Called with j_list_lock held.
*/
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
{
struct journal_head *jh;
struct buffer_head *bh;
tid_t this_tid;
int released = 0;
+ int ret = 0;

this_tid = transaction->t_tid;
restart:
/* Did somebody clean up the transaction in the meanwhile? */
if (journal->j_checkpoint_transactions != transaction ||
transaction->t_tid != this_tid)
- return;
+ return ret;
while (!released && transaction->t_checkpoint_io_list) {
jh = transaction->t_checkpoint_io_list;
bh = jh2bh(jh);
@@ -194,6 +199,9 @@ restart:
spin_lock(&journal->j_list_lock);
goto restart;
}
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+
/*
* Now in whatever state the buffer currently is, we know that
* it has been written out and so we can drop it from the list
@@ -203,6 +211,8 @@ restart:
journal_remove_journal_head(bh);
__brelse(bh);
}
+
+ return ret;
}

#define NR_BATCH 64
@@ -226,7 +236,8 @@ __flush_batch(journal_t *journal, struct
* Try to flush one buffer from the checkpoint list to disk.
*
* Return 1 if something happened which requires us to abort the current
- * scan of the checkpoint list.
+ * scan of the checkpoint list. Return <0 if the buffer has failed to
+ * be written out.
*
* Called with j_list_lock held and drops it if 1 is returned
* Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
@@ -256,6 +267,9 @@ static int __process_buffer(journal_t *j
log_wait_commit(journal, tid);
ret = 1;
} else if (!buffer_dirty(bh)) {
+ ret = 1;
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
J_ASSERT_JH(jh, !buffer_jbddirty(bh));
BUFFER_TRACE(bh, "remove from checkpoint");
__journal_remove_checkpoint(jh);
@@ -263,7 +277,6 @@ static int __process_buffer(journal_t *j
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
__brelse(bh);
- ret = 1;
} else {
/*
* Important: we are about to write the buffer, and
@@ -318,6 +331,7 @@ int log_do_checkpoint(journal_t *journal
* OK, we need to start writing disk blocks. Take one transaction
* and write it.
*/
+ result = 0;
spin_lock(&journal->j_list_lock);
if (!journal->j_checkpoint_transactions)
goto out;
@@ -334,7 +348,7 @@ restart:
int batch_count = 0;
struct buffer_head *bhs[NR_BATCH];
struct journal_head *jh;
- int retry = 0;
+ int retry = 0, err;

while (!retry && transaction->t_checkpoint_list) {
struct buffer_head *bh;
@@ -347,6 +361,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs,&batch_count);
+ if (retry < 0)
+ result = retry;
if (!retry && (need_resched() ||
spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
@@ -371,14 +387,18 @@ restart:
* Now we have cleaned up the first transaction's checkpoint
* list. Let's clean up the second one
*/
- __wait_cp_io(journal, transaction);
+ err = __wait_cp_io(journal, transaction);
+ if (!result)
+ result = err;
}
out:
spin_unlock(&journal->j_list_lock);
- result = cleanup_journal_tail(journal);
if (result < 0)
- return result;
- return 0;
+ journal_abort(journal, result);
+ else
+ result = cleanup_journal_tail(journal);
+
+ return (result < 0) ? result : 0;
}

/*
@@ -394,8 +414,9 @@ out:
* This is the only part of the journaling code which really needs to be
* aware of transaction aborts. Checkpointing involves writing to the
* main filesystem area rather than to the journal, so it can proceed
- * even in abort state, but we must not update the journal superblock if
- * we have an abort error outstanding.
+ * even in abort state, but we must not update the super block if
+ * checkpointing may have failed. Otherwise, we would lose some metadata
+ * buffers which should be written-back to the filesystem.
*/

int cleanup_journal_tail(journal_t *journal)
@@ -404,6 +425,9 @@ int cleanup_journal_tail(journal_t *jour
tid_t first_tid;
unsigned long blocknr, freed;

+ if (is_journal_aborted(journal))
+ return 1;
+
/* OK, work out the oldest transaction remaining in the log, and
* the log block it starts at.
*
Index: linux-2.6.26-rc4/fs/jbd/journal.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/journal.c
+++ linux-2.6.26-rc4/fs/jbd/journal.c
@@ -1122,9 +1122,12 @@ recovery_error:
*
* Release a journal_t structure once it is no longer in use by the
* journaled object.
+ * Return <0 if we couldn't clean up the journal.
*/
-void journal_destroy(journal_t *journal)
+int journal_destroy(journal_t *journal)
{
+ int err = 0;
+
/* Wait for the commit thread to wake up and die. */
journal_kill_thread(journal);

@@ -1147,11 +1150,16 @@ void journal_destroy(journal_t *journal)
J_ASSERT(journal->j_checkpoint_transactions == NULL);
spin_unlock(&journal->j_list_lock);

- /* We can now mark the journal as empty. */
- journal->j_tail = 0;
- journal->j_tail_sequence = ++journal->j_transaction_sequence;
if (journal->j_sb_buffer) {
- journal_update_superblock(journal, 1);
+ if (!is_journal_aborted(journal)) {
+ /* We can now mark the journal as empty. */
+ journal->j_tail = 0;
+ journal->j_tail_sequence =
+ ++journal->j_transaction_sequence;
+ journal_update_superblock(journal, 1);
+ } else {
+ err = -EIO;
+ }
brelse(journal->j_sb_buffer);
}

@@ -1161,6 +1169,8 @@ void journal_destroy(journal_t *journal)
journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
kfree(journal);
+
+ return err;
}


@@ -1360,10 +1370,16 @@ int journal_flush(journal_t *journal)
spin_lock(&journal->j_list_lock);
while (!err && journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
+ mutex_lock(&journal->j_checkpoint_mutex);
err = log_do_checkpoint(journal);
+ mutex_unlock(&journal->j_checkpoint_mutex);
spin_lock(&journal->j_list_lock);
}
spin_unlock(&journal->j_list_lock);
+
+ if (is_journal_aborted(journal))
+ return -EIO;
+
cleanup_journal_tail(journal);

/* Finally, mark the journal as really needing no recovery.
@@ -1385,7 +1401,7 @@ int journal_flush(journal_t *journal)
J_ASSERT(journal->j_head == journal->j_tail);
J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
spin_unlock(&journal->j_state_lock);
- return err;
+ return 0;
}

/**
Index: linux-2.6.26-rc4/fs/jbd/recovery.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/recovery.c
+++ linux-2.6.26-rc4/fs/jbd/recovery.c
@@ -223,7 +223,7 @@ do { \
*/
int journal_recover(journal_t *journal)
{
- int err;
+ int err, err2;
journal_superblock_t * sb;

struct recovery_info info;
@@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
journal->j_transaction_sequence = ++info.end_transaction;

journal_clear_revoke(journal);
- sync_blockdev(journal->j_fs_dev);
+ err2 = sync_blockdev(journal->j_fs_dev);
+ if (!err)
+ err = err2;
+
return err;
}

Index: linux-2.6.26-rc4/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc4.orig/include/linux/jbd.h
+++ linux-2.6.26-rc4/include/linux/jbd.h
@@ -908,7 +908,7 @@ extern int journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int journal_create (journal_t *);
extern int journal_load (journal_t *journal);
-extern void journal_destroy (journal_t *);
+extern int journal_destroy (journal_t *);
extern int journal_recover (journal_t *journal);
extern int journal_wipe (journal_t *, int);
extern int journal_skip_recovery (journal_t *);

2008-06-02 10:48:59

by Hidehiro Kawai

Subject: [PATCH 5/5] ext3: abort ext3 if the journal has aborted

Subject: [PATCH 5/5] ext3: abort ext3 if the journal has aborted

If the journal has aborted due to a checkpointing failure, we
have to keep the contents of the journal space. ext3_put_super()
detects the journal abort, then it invokes ext3_abort() to make
the filesystem read only and keep needs_recovery flag.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/ext3/ioctl.c | 12 ++++++------
fs/ext3/super.c | 11 +++++++++--
2 files changed, 15 insertions(+), 8 deletions(-)

Index: linux-2.6.26-rc4/fs/ext3/ioctl.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/ext3/ioctl.c
+++ linux-2.6.26-rc4/fs/ext3/ioctl.c
@@ -239,7 +239,7 @@ setrsvsz_out:
case EXT3_IOC_GROUP_EXTEND: {
ext3_fsblk_t n_blocks_count;
struct super_block *sb = inode->i_sb;
- int err;
+ int err, err2;

if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -254,16 +254,16 @@ setrsvsz_out:
}
err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
journal_lock_updates(EXT3_SB(sb)->s_journal);
- journal_flush(EXT3_SB(sb)->s_journal);
+ err2 = journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_extend_out:
mnt_drop_write(filp->f_path.mnt);
- return err;
+ return (err) ? err : err2;
}
case EXT3_IOC_GROUP_ADD: {
struct ext3_new_group_data input;
struct super_block *sb = inode->i_sb;
- int err;
+ int err, err2;

if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
@@ -280,11 +280,11 @@ group_extend_out:

err = ext3_group_add(sb, &input);
journal_lock_updates(EXT3_SB(sb)->s_journal);
- journal_flush(EXT3_SB(sb)->s_journal);
+ err2 = journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
group_add_out:
mnt_drop_write(filp->f_path.mnt);
- return err;
+ return (err) ? err : err2;
}


Index: linux-2.6.26-rc4/fs/ext3/super.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/ext3/super.c
+++ linux-2.6.26-rc4/fs/ext3/super.c
@@ -393,7 +393,8 @@ static void ext3_put_super (struct super
int i;

ext3_xattr_put_super(sb);
- journal_destroy(sbi->s_journal);
+ if (journal_destroy(sbi->s_journal) < 0)
+ ext3_abort(sb, __func__, "Couldn't clean up the journal");
if (!(sb->s_flags & MS_RDONLY)) {
EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
es->s_state = cpu_to_le16(sbi->s_mount_state);
@@ -2373,7 +2374,13 @@ static void ext3_write_super_lockfs(stru

/* Now we set up the journal barrier. */
journal_lock_updates(journal);
- journal_flush(journal);
+
+ /*
+ * We don't want to clear needs_recovery flag when we failed
+ * to flush the journal.
+ */
+ if (journal_flush(journal) < 0)
+ return;

/* Journal blocked and flushed, clear needs_recovery flag. */
EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);

2008-06-02 11:59:19

by Jan Kara

Subject: Re: [PATCH 2/5] jbd: ordered data integrity fix

On Mon 02-06-08 19:45:04, Hidehiro Kawai wrote:
> Subject: [PATCH 2/5] jbd: ordered data integrity fix
>
> In ordered mode, if a buffer being dirtied exists in the committing
> transaction, we write the buffer to the disk, move it from the
> committing transaction to the running transaction, then dirty it.
> But we don't have to remove the buffer from the committing
> transaction when the buffer couldn't be written out, otherwise it
> would miss the error and the committing transaction would not abort.
>
> This patch adds an error check before removing the buffer from the
> committing transaction.
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
You can add
Acked-by: Jan Kara <[email protected]>

Honza
> ---
> fs/jbd/transaction.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.26-rc4/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.26-rc4.orig/fs/jbd/transaction.c
> +++ linux-2.6.26-rc4/fs/jbd/transaction.c
> @@ -954,9 +954,10 @@ int journal_dirty_data(handle_t *handle,
> journal_t *journal = handle->h_transaction->t_journal;
> int need_brelse = 0;
> struct journal_head *jh;
> + int ret = 0;
>
> if (is_handle_aborted(handle))
> - return 0;
> + return ret;
>
> jh = journal_add_journal_head(bh);
> JBUFFER_TRACE(jh, "entry");
> @@ -1067,7 +1068,16 @@ int journal_dirty_data(handle_t *handle,
> time if it is redirtied */
> }
>
> - /* journal_clean_data_list() may have got there first */
> + /*
> + * We cannot remove the buffer with io error from the
> + * committing transaction, because otherwise it would
> + * miss the error and the commit would not abort.
> + */
> + if (unlikely(!buffer_uptodate(bh))) {
> + ret = -EIO;
> + goto no_journal;
> + }
> +
> if (jh->b_transaction != NULL) {
> JBUFFER_TRACE(jh, "unfile from commit");
> __journal_temp_unlink_buffer(jh);
> @@ -1108,7 +1118,7 @@ no_journal:
> }
> JBUFFER_TRACE(jh, "exit");
> journal_put_journal_head(jh);
> - return 0;
> + return ret;
> }
>
> /**
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-02 12:00:29

by Jan Kara

Subject: Re: [PATCH 3/5] jbd: abort when failed to log metadata buffers

On Mon 02-06-08 19:46:02, Hidehiro Kawai wrote:
> Subject: [PATCH 3/5] jbd: abort when failed to log metadata buffers
>
> If we failed to write metadata buffers to the journal space and
> succeeded to write the commit record, stale data can be written
> back to the filesystem as metadata in the recovery phase.
>
> To avoid this, when we failed to write out metadata buffers,
> abort the journal before writing the commit record.
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
Acked-by: Jan Kara <[email protected]>

Honza
> ---
> fs/jbd/commit.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux-2.6.26-rc4/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc4.orig/fs/jbd/commit.c
> +++ linux-2.6.26-rc4/fs/jbd/commit.c
> @@ -734,6 +734,9 @@ wait_for_iobuf:
> /* AKPM: bforget here */
> }
>
> + if (err)
> + journal_abort(journal, err);
> +
> jbd_debug(3, "JBD: commit phase 6\n");
>
> if (journal_write_commit_record(journal, commit_transaction))
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-02 12:05:48

by Jan Kara

Subject: Re: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2)

On Mon 02-06-08 19:40:21, Hidehiro Kawai wrote:
> Subject: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2)
Just a small note - repeating the subject in the changelog makes it a bit
harder for Andrew to merge the patches. Standard merging scripts prepend
the subject of the email to the changelog below, so in your case the message
would be there twice... Just so you know for when you create the next
patch series.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-02 12:44:33

by Jan Kara

Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

On Mon 02-06-08 19:47:25, Hidehiro Kawai wrote:
> Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
>
> When a checkpointing IO fails, current JBD code doesn't check the
> error and continue journaling. This means latest metadata can be
> lost from both the journal and filesystem.
>
> This patch leaves the failed metadata blocks in the journal space
> and aborts journaling in the case of log_do_checkpoint().
> To achieve this, we need to do:
>
> 1. don't remove the failed buffer from the checkpoint list where in
> the case of __try_to_free_cp_buf() because it may be released or
> overwritten by a later transaction
> 2. log_do_checkpoint() is the last chance, remove the failed buffer
> from the checkpoint list and abort the journal
> 3. when checkpointing fails, don't update the journal super block to
> prevent the journaled contents from being cleaned. For safety,
> don't update j_tail and j_tail_sequence either
> 4. when checkpointing fails, notify this error to the ext3 layer so
> that ext3 don't clear the needs_recovery flag, otherwise the
> journaled contents are ignored and cleaned in the recovery phase
> 5. if the recovery fails, keep the needs_recovery flag
> 6. prevent cleanup_journal_tail() from being called between
> __journal_drop_transaction() and journal_abort() (a race issue
> between journal_flush() and __log_wait_for_space()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
Just a few minor comments:

>
> Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c
> ===================================================================
> --- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c
> +++ linux-2.6.26-rc4/fs/jbd/checkpoint.c

<snip>

> @@ -318,6 +331,7 @@ int log_do_checkpoint(journal_t *journal
> * OK, we need to start writing disk blocks. Take one transaction
> * and write it.
> */
> + result = 0;
> spin_lock(&journal->j_list_lock);
> if (!journal->j_checkpoint_transactions)
> goto out;
> @@ -334,7 +348,7 @@ restart:
> int batch_count = 0;
> struct buffer_head *bhs[NR_BATCH];
> struct journal_head *jh;
> - int retry = 0;
> + int retry = 0, err;
>
> while (!retry && transaction->t_checkpoint_list) {
> struct buffer_head *bh;
> @@ -347,6 +361,8 @@ restart:
> break;
> }
> retry = __process_buffer(journal, jh, bhs,&batch_count);
> + if (retry < 0)
> + result = retry;
Here you update result whenever retry is < 0, and below only when result == 0.
I think it's better to make these two consistent (not that it would
currently make any functional difference).

> if (!retry && (need_resched() ||
> spin_needbreak(&journal->j_list_lock))) {
> spin_unlock(&journal->j_list_lock);
> @@ -371,14 +387,18 @@ restart:
> * Now we have cleaned up the first transaction's checkpoint
> * list. Let's clean up the second one
> */
> - __wait_cp_io(journal, transaction);
> + err = __wait_cp_io(journal, transaction);
> + if (!result)
> + result = err;
> }

> @@ -1360,10 +1370,16 @@ int journal_flush(journal_t *journal)
> spin_lock(&journal->j_list_lock);
> while (!err && journal->j_checkpoint_transactions != NULL) {
> spin_unlock(&journal->j_list_lock);
> + mutex_lock(&journal->j_checkpoint_mutex);
> err = log_do_checkpoint(journal);
> + mutex_unlock(&journal->j_checkpoint_mutex);
> spin_lock(&journal->j_list_lock);
> }
> spin_unlock(&journal->j_list_lock);
> +
> + if (is_journal_aborted(journal))
> + return -EIO;
> +
> cleanup_journal_tail(journal);
>
> /* Finally, mark the journal as really needing no recovery.
OK, so this way you've basically serialized all users of
log_do_checkpoint(). That should be fine because performance-wise the only
interesting path is log_wait_for_space(), and that was already serialized before. So
this change is fine with me. Only please add a comment in front of
log_do_checkpoint() that it's supposed to be called with j_checkpoint_mutex
held so that EIO propagation works correctly.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-02 12:49:55

by Jan Kara

Subject: Re: [PATCH 5/5] ext3: abort ext3 if the journal has aborted

On Mon 02-06-08 19:48:41, Hidehiro Kawai wrote:
> Subject: [PATCH 5/5] ext3: abort ext3 if the journal has aborted
>
> If the journal has aborted due to a checkpointing failure, we
> have to keep the contents of the journal space. ext3_put_super()
> detects the journal abort, then it invokes ext3_abort() to make
> the filesystem read only and keep needs_recovery flag.
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
...
> @@ -2373,7 +2374,13 @@ static void ext3_write_super_lockfs(stru
>
> /* Now we set up the journal barrier. */
> journal_lock_updates(journal);
> - journal_flush(journal);
> +
> + /*
> + * We don't want to clear needs_recovery flag when we failed
> + * to flush the journal.
> + */
> + if (journal_flush(journal) < 0)
> + return;
>
> /* Journal blocked and flushed, clear needs_recovery flag. */
> EXT3_CLEAR_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER);
It's a bit scary that LVM will start creating a filesystem snapshot when
journal_flush() has failed, but since there's no way to propagate the error
further up, I guess we cannot do anything better.
You can add:
Acked-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-03 04:30:29

by Hidehiro Kawai

Subject: Re: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2)

Hello,

Jan Kara wrote:
> On Mon 02-06-08 19:40:21, Hidehiro Kawai wrote:
>
>>Subject: [PATCH 0/5] jbd: possible filesystem corruption fixes (take 2)
>
> Just a small note - repeating the subject in the changelog makes it a bit
> harder for Andrew to merge the patches. Standard merging scripts prepend
> subject of the email to the changelog below and so in your case the message
> would be there twice... Just that you know when you create the patch series
> next time.

Thank you for letting me know. I won't do such an odd thing next time.

Regards,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-03 04:32:14

by Hidehiro Kawai

Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Hello,

Thank you for reviewing.

Jan Kara wrote:

>>@@ -347,6 +361,8 @@ restart:
>> break;
>> }
>> retry = __process_buffer(journal, jh, bhs,&batch_count);
>>+ if (retry < 0)
>>+ result = retry;
>
> Here you update result whenever retry is < 0 and below when result == 0.
> I think it's better to have these two consistent (not that it would be
> currently any functional difference).

I fixed it.

> Only please add a comment in front of
> log_do_checkpoint() that it's supposed to be called with j_checkpoint_mutex
> held so that EIO propagation works correctly.

I added a comment.

I'll send the revised patch soon.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-03 04:40:49

by Hidehiro Kawai

Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io

Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io

When a checkpointing IO fails, the current JBD code doesn't check the
error and continues journaling. This means the latest metadata can be
lost from both the journal and the filesystem.

This patch leaves the failed metadata blocks in the journal space
and aborts journaling in the case of log_do_checkpoint().
To achieve this, we need to:

1. not remove the failed buffer from the checkpoint list in the case
of __try_to_free_cp_buf(), because it may be released or overwritten
by a later transaction
2. in log_do_checkpoint(), which is the last chance, remove the failed
buffer from the checkpoint list and abort the journal
3. when checkpointing fails, not update the journal super block, so
that the journaled contents are not cleaned; for safety, don't update
j_tail and j_tail_sequence either
4. when checkpointing fails, notify the ext3 layer of this error so
that ext3 doesn't clear the needs_recovery flag, otherwise the
journaled contents would be ignored and cleaned in the recovery phase
5. if the recovery fails, keep the needs_recovery flag
6. prevent cleanup_journal_tail() from being called between
__journal_drop_transaction() and journal_abort() (a race between
journal_flush() and __log_wait_for_space())

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/checkpoint.c | 49 +++++++++++++++++++++++++++++++-----------
fs/jbd/journal.c | 28 ++++++++++++++++++------
fs/jbd/recovery.c | 7 ++++--
include/linux/jbd.h | 2 -
4 files changed, 65 insertions(+), 21 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/checkpoint.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/checkpoint.c
+++ linux-2.6.26-rc4/fs/jbd/checkpoint.c
@@ -93,7 +93,8 @@ static int __try_to_free_cp_buf(struct j
int ret = 0;
struct buffer_head *bh = jh2bh(jh);

- if (jh->b_jlist == BJ_None && !buffer_locked(bh) && !buffer_dirty(bh)) {
+ if (jh->b_jlist == BJ_None && !buffer_locked(bh) &&
+ !buffer_dirty(bh) && buffer_uptodate(bh)) {
JBUFFER_TRACE(jh, "remove from checkpoint list");
ret = __journal_remove_checkpoint(jh) + 1;
jbd_unlock_bh_state(bh);
@@ -160,21 +161,25 @@ static void jbd_sync_bh(journal_t *journ
* buffers. Note that we take the buffers in the opposite ordering
* from the one in which they were submitted for IO.
*
+ * Return 0 on success, and return <0 if some buffers have failed
+ * to be written out.
+ *
* Called with j_list_lock held.
*/
-static void __wait_cp_io(journal_t *journal, transaction_t *transaction)
+static int __wait_cp_io(journal_t *journal, transaction_t *transaction)
{
struct journal_head *jh;
struct buffer_head *bh;
tid_t this_tid;
int released = 0;
+ int ret = 0;

this_tid = transaction->t_tid;
restart:
/* Did somebody clean up the transaction in the meanwhile? */
if (journal->j_checkpoint_transactions != transaction ||
transaction->t_tid != this_tid)
- return;
+ return ret;
while (!released && transaction->t_checkpoint_io_list) {
jh = transaction->t_checkpoint_io_list;
bh = jh2bh(jh);
@@ -194,6 +199,9 @@ restart:
spin_lock(&journal->j_list_lock);
goto restart;
}
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
+
/*
* Now in whatever state the buffer currently is, we know that
* it has been written out and so we can drop it from the list
@@ -203,6 +211,8 @@ restart:
journal_remove_journal_head(bh);
__brelse(bh);
}
+
+ return ret;
}

#define NR_BATCH 64
@@ -226,7 +236,8 @@ __flush_batch(journal_t *journal, struct
* Try to flush one buffer from the checkpoint list to disk.
*
* Return 1 if something happened which requires us to abort the current
- * scan of the checkpoint list.
+ * scan of the checkpoint list. Return <0 if the buffer has failed to
+ * be written out.
*
* Called with j_list_lock held and drops it if 1 is returned
* Called under jbd_lock_bh_state(jh2bh(jh)), and drops it
@@ -256,6 +267,9 @@ static int __process_buffer(journal_t *j
log_wait_commit(journal, tid);
ret = 1;
} else if (!buffer_dirty(bh)) {
+ ret = 1;
+ if (unlikely(!buffer_uptodate(bh)))
+ ret = -EIO;
J_ASSERT_JH(jh, !buffer_jbddirty(bh));
BUFFER_TRACE(bh, "remove from checkpoint");
__journal_remove_checkpoint(jh);
@@ -263,7 +277,6 @@ static int __process_buffer(journal_t *j
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
__brelse(bh);
- ret = 1;
} else {
/*
* Important: we are about to write the buffer, and
@@ -295,6 +308,7 @@ static int __process_buffer(journal_t *j
* to disk. We submit larger chunks of data at once.
*
* The journal should be locked before calling this function.
+ * Called with j_checkpoint_mutex held.
*/
int log_do_checkpoint(journal_t *journal)
{
@@ -318,6 +332,7 @@ int log_do_checkpoint(journal_t *journal
* OK, we need to start writing disk blocks. Take one transaction
* and write it.
*/
+ result = 0;
spin_lock(&journal->j_list_lock);
if (!journal->j_checkpoint_transactions)
goto out;
@@ -334,7 +349,7 @@ restart:
int batch_count = 0;
struct buffer_head *bhs[NR_BATCH];
struct journal_head *jh;
- int retry = 0;
+ int retry = 0, err;

while (!retry && transaction->t_checkpoint_list) {
struct buffer_head *bh;
@@ -347,6 +362,8 @@ restart:
break;
}
retry = __process_buffer(journal, jh, bhs,&batch_count);
+ if (retry < 0 && !result)
+ result = retry;
if (!retry && (need_resched() ||
spin_needbreak(&journal->j_list_lock))) {
spin_unlock(&journal->j_list_lock);
@@ -371,14 +388,18 @@ restart:
* Now we have cleaned up the first transaction's checkpoint
* list. Let's clean up the second one
*/
- __wait_cp_io(journal, transaction);
+ err = __wait_cp_io(journal, transaction);
+ if (!result)
+ result = err;
}
out:
spin_unlock(&journal->j_list_lock);
- result = cleanup_journal_tail(journal);
if (result < 0)
- return result;
- return 0;
+ journal_abort(journal, result);
+ else
+ result = cleanup_journal_tail(journal);
+
+ return (result < 0) ? result : 0;
}

/*
@@ -394,8 +415,9 @@ out:
* This is the only part of the journaling code which really needs to be
* aware of transaction aborts. Checkpointing involves writing to the
* main filesystem area rather than to the journal, so it can proceed
- * even in abort state, but we must not update the journal superblock if
- * we have an abort error outstanding.
+ * even in abort state, but we must not update the super block if
+ * checkpointing may have failed. Otherwise, we would lose some metadata
+ * buffers which should be written-back to the filesystem.
*/

int cleanup_journal_tail(journal_t *journal)
@@ -404,6 +426,9 @@ int cleanup_journal_tail(journal_t *jour
tid_t first_tid;
unsigned long blocknr, freed;

+ if (is_journal_aborted(journal))
+ return 1;
+
/* OK, work out the oldest transaction remaining in the log, and
* the log block it starts at.
*
Index: linux-2.6.26-rc4/fs/jbd/journal.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/journal.c
+++ linux-2.6.26-rc4/fs/jbd/journal.c
@@ -1122,9 +1122,12 @@ recovery_error:
*
* Release a journal_t structure once it is no longer in use by the
* journaled object.
+ * Return <0 if we couldn't clean up the journal.
*/
-void journal_destroy(journal_t *journal)
+int journal_destroy(journal_t *journal)
{
+ int err = 0;
+
/* Wait for the commit thread to wake up and die. */
journal_kill_thread(journal);

@@ -1147,11 +1150,16 @@ void journal_destroy(journal_t *journal)
J_ASSERT(journal->j_checkpoint_transactions == NULL);
spin_unlock(&journal->j_list_lock);

- /* We can now mark the journal as empty. */
- journal->j_tail = 0;
- journal->j_tail_sequence = ++journal->j_transaction_sequence;
if (journal->j_sb_buffer) {
- journal_update_superblock(journal, 1);
+ if (!is_journal_aborted(journal)) {
+ /* We can now mark the journal as empty. */
+ journal->j_tail = 0;
+ journal->j_tail_sequence =
+ ++journal->j_transaction_sequence;
+ journal_update_superblock(journal, 1);
+ } else {
+ err = -EIO;
+ }
brelse(journal->j_sb_buffer);
}

@@ -1161,6 +1169,8 @@ void journal_destroy(journal_t *journal)
journal_destroy_revoke(journal);
kfree(journal->j_wbuf);
kfree(journal);
+
+ return err;
}


@@ -1360,10 +1370,16 @@ int journal_flush(journal_t *journal)
spin_lock(&journal->j_list_lock);
while (!err && journal->j_checkpoint_transactions != NULL) {
spin_unlock(&journal->j_list_lock);
+ mutex_lock(&journal->j_checkpoint_mutex);
err = log_do_checkpoint(journal);
+ mutex_unlock(&journal->j_checkpoint_mutex);
spin_lock(&journal->j_list_lock);
}
spin_unlock(&journal->j_list_lock);
+
+ if (is_journal_aborted(journal))
+ return -EIO;
+
cleanup_journal_tail(journal);

/* Finally, mark the journal as really needing no recovery.
@@ -1385,7 +1401,7 @@ int journal_flush(journal_t *journal)
J_ASSERT(journal->j_head == journal->j_tail);
J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence);
spin_unlock(&journal->j_state_lock);
- return err;
+ return 0;
}

/**
Index: linux-2.6.26-rc4/fs/jbd/recovery.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/recovery.c
+++ linux-2.6.26-rc4/fs/jbd/recovery.c
@@ -223,7 +223,7 @@ do { \
*/
int journal_recover(journal_t *journal)
{
- int err;
+ int err, err2;
journal_superblock_t * sb;

struct recovery_info info;
@@ -261,7 +261,10 @@ int journal_recover(journal_t *journal)
journal->j_transaction_sequence = ++info.end_transaction;

journal_clear_revoke(journal);
- sync_blockdev(journal->j_fs_dev);
+ err2 = sync_blockdev(journal->j_fs_dev);
+ if (!err)
+ err = err2;
+
return err;
}

Index: linux-2.6.26-rc4/include/linux/jbd.h
===================================================================
--- linux-2.6.26-rc4.orig/include/linux/jbd.h
+++ linux-2.6.26-rc4/include/linux/jbd.h
@@ -908,7 +908,7 @@ extern int journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int journal_create (journal_t *);
extern int journal_load (journal_t *journal);
-extern void journal_destroy (journal_t *);
+extern int journal_destroy (journal_t *);
extern int journal_recover (journal_t *journal);
extern int journal_wipe (journal_t *, int);
extern int journal_skip_recovery (journal_t *);

2008-06-03 05:12:00

by Hidehiro Kawai

Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Hidehiro Kawai wrote:

> Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io

Sorry, I forgot to remove the subject at the first line...

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-03 05:22:07

by Andrew Morton

Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

On Tue, 03 Jun 2008 14:11:44 +0900 Hidehiro Kawai <[email protected]> wrote:

> Hidehiro Kawai wrote:
>
> > Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
>
> Sorry, I forgot to remove the subject at the first line...

Is OK ;)

Sometimes it helps, when people use WeirdRandomStuff in their patch
titles, or when the patch comes in reply to some email thread which has
an irrelevant Subject:.

2008-06-03 08:02:53

by Jan Kara

Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

On Tue 03-06-08 13:40:25, Hidehiro Kawai wrote:
> Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
>
> When a checkpointing IO fails, current JBD code doesn't check the
> error and continue journaling. This means latest metadata can be
> lost from both the journal and filesystem.
>
> This patch leaves the failed metadata blocks in the journal space
> and aborts journaling in the case of log_do_checkpoint().
> To achieve this, we need to do:
>
> 1. don't remove the failed buffer from the checkpoint list where in
> the case of __try_to_free_cp_buf() because it may be released or
> overwritten by a later transaction
> 2. log_do_checkpoint() is the last chance, remove the failed buffer
> from the checkpoint list and abort the journal
> 3. when checkpointing fails, don't update the journal super block to
> prevent the journaled contents from being cleaned. For safety,
> don't update j_tail and j_tail_sequence either
> 4. when checkpointing fails, notify this error to the ext3 layer so
> that ext3 don't clear the needs_recovery flag, otherwise the
> journaled contents are ignored and cleaned in the recovery phase
> 5. if the recovery fails, keep the needs_recovery flag
> 6. prevent cleanup_journal_tail() from being called between
> __journal_drop_transaction() and journal_abort() (a race issue
> between journal_flush() and __log_wait_for_space()
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
You can add:
Acked-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-03 22:32:32

by Andrew Morton

Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Mon, 02 Jun 2008 19:43:57 +0900
Hidehiro Kawai <[email protected]> wrote:

>
> In ordered mode, we should abort journaling when an I/O error has
> occurred on a file data buffer in the committing transaction.

Why should we do that?

2008-06-03 22:34:49

by Andrew Morton

Subject: Re: [PATCH 2/5] jbd: ordered data integrity fix

On Mon, 02 Jun 2008 19:45:04 +0900
Hidehiro Kawai <[email protected]> wrote:

> In ordered mode, if a buffer being dirtied exists in the committing

I changed this text to read "if a file data buffer being dirtied".
Please do be careful when describing all these buffers, else it gets
more confusing than it already is.


> transaction, we write the buffer to the disk, move it from the
> committing transaction to the running transaction, then dirty it.
> But we don't have to remove the buffer from the committing
> transaction when the buffer couldn't be written out, otherwise it
> would miss the error and the committing transaction would not abort.
>
> This patch adds an error check before removing the buffer from the
> committing transaction.

2008-06-03 22:37:05

by Andrew Morton

Subject: Re: [PATCH 3/5] jbd: abort when failed to log metadata buffers

On Mon, 02 Jun 2008 19:46:02 +0900
Hidehiro Kawai <[email protected]> wrote:

> Subject: [PATCH 3/5] jbd: abort when failed to log metadata buffers
>
> If we failed to write metadata buffers to the journal space and
> succeeded to write the commit record, stale data can be written
> back to the filesystem as metadata in the recovery phase.
>
> To avoid this, when we failed to write out metadata buffers,
> abort the journal before writing the commit record.
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> ---
> fs/jbd/commit.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> Index: linux-2.6.26-rc4/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc4.orig/fs/jbd/commit.c
> +++ linux-2.6.26-rc4/fs/jbd/commit.c
> @@ -734,6 +734,9 @@ wait_for_iobuf:
> /* AKPM: bforget here */
> }
>
> + if (err)
> + journal_abort(journal, err);
> +
> jbd_debug(3, "JBD: commit phase 6\n");
>
> if (journal_write_commit_record(journal, commit_transaction))
>

I assume this has all been tested?

How are you finding these problems and testing the fixes? Fault
injection?

Does it make sense to proceed into phase 6 here, or should we bale out
of commit at this point?

2008-06-04 10:19:58

by Jan Kara

Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Tue 03-06-08 15:30:50, Andrew Morton wrote:
> On Mon, 02 Jun 2008 19:43:57 +0900
> Hidehiro Kawai <[email protected]> wrote:
>
> >
> > In ordered mode, we should abort journaling when an I/O error has
> > occurred on a file data buffer in the committing transaction.
>
> Why should we do that?
I see two reasons:
1) If fs below us is returning IO errors, we don't really know how severe
it is so it's safest to stop accepting writes. Also user notices the
problem early this way. I agree that with the growing size of disks and
thus probability of seeing IO error, we should probably think of something
cleverer than this but aborting seems better than just doing nothing.

2) If the IO error is just transient (i.e., link to NAS is disconnected for
a while), we would silently break ordering mode guarantees (user could be
able to see old / uninitialized data).

Honza

PS: Changed Andreas's address in the email to the new one...
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-04 10:54:25

by Hidehiro Kawai

Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

Hi,

Andrew Morton wrote:

> On Mon, 02 Jun 2008 19:43:57 +0900
> Hidehiro Kawai <[email protected]> wrote:
>
>>In ordered mode, we should abort journaling when an I/O error has
>>occurred on a file data buffer in the committing transaction.
>
> Why should we do that?

Aborting the journal puts the filesystem into a read-only state, and that
can prevent the inconsistency in file data from spreading any further.
I think it makes sense for ordered mode users; they want to avoid losing
file updates as much as possible, but they don't want to have to use fsync.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-04 10:55:40

by Hidehiro Kawai

Subject: Re: [PATCH 2/5] jbd: ordered data integrity fix

Hi,

Andrew Morton wrote:

> On Mon, 02 Jun 2008 19:45:04 +0900
> Hidehiro Kawai <[email protected]> wrote:
>
>>In ordered mode, if a buffer being dirtied exists in the committing
>
> I changed this text to read "if a file data buffer being dirtied".
> Please do be careful when describing all these buffers, else it gets
> more confusing that it already is.

Sorry for my misleading description. I'll be careful hereafter.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-04 10:58:16

by Hidehiro Kawai

Subject: Re: [PATCH 3/5] jbd: abort when failed to log metadata buffers

Hi,

Andrew Morton wrote:

> On Mon, 02 Jun 2008 19:46:02 +0900
> Hidehiro Kawai <[email protected]> wrote:
>
>>Subject: [PATCH 3/5] jbd: abort when failed to log metadata buffers
>>
>>If we failed to write metadata buffers to the journal space and
>>succeeded to write the commit record, stale data can be written
>>back to the filesystem as metadata in the recovery phase.
>>
>>To avoid this, when we failed to write out metadata buffers,
>>abort the journal before writing the commit record.
>>
>>Signed-off-by: Hidehiro Kawai <[email protected]>
>>---
>> fs/jbd/commit.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>>Index: linux-2.6.26-rc4/fs/jbd/commit.c
>>===================================================================
>>--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
>>+++ linux-2.6.26-rc4/fs/jbd/commit.c
>>@@ -734,6 +734,9 @@ wait_for_iobuf:
>> /* AKPM: bforget here */
>> }
>>
>>+ if (err)
>>+ journal_abort(journal, err);
>>+
>> jbd_debug(3, "JBD: commit phase 6\n");
>>
>> if (journal_write_commit_record(journal, commit_transaction))
>>
>
>
> I assume this has all been tested?

Yes, I tested all cases except for the following case (related to
PATCH 4/5):

> o journal_flush() uses j_checkpoint_mutex to avoid a race with
> __log_wait_for_space()
>
> The last item targets a newly found problem. journal_flush() can be
> called while processing __log_wait_for_space(). In this case,
> cleanup_journal_tail() can be called between
> __journal_drop_transaction() and journal_abort(), then
> the transaction with checkpointing failure is lost from the journal.
> Using j_checkpoint_mutex which is used by __log_wait_for_space(),
> we should avoid the race condition. But the test is not so sufficient
> because it is very difficult to produce this race. So I hope that
> this locking is reviewed carefully (including a possibility of
> deadlock.)

I triggered journal_flush(), __log_wait_for_space(), and a write error
simultaneously, but I couldn't confirm that the race actually
occurred.

> How are you finding these problems and testing the fixes? Fault
> injection?

I found these problems by reading the source code, and then tested the
fixes using a fault injection approach. To inject a fault, I used the
SystemTap script at the bottom of this mail.

> Does it make sense to proceed into phase 6 here, or should we bale out
> of commit at this point?

What I really want is to avoid writing the commit record when
metadata buffers couldn't be written to the journal.
A failure to write revoke records should be no problem,
because the recovery process detects the invalid control block by
its noncontiguous sequence number.
But it is nonsense to write the commit record even though we failed
to write control blocks to the journal. So I think it makes sense
to catch errors for all writes to the journal here and abort the
journal to avoid writing the commit record.

* * * * * *

The following SystemTap script was used to inject a fault.
Please don't use this script without modifying it; it is hard-coded
for my environment.


global target_inode_block = 64

/*
 * Inject a fault when a particular metadata buffer is journaled.
 */

%{
#include <linux/buffer_head.h>
#include <linux/jbd.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>

enum fi_state_bits {
	BH_Faulty = BH_Unshadow + 1,
};
%}

function fault_inject (scmd: long) %{
	struct scsi_cmnd *cmd = (void *)((unsigned long)THIS->scmd);

	/* Mangle the SCSI command so that this write request fails */
	cmd->cmnd[0] |= (7 << 5);
	cmd->cmd_len = 255;
%}

global do_fault_inject
global faulty_sector

probe module("jbd").function("journal_write_metadata_buffer") {
	if ($jh_in->b_bh->b_blocknr == target_inode_block) {
		do_fault_inject[tid()] = 1
	}
}
probe module("jbd").function("journal_write_metadata_buffer").return {
	do_fault_inject[tid()] = 0
}

probe module("jbd").function("journal_file_buffer") {
	if (do_fault_inject[tid()] && $jlist == 4 /* BJ_IO */) {
		/* Convert the block number to an absolute sector
		   (hard-coded for this particular setup) */
		faulty_sector[$jh->b_bh->b_blocknr * 8 + 63] = 1
		printf("mark faulty @ sector=%d\n",
		       $jh->b_bh->b_blocknr * 8 + 63)
	}
}

probe kernel.function("scsi_dispatch_cmd") {
	host = $cmd->device->host->host_no
	id = $cmd->device->id
	lun = $cmd->device->lun
	ch = $cmd->device->channel
	sector = $cmd->request->bio->bi_sector
	len = $cmd->transfersize / 512

	if (id != 1) {
		next
	}
	printf("%d:%d:%d:%d, #%d+%d\n", host, ch, id, lun, sector, len)
	if ($cmd->request->cmd_flags & 1 == 1 && faulty_sector[sector]) {
		delete faulty_sector[sector]
		fault_inject($cmd)
		printf("fault injected\n")
	}
}

--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-04 18:32:38

by Andrew Morton

Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Wed, 4 Jun 2008 12:19:25 +0200 Jan Kara <[email protected]> wrote:

> On Tue 03-06-08 15:30:50, Andrew Morton wrote:
> > On Mon, 02 Jun 2008 19:43:57 +0900
> > Hidehiro Kawai <[email protected]> wrote:
> >
> > >
> > > In ordered mode, we should abort journaling when an I/O error has
> > > occurred on a file data buffer in the committing transaction.
> >
> > Why should we do that?
> I see two reasons:
> 1) If fs below us is returning IO errors, we don't really know how severe
> it is so it's safest to stop accepting writes. Also user notices the
> problem early this way. I agree that with the growing size of disks and
> thus probability of seeing IO error, we should probably think of something
> cleverer than this but aborting seems better than just doing nothing.
>
> 2) If the IO error is just transient (i.e., link to NAS is disconnected for
> a while), we would silently break ordering mode guarantees (user could be
> able to see old / uninitialized data).
>

Does any other filesystem driver turn the fs read-only on the first
write-IO-error?

It seems like a big policy change to me. For a lot of applications
it's effectively a complete outage and people might get a bit upset if
this happens on the first blip from their NAS.

<waves vigorously at linux-ext4 people>

2008-06-04 21:23:06

by Theodore Ts'o

Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Wed, Jun 04, 2008 at 11:19:11AM -0700, Andrew Morton wrote:
> Does any other filesystem driver turn the fs read-only on the first
> write-IO-error?
>
> It seems like a big policy change to me. For a lot of applications
> it's effectively a complete outage and people might get a bit upset if
> this happens on the first blip from their NAS.

As I told Kawai-san when I met with him and his colleagues in Tokyo
last week, it is the responsibility of the storage stack to retry
errors as appropriate. From the filesystem perspective, a read or a
write operation succeeds, or fails. A read or write operation could
take a long time before returning, but the storage stack doesn't get
to return a "fail, but try again at some point; maybe we'll succeed
later, or if you try writing to a different block". The only sane
thing for a filesystem to do is to treat any failure as a hard
failure.

It is similarly insane to ask a filesystem to figure out that a newly
plugged in USB stick is the same one that the user had accidentally
unplugged 30 seconds ago. We don't want to put that kind of low-level
knowledge about storage details in each different filesystem.

A much better place to put that kind of smarts is in a multipath
module which sits in between the device and the filesystem. It can
retry writes from a transient failure, if a path goes down or if an
iSCSI device temporarily drops off the network. But if a filesystem
gets a write failure, it has to assume that the write failure is
permanent.

The question though is what should you do if you have a write failure
in various different parts of the disk? If you have a write failure
in a data block, you can return -EIO to the user. You could try
reallocating to find another block, and try writing to that alternate
location (although with modern filesystems that do block remapping,
this is largely pointless, since an EIO failure on write probably
means you've lost connectivity to the disk or the disk has run out of
spare blocks). But for a failure to write to a critical part of
the filesystem, like the inode table, or failure to write to the
journal, what the heck can you do? Remounting read-only is probably
the best thing you can do.

In theory, if it is a failure to write to the journal, you could fall
back to no-journaled operation, and if ext3 could support running w/o
a journal, that is possibly an option --- but again, it's very likely
that the disk is totally gone (i.e., the user pulled the USB stick
without unmounting), or the disk is out of spare blocks in its bad
block remapping pool, and the system is probably going to be in deep
trouble --- and the next failure to write some data might be critical
application data. You probably *are* better off failing the system
hard, and letting the HA system swap in the hot spare backup, if this
is some critical service.

That being said, ext3 can be tuned (and it is the default today,
although I should probably change the default to be remount-ro), so
that its behaviour on write errors is, "don't worry, be happy", and
just leave the filesystem mounted read/write. That's actually quite
dangerous for a critical production server, however.....

- Ted

2008-06-04 21:58:57

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Jun 04, 2008 11:19 -0700, Andrew Morton wrote:
> On Wed, 4 Jun 2008 12:19:25 +0200 Jan Kara <[email protected]> wrote:
>
> > On Tue 03-06-08 15:30:50, Andrew Morton wrote:
> > > On Mon, 02 Jun 2008 19:43:57 +0900
> > > Hidehiro Kawai <[email protected]> wrote:
> > >
> > > >
> > > > In ordered mode, we should abort journaling when an I/O error has
> > > > occurred on a file data buffer in the committing transaction.
> > >
> > > Why should we do that?
> > I see two reasons:
> > 1) If fs below us is returning IO errors, we don't really know how severe
> > it is so it's safest to stop accepting writes. Also user notices the
> > problem early this way. I agree that with the growing size of disks and
> > thus probability of seeing IO error, we should probably think of something
> > cleverer than this but aborting seems better than just doing nothing.
> >
> > 2) If the IO error is just transient (i.e., link to NAS is disconnected for
> > a while), we would silently break ordering mode guarantees (user could be
> > able to see old / uninitialized data).
> >
>
> Does any other filesystem driver turn the fs read-only on the first
> write-IO-error?
>
> It seems like a big policy change to me. For a lot of applications
> it's effectively a complete outage and people might get a bit upset if
> this happens on the first blip from their NAS.

I agree with Andrew. The historical policy of ext2/3/4 is that write
errors for FILE DATA propagate to the application via EIO, regardless
of whether ordered mode is active or not. If filesystem METADATA is
involved, yes this should cause an ext3_error() to be called and the
policy for what to do next is controlled by the administrator.

If the journal is aborted for a file data write error, the node maybe
panics (errors=panic) or is rebooted by the administrator, then e2fsck
is run and no problem is found (nor will one be, because e2fsck does
not check file data), and all that has been accomplished is to reboot
the node.

Applications should check the error return codes from their writes,
and call fsync on the file before closing it if they are really worried
about whether the data made it to disk safely, otherwise even a read-only
filesystem will not cause the application to notice anything.
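
For illustration, here is a minimal userspace sketch of that pattern;
the file name and record are made up, and the point is only that the
write(), fsync() and close() return values are checked so a reported
I/O error is actually seen:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	const char buf[] = "important record\n";
	int fd = open("/mnt/data/app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* A failed or short write is reported immediately. */
	if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
		perror("write");
		return 1;
	}
	/* A write error detected later (e.g. at commit time) surfaces here. */
	if (fsync(fd) < 0) {
		perror("fsync");
		return 1;
	}
	if (close(fd) < 0) {
		perror("close");
		return 1;
	}
	return 0;
}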

Now, if we have a problem where the write error from the journal checkpoint
is not being propagated back to the original buffer, that is a different
issue. For ordered-mode data I don't _think_ that the data buffers are
mirrored when a transaction is closed, so as long as the !buffer_uptodate()
error is propagated back to the caller on fsync() that is enough.

We might also consider having a separate mechanism to handle write failures,
but I don't think that that should be intermixed with the ext3 error handling
for metadata errors.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2008-06-04 21:59:56

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Wed, 4 Jun 2008 17:22:02 -0400
Theodore Tso <[email protected]> wrote:

> On Wed, Jun 04, 2008 at 11:19:11AM -0700, Andrew Morton wrote:
> > Does any other filesystem driver turn the fs read-only on the first
> > write-IO-error?

^^ this?

> > It seems like a big policy change to me. For a lot of applications
> > it's effectively a complete outage and people might get a bit upset if
> > this happens on the first blip from their NAS.
>
> As I told Kawai-san when I met with him and his colleagues in Tokyo
> last week, it is the responsibility of the storage stack to retry
> errors as appropriate. From the filesystem perspective, a read or a
> write operation succeeds, or fails. A read or write operation could
> take a long time before returning, but the storage stack doesn't get
> to return a "fail, but try again at some point; maybe we'll succeed
> later, or if you try writing to a different block". The only sane
> thing for a filesystem to do is to treat any failure as a hard
> failure.
>
> It is similarly insane to ask a filesystem to figure out that a newly
> plugged in USB stick is the same one that the user had accidentally
> unplugged 30 seconds ago. We don't want to put that kind of low-level
> knowlede about storage details in each different filesystem.
>
> A much better place to put that kind of smarts is in a multipath
> module which sits in between the device and the filesystem. It can
> retry writes from a transient failure, if a path goes down or if a
> iSCSI device temporarily drops off the network.

That's fine in theory, but we don't do any of that right now, do we?

> But if a filesystem
> gets a write failure, it has to assume that the write failure is
> permanent.

To that sector, yes. But to the entire partition?

(Well, if the entire partition became unwriteable then we don't have a
problem. It's "parts are writeable" or "it became writeable again
later" which is the problem.)


> The question though is what should you do if you have a write failure
> in various different parts of the disk? If you have a write failure
> in a data block, you can return -EIO to the user.

Absolutely.

But afaict this patch changes things so that if we get a write failure
in a data block we make the entire fs read-only. Which, as I said, is
often "dead box".

This seems like a quite major policy change to me.

> You could try
> reallocating to find another block, and try writing to that alternate
> location (although with modern filesystems that do block remapping,
> this is largely pointless, since an EIO failure on write probably
> means you've lost connectivity to the disk or the disk as run out of
> spare blocks). But for a failure to write to the a critical part of
> the filesystem, like the inode table, or failure to write to the
> journal, what the heck can you do? Remounting read-only is probably
> the best thing you can do.

Ah, yes, I agree, a write error to the journal or to metadata is a
quite different thing. An unwriteable journal block is surely
terminal. And a write error during metadata checkpointing is pretty
horrid because we've (potentially) already told userspace that the
write was successful.

But should we treat plain old data blocks in the same way?

> In theory, if it is a failure to write to the journal, you could fall
> back to no-journaled operation, and if ext3 could support running w/o
> a journal, that is possibly an option --- but again, it's very likely
> that the disk is totally gone (i.e., the user pulled the USB stick
> without unmounting), or the disk is out of spare blocks in its bad
> block remapping pool, and the system is probably going to be in deep
> trouble --- and the next failure to write some data might be critical
> application data. You probably *are* better off failing the system
> hard, and letting the HA system swap in the hot spare backup, if this
> is some critical service.
>
> That being said, ext3 can be tuned (and it is the default today,
> although I should probably change the default to be remount-ro), so
> that its behaviour on write errors is, "don't worry, be happy", and
> just leave the filesystem mounted read/write. That's actually quite
> dangerous for a critical production server, however.....
>

2008-06-04 22:52:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Wed, Jun 04, 2008 at 02:58:48PM -0700, Andrew Morton wrote:
> On Wed, 4 Jun 2008 17:22:02 -0400
> Theodore Tso <[email protected]> wrote:
>
> > On Wed, Jun 04, 2008 at 11:19:11AM -0700, Andrew Morton wrote:
> > > Does any other filesystem driver turn the fs read-only on the first
> > > write-IO-error?
>

For failures to write to data blocks, I don't think any other
filesystems turn the filesystem read-only. Not that many other
filesystems do the remount on read-only thing in general; remounting
read/only is something that might be unique to ext2/3/4.

> > It is similarly insane to ask a filesystem to figure out that a newly
> > plugged in USB stick is the same one that the user had accidentally
> > unplugged 30 seconds ago. We don't want to put that kind of low-level
> > knowlede about storage details in each different filesystem.
> >
> > A much better place to put that kind of smarts is in a multipath
> > module which sits in between the device and the filesystem. It can
> > retry writes from a transient failure, if a path goes down or if a
> > iSCSI device temporarily drops off the network.
>
> That's fine in theory, but we don't do any if that right now, do we?

No, but I think it's insane to put any of this into the filesystem. I
don't want to put words in other people's mouths, but Mingming and I
were chatting with Chris Mason the last two days at a BTRFS workshop
(which is why we've been a bit slow on responding), and when discussed
this thread informally, he agreed that it was really bad idea to try
to put this kind of retry logic in an individual filesystem.

> > But if a filesystem
> > gets a write failure, it has to assume that the write failure is
> > permanent.
>
> To that sector, yes. But to the entire partition?

I agree it's a bad idea. OTOH, we really need a good way of notifying
a system daemon or administrator, and not just rely on the application
to DTRT when it receives an -EIO. Probably what we should do is (1)
return -EIO, (2) send a uevent that includes as much information as
possible. At the minimum, the block that had the write (or read)
error, and, if available, the file name and the application and/or
pid involved. That way, the policy of what should happen in case of a
data I/O error can be informed by what write just failed. It might be
that if the failure is to some critical application data, the right
thing to do is to kill off the application server and let the HA system
bring up the hotbackup. Or if the failure is writing to a log file,
some other recovery procedure is the right thing.

But forcing the entire filesystem read-only just because of a write
failure to a data block is probably not the best way to go.
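
As a rough sketch of the uevent idea above (not something from the
posted patches; the helper name, environment keys, and the kobject
passed in are all illustrative assumptions):

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/buffer_head.h>

/* Sketch: report a failed data write to userspace.  The caller passes
 * the kobject of the block device the error occurred on. */
static void report_data_write_error(struct kobject *disk_kobj,
				    struct buffer_head *bh)
{
	char block[40];
	char *envp[] = { "FS_ERROR=data_write", block, NULL };

	snprintf(block, sizeof(block), "ERROR_BLOCK=%llu",
		 (unsigned long long)bh->b_blocknr);

	/* What to do next (alert, fail over, remount) is decided by
	 * whatever daemon listens for the event, not by the filesystem. */
	kobject_uevent_env(disk_kobj, KOBJ_CHANGE, envp);
}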

> But afaict this patch changes things so that if we get a write failure
> in a data block we make the entire fs read-only. Which, as I said, is
> often "dead box".
>
> This seems like a quite major policy change to me.

Agreed, and it's not appropriate. I could imagine that for some
setups it is the right policy, but the kernel should not be setting
policy like this. Maybe as a new tunable in the superblock, or maybe
via a round-trip to userspace via a uevent, but certainly not as the
new default behavior.

Apologies, I'm still catching up on patches sent in the past few days,
so I haven't had a chance to do a detailed review on Kawai-san's
patches yet. I do agree that if this patch is forcing the entire
filesystem read-only on a write-failure to a data block, it's probably
not appropriate.

- Ted

2008-06-05 03:28:20

by Mike Snitzer

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Wed, Jun 4, 2008 at 5:22 PM, Theodore Tso <[email protected]> wrote:

> That being said, ext3 can be tuned (and it is the default today,
> although I should probably change the default to be remount-ro), so
> that its behaviour on write errors is, "don't worry, be happy", and
> just leave the filesystem mounted read/write. That's actually quite
> dangerous for a critical production server, however.....

I'm not completely sure which "it" you mean in "it is the default
today" (seems like you mean errors=continue) but isn't remount-ro
already the ext3 default?

see commit: 1eca93f9cafdec4a332ace9b0fc0d3886d430c28

2008-06-05 09:35:51

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Wed 04-06-08 18:51:55, Theodore Tso wrote:
> On Wed, Jun 04, 2008 at 02:58:48PM -0700, Andrew Morton wrote:
> > On Wed, 4 Jun 2008 17:22:02 -0400
> > Theodore Tso <[email protected]> wrote:
> >
> > > On Wed, Jun 04, 2008 at 11:19:11AM -0700, Andrew Morton wrote:
> > > > Does any other filesystem driver turn the fs read-only on the first
> > > > write-IO-error?
> >
>
> For failures to write to data blocks, I don't think any other
> filesystems turn the filesystem read-only. Not that many other
> filesystems do the remount on read-only thing in general; remounting
> read/only is something that might be unique to ext2/3/4.
>
> > > It is similarly insane to ask a filesystem to figure out that a newly
> > > plugged in USB stick is the same one that the user had accidentally
> > > unplugged 30 seconds ago. We don't want to put that kind of low-level
> > > knowlede about storage details in each different filesystem.
> > >
> > > A much better place to put that kind of smarts is in a multipath
> > > module which sits in between the device and the filesystem. It can
> > > retry writes from a transient failure, if a path goes down or if a
> > > iSCSI device temporarily drops off the network.
> >
> > That's fine in theory, but we don't do any if that right now, do we?
>
> No, but I think it's insane to put any of this into the filesystem. I
> don't want to put words in other people's mouths, but Mingming and I
> were chatting with Chris Mason the last two days at a BTRFS workshop
> (which is why we've been a bit slow on responding), and when discussed
> this thread informally, he agreed that it was really bad idea to try
> to put this kind of retry logic in an individual filesystem.
>
> > > But if a filesystem
> > > gets a write failure, it has to assume that the write failure is
> > > permanent.
> >
> > To that sector, yes. But to the entire partition?
>
> I agree it's a bad idea. OTOH, we really need a good way of notifying
> a system daemon or administrator, and not just rely on the application
> to DTRT when it receives an -EIO. Probably what we should do is (1)
> return -EIO, (2) send a uevent that includes as much information as
> possible. At the minimum, the block that had the write (or read)
> error, and if available the file name involved, and application and/or
> pid involved. That way, the policy of what should happen in case of a
> data I/O error can be informed by what write just failed. It might be
> that if the failure is to some critical application data, the right
> thing to do is to kill of the application server and let the HA system
> bring up the hotbackup. Or if the failure is writing to a log file,
> some other recovery procedure is the right thing.
>
> But forcing the entire filesystem read-only just because of a write
> failure to a data block is probably not the best way to go.
OK, you've convinced me that aborting the journal because of EIO on a
data block isn't the best thing to do. But it is a bit problematic to
return EIO to userspace - we usually find this out in the commit code, so
we are working on behalf of kjournald and the user already thinks the
write has succeeded (which is basically a problem of any cached write).
So the best thing we can do is set AS_EIO on the mapping and hope
userspace will notice (we report such an error when the user does
fsync/sync/sync_file_range).
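
For reference, the reporting at fsync()/sync time boils down to something
like the following (a simplified sketch; the helper name is illustrative,
the real check sits in the writeback-wait path):

#include <linux/fs.h>
#include <linux/pagemap.h>

/* Simplified sketch: the wait-for-writeback path test-and-clears the
 * mapping's error bits and turns them into a return code for the caller. */
static int mapping_check_errors(struct address_space *mapping)
{
	int ret = 0;

	if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
		ret = -ENOSPC;
	if (test_and_clear_bit(AS_EIO, &mapping->flags))
		ret = -EIO;
	return ret;
}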
Sending some event to userspace is certainly possible but can hardly be
more specific than "IO error at block %lu of file %s".

> > But afaict this patch changes things so that if we get a write failure
> > in a data block we make the entire fs read-only. Which, as I said, is
> > often "dead box".
> >
> > This seems like a quite major policy change to me.
>
> Agreed, and it's not appropriate. I could imagine that for some
> setups it is the right policy, but the kernel should not be setting
> policy like this. Maybe as a new tunable in the superblock, or maybe
> via a round-trip to userspace via a uevent, but certainly not as the
> new default behavior.
Yes, I believe a tunable in the superblock controlling how we behave on
an EIO error in a data block would be the best solution.

> Apologies, I'm still catching up on patches sent in the past few days,
> so I haven't had a chance to do a detailed review on Kawai-san's
> patches yet. I do agree that if this patch is forcing the entire
> filesystem read-only on a write-failure to a data block, it's probably
> not appropriate.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-05 11:33:46

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers


Jan Kara wrote:

> On Wed 04-06-08 18:51:55, Theodore Tso wrote:
>
>>On Wed, Jun 04, 2008 at 02:58:48PM -0700, Andrew Morton wrote:
>>
>>>On Wed, 4 Jun 2008 17:22:02 -0400
>>>Theodore Tso <[email protected]> wrote:
>>>
>>>>On Wed, Jun 04, 2008 at 11:19:11AM -0700, Andrew Morton wrote:

>>>But afaict this patch changes things so that if we get a write failure
>>>in a data block we make the entire fs read-only. Which, as I said, is
>>>often "dead box".
>>>
>>>This seems like a quite major policy change to me.

My patch doesn't change the policy. JBD has aborted the journal when
it detects an I/O error in file data since 2.6.11, probably since this patch:
http://marc.info/?l=linux-kernel&m=110483888632225
I just added the missing error checks.

>>Agreed, and it's not appropriate. I could imagine that for some
>>setups it is the right policy, but the kernel should not be setting
>>policy like this. Maybe as a new tunable in the superblock, or maybe
>>via a round-trip to userspace via a uevent, but certainly not as the
>>new default behavior.
>
> Yes, I believe a tunable in superblock controlling how do we behave on
> EIO error in data block would be the best solution.

I agree. I understand that there are cases where we don't want to
make the fs read-only when writing file data fails. OTOH there are
people who want to make the fs read-only to keep the damage from
spreading. Introducing a tunable would be better.
I'm going to send a patch to make this behavior tunable if some of you
agree with this approach.

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-05 14:30:22

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Thu, Jun 05, 2008 at 08:33:27PM +0900, Hidehiro Kawai wrote:
>
> My patch doesn't change the policy. JBD aborts the journal when
> it detects I/O error in file data since 2.6.11. Perhaps this patch:
> http://marc.info/?l=linux-kernel&m=110483888632225
> I just added missing error checkings.
>

Looking at the code paths touched by the patch you referenced, you are
correct. And Andrew even signed off on it. :-)

But if someone was only examining the patch, it wasn't obvious that
the journal was getting aborted when the JBD layer was forcing buffers
from t_sync_datalist to disk. So I suspect the change went in without
proper consideration of the net effect. You just called it out
explicitly in the subject line, which caused Andrew to ask some good
questions; questions that weren't asked in 2005.

> I agree. I understood that there is a case where we don't want to
> make the fs read-only when writing file data failed. OTOH there are
> people who want to make the fs read-only to avoid the damage from
> expanding. Introducing the tunable would be better.
> I'm going to send a patch to make this behavior tunable if some of you
> agree on this way.

Note that doing this right may be tricky, since in the case where we
aren't aborting the journal, we need to set the appropriate flags in
the page cache so that when the user calls fsync() or close(), they
get the EIO error.

- Ted

2008-06-05 16:21:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Thu, 5 Jun 2008 10:29:48 -0400 Theodore Tso <[email protected]> wrote:

> On Thu, Jun 05, 2008 at 08:33:27PM +0900, Hidehiro Kawai wrote:
> >
> > My patch doesn't change the policy. JBD aborts the journal when
> > it detects I/O error in file data since 2.6.11. Perhaps this patch:
> > http://marc.info/?l=linux-kernel&m=110483888632225
> > I just added missing error checkings.
> >
>
> Looking at the code paths touched by patch you referenced, you are
> correct. And Andrew even signed off on it. :-)
>
> But if someone was only examining the patch, it wasn't obvious that
> the journal was getting aborted when the JBD layer was forcing buffers
> from t_sync_datalist to disk. So I suspect the change went in without
> proper consideration of the net effect. You just called it out
> explicitly in the subject line, which caused Andrew to ask some good
> questions; questions that weren't asked in 2005.

Sigh. An object lesson in the value of good changelogging :(

I guess we need to undo this. And yes, propagating errors into AS_EIO
is the way. I guess that's safe without holding lock_page(), as long
as the bh is pinned.

2008-06-05 18:49:59

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Jun 05, 2008 09:20 -0700, Andrew Morton wrote:
> On Thu, 5 Jun 2008 10:29:48 -0400 Theodore Tso <[email protected]> wrote:
> > On Thu, Jun 05, 2008 at 08:33:27PM +0900, Hidehiro Kawai wrote:
> > >
> > > My patch doesn't change the policy. JBD aborts the journal when
> > > it detects I/O error in file data since 2.6.11. Perhaps this patch:
> > > http://marc.info/?l=linux-kernel&m=110483888632225
> > > I just added missing error checkings.
> >
> > Looking at the code paths touched by patch you referenced, you are
> > correct. And Andrew even signed off on it. :-)
> >
> > But if someone was only examining the patch, it wasn't obvious that
> > the journal was getting aborted when the JBD layer was forcing buffers
> > from t_sync_datalist to disk. So I suspect the change went in without
> > proper consideration of the net effect. You just called it out
> > explicitly in the subject line, which caused Andrew to ask some good
> > questions; questions that weren't asked in 2005.
>
> Sigh. An object lesson in the value of good changelogging :(

... and the value of "diff -p" so it is clear what function is being changed.

> I guess we need to undo this. And yes, propagating errors into AS_EIO
> is the way. I guess that's safe without holding lock_page(), as long
> as the bh is pinned.

Something like the following instead of the -EIO and journal abort:

	if (!buffer_uptodate(bh)) {
		set_bit(AS_EIO, &bh->b_page->mapping->flags);
		SetPageError(bh->b_page);
	}

It seems end_buffer_async_write() does this already, but
journal_do_submit_data() uses end_buffer_write_sync() and it does not
do either of those operations.
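
For comparison, an I/O completion handler that does record the failure,
in the spirit of end_buffer_async_write(), might look roughly like this
(a sketch only; the function name and how it would be wired up are
assumptions):

#include <linux/buffer_head.h>
#include <linux/pagemap.h>

/* Sketch: on write completion, flag the error on the page and its mapping
 * so that a later fsync()/sync() can return -EIO. */
static void example_end_buffer_write(struct buffer_head *bh, int uptodate)
{
	if (uptodate) {
		set_buffer_uptodate(bh);
	} else {
		clear_buffer_uptodate(bh);
		if (bh->b_page->mapping)
			set_bit(AS_EIO, &bh->b_page->mapping->flags);
		SetPageError(bh->b_page);
	}
	unlock_buffer(bh);
	put_bh(bh);
}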

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2008-06-09 10:09:47

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

Andreas Dilger wrote:
> On Jun 05, 2008 09:20 -0700, Andrew Morton wrote:

>>I guess we need to undo this. And yes, propagating errors into AS_EIO
>>is the way. I guess that's safe without holding lock_page(), as long
>>as the bh is pinned.
>
> Something like the following instead if -EIO and journal abort:
>
> if (!buffer_uptodate(bh)) {
> set_bit(AS_EIO, &bh->b_page->mapping->flags);
> SetPageError(bh->b_page);
> }
>
> It seems end_buffer_async_write() does this already, but
> journal_do_submit_data() uses end_buffer_write_sync() and it does not
> do either of those operations.

Thank you for your suggestion. I wrote an additional patch to do
that below. Please apply it as the 6th patch of this patch series.

BTW, I'm developing a patch which makes "abort the journal if a file
data buffer has an error" tunable. I'll send it in another thread
because it's not a bug fix patch.

Regards,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center


Subject: JBD: don't abort if flushing file data failed

In ordered mode, it is not appropriate to abort the journal when we
fail to write file data. This patch calls printk() instead of
aborting the journal. Additionally, it sets AS_EIO on the
address_space of the buffer which was written out by
journal_do_submit_data() and failed, so that fsync() can return -EIO.

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/commit.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc4/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc4/fs/jbd/commit.c
@@ -432,8 +432,11 @@ void journal_commit_transaction(journal_
 			wait_on_buffer(bh);
 			spin_lock(&journal->j_list_lock);
 		}
-		if (unlikely(!buffer_uptodate(bh)))
+		if (unlikely(!buffer_uptodate(bh))) {
+			set_bit(AS_EIO, &bh->b_page->mapping->flags);
+			SetPageError(bh->b_page);
 			err = -EIO;
+		}
 		if (!inverted_lock(journal, bh)) {
 			put_bh(bh);
 			spin_lock(&journal->j_list_lock);
@@ -452,8 +455,11 @@ void journal_commit_transaction(journal_
 	}
 	spin_unlock(&journal->j_list_lock);
 
-	if (err)
-		journal_abort(journal, err);
+	if (err) {
+		printk(KERN_WARNING
+			"JBD: Detected IO errors during flushing file data\n");
+		err = 0;
+	}
 
 	journal_write_revoke_records(journal, commit_transaction);



2008-06-11 12:36:12

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

On Mon 09-06-08 19:09:25, Hidehiro Kawai wrote:
> Andreas Dilger wrote:
> > On Jun 05, 2008 09:20 -0700, Andrew Morton wrote:
>
> >>I guess we need to undo this. And yes, propagating errors into AS_EIO
> >>is the way. I guess that's safe without holding lock_page(), as long
> >>as the bh is pinned.
> >
> > Something like the following instead if -EIO and journal abort:
> >
> > if (!buffer_uptodate(bh)) {
> > set_bit(AS_EIO, &bh->b_page->mapping->flags);
> > SetPageError(bh->b_page);
> > }
> >
> > It seems end_buffer_async_write() does this already, but
> > journal_do_submit_data() uses end_buffer_write_sync() and it does not
> > do either of those operations.
>
> Thank you for your suggestion. I wrote an additional patch to do
> that below. Please apply it as the 6th patch of this patch series.
>
> BTW, I'm developing a patch which makes "abort the journal if a file
> data buffer has an error" tunable. I'll send it in another thread
> because it's not a bug fix patch.
>
> --
> Hidehiro Kawai
> Hitachi, Systems Development Laboratory
> Linux Technology Center
>
>
> Subject: JBD: don't abort if flushing file data failed
>
> In ordered mode, it is not appropriate behavior to abort the journal
> when we failed to write file data. This patch calls printk()
> instead of aborting the journal. Additionally, set AS_EIO into
> the address_space object of the buffer which is written out by
> journal_do_submit_data() and failed so that fsync() can get -EIO.
>
> Signed-off-by: Hidehiro Kawai <[email protected]>
> ---
> fs/jbd/commit.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> Index: linux-2.6.26-rc4/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.26-rc4.orig/fs/jbd/commit.c
> +++ linux-2.6.26-rc4/fs/jbd/commit.c
> @@ -432,8 +432,11 @@ void journal_commit_transaction(journal_
> wait_on_buffer(bh);
> spin_lock(&journal->j_list_lock);
> }
> - if (unlikely(!buffer_uptodate(bh)))
> + if (unlikely(!buffer_uptodate(bh))) {
> + set_bit(AS_EIO, &bh->b_page->mapping->flags);
> + SetPageError(bh->b_page);
> err = -EIO;
> + }
Actually, you should be more careful here because if the data buffer has
been truncated in the currently running transaction, it can happen that
b_page->mapping is NULL. It is a question how to safely access
page->mapping - probably you'll need page lock for that...

> if (!inverted_lock(journal, bh)) {
> put_bh(bh);
> spin_lock(&journal->j_list_lock);
> @@ -452,8 +455,11 @@ void journal_commit_transaction(journal_
> }
> spin_unlock(&journal->j_list_lock);
>
> - if (err)
> - journal_abort(journal, err);
> + if (err) {
> + printk(KERN_WARNING
> + "JBD: Detected IO errors during flushing file data\n");
> + err = 0;
> + }
>
> journal_write_revoke_records(journal, commit_transaction);

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-12 13:20:14

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: strictly check for write errors on data buffers

Hi,

Jan Kara wrote:

> On Mon 09-06-08 19:09:25, Hidehiro Kawai wrote:

>>Index: linux-2.6.26-rc4/fs/jbd/commit.c
>>===================================================================
>>--- linux-2.6.26-rc4.orig/fs/jbd/commit.c
>>+++ linux-2.6.26-rc4/fs/jbd/commit.c
>>@@ -432,8 +432,11 @@ void journal_commit_transaction(journal_
>> wait_on_buffer(bh);
>> spin_lock(&journal->j_list_lock);
>> }
>>- if (unlikely(!buffer_uptodate(bh)))
>>+ if (unlikely(!buffer_uptodate(bh))) {
>>+ set_bit(AS_EIO, &bh->b_page->mapping->flags);
>>+ SetPageError(bh->b_page);
>> err = -EIO;
>>+ }
>
> Actually, you should be more careful here because if the data buffer has
> been truncated in the currently running transaction, it can happen that
> b_page->mapping is NULL. It is a question how to safely access
> page->mapping - probably you'll need page lock for that...

Thank you for pointing out this problem. I confirmed that
b_page->mapping can be NULL. I'm checking that the page-locking
approach works out well.
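
A minimal sketch of what the page-locking approach could look like (an
assumption, not the final patch; in journal_commit_transaction() the
j_list_lock spinlock would have to be dropped around such a call, just
as it already is around wait_on_buffer()):

#include <linux/buffer_head.h>
#include <linux/pagemap.h>

/* Sketch only: record a write error on the buffer's page and mapping,
 * taking the page lock so a concurrent truncate clearing ->mapping is
 * handled.  Must not be called under a spinlock; lock_page() can sleep. */
static void buffer_note_write_error(struct buffer_head *bh)
{
	struct page *page = bh->b_page;

	lock_page(page);
	if (page->mapping)
		set_bit(AS_EIO, &page->mapping->flags);
	SetPageError(page);
	unlock_page(page);
}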

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-23 11:15:20

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Hi,

I noticed a problem with this patch. Please see below.

Jan Kara wrote:

> On Tue 03-06-08 13:40:25, Hidehiro Kawai wrote:
>
>>Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
>>
>>When a checkpointing IO fails, current JBD code doesn't check the
>>error and continue journaling. This means latest metadata can be
>>lost from both the journal and filesystem.
>>
>>This patch leaves the failed metadata blocks in the journal space
>>and aborts journaling in the case of log_do_checkpoint().
>>To achieve this, we need to do:
>>
>>1. don't remove the failed buffer from the checkpoint list where in
>> the case of __try_to_free_cp_buf() because it may be released or
>> overwritten by a later transaction
>>2. log_do_checkpoint() is the last chance, remove the failed buffer
>> from the checkpoint list and abort the journal
>>3. when checkpointing fails, don't update the journal super block to
>> prevent the journaled contents from being cleaned. For safety,
>> don't update j_tail and j_tail_sequence either

3. is implemented as described below.
(1) if log_do_checkpoint() detects an I/O error during
checkpointing, it calls journal_abort() to abort the journal
(2) if the journal has aborted, don't update s_start and s_sequence
in the on-disk journal superblock

So, if the journal aborts, journaled data will be replayed on the
next mount.

Now, please remember that some dirty metadata buffers are written
back to the filesystem without journaling if the journal has aborted.
If all dirty metadata buffers are written to the disk, the integrity
of the filesystem will be kept and we are happy. However, replaying
the journaled data can partly overwrite the latest on-disk metadata
blocks with old data. That would break the filesystem.

My idea to resolve this problem is not to write out metadata
buffers which belong to uncommitted transactions if the journal has
aborted. Although the latest filesystem updates will be lost,
we can ensure integrity. It will also help in the case of a
kernel panic in the middle of writing metadata buffers without
journaling (which can occur in the `mount -o errors=panic' case.)

Which should we choose: integrity, or the latest state?

Signed-off-by: Hidehiro Kawai <[email protected]>
---
fs/jbd/commit.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc5-mm3/fs/jbd/commit.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/fs/jbd/commit.c
+++ linux-2.6.26-rc5-mm3/fs/jbd/commit.c
@@ -486,9 +486,10 @@ void journal_commit_transaction(journal_
 		jh = commit_transaction->t_buffers;
 
 		/* If we're in abort mode, we just un-journal the buffer and
-		   release it for background writing. */
+		   release it. */
 
 		if (is_journal_aborted(journal)) {
+			clear_buffer_jbddirty(jh2bh(jh));
 			JBUFFER_TRACE(jh, "journal is aborting: refile");
 			journal_refile_buffer(journal, jh);
 			/* If that was the last one, we need to clean up
@@ -823,6 +824,8 @@ restart_loop:
 		if (buffer_jbddirty(bh)) {
 			JBUFFER_TRACE(jh, "add to new checkpointing trans");
 			__journal_insert_checkpoint(jh, commit_transaction);
+			if (is_journal_aborted(journal))
+				clear_buffer_jbddirty(bh);
 			JBUFFER_TRACE(jh, "refile for checkpoint writeback");
 			__journal_refile_buffer(jh);
 			jbd_unlock_bh_state(bh);

2008-06-23 12:22:50

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

On Mon 23-06-08 20:14:54, Hidehiro Kawai wrote:
> Hi,
>
> I noticed a problem of this patch. Please see below.
>
> Jan Kara wrote:
>
> > On Tue 03-06-08 13:40:25, Hidehiro Kawai wrote:
> >
> >>Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
> >>
> >>When a checkpointing IO fails, current JBD code doesn't check the
> >>error and continue journaling. This means latest metadata can be
> >>lost from both the journal and filesystem.
> >>
> >>This patch leaves the failed metadata blocks in the journal space
> >>and aborts journaling in the case of log_do_checkpoint().
> >>To achieve this, we need to do:
> >>
> >>1. don't remove the failed buffer from the checkpoint list where in
> >> the case of __try_to_free_cp_buf() because it may be released or
> >> overwritten by a later transaction
> >>2. log_do_checkpoint() is the last chance, remove the failed buffer
> >> from the checkpoint list and abort the journal
> >>3. when checkpointing fails, don't update the journal super block to
> >> prevent the journaled contents from being cleaned. For safety,
> >> don't update j_tail and j_tail_sequence either
>
> 3. is implemented as described below.
> (1) if log_do_checkpoint() detects an I/O error during
> checkpointing, it calls journal_abort() to abort the journal
> (2) if the journal has aborted, don't update s_start and s_sequence
> in the on-disk journal superblock
>
> So, if the journal aborts, journaled data will be replayed on the
> next mount.
>
> Now, please remember that some dirty metadata buffers are written
> back to the filesystem without journaling if the journal aborted.
> We are happy if all dirty metadata buffers are written to the disk,
> the integrity of the filesystem will be kept. However, replaying
> the journaled data can overwrite the latest on-disk metadata blocks
> partly with old data. It would break the filesystem.
Yes, it would. But how do you think it can happen that a metadata buffer
will be written back to the filesystem while it is part of a running
transaction? Note that the checkpointing code specifically checks whether the
buffer being written back is part of a running transaction and, if so,
waits for commit before writing back the buffer. So I don't think this can
happen, but maybe I'm missing something...

> My idea to resolve this problem is that we don't write out metadata
> buffers which belong to uncommitted transactions if journal has
> aborted. Although the latest filesystem updates will be lost,
> we can ensure the integrity. It will also be effective for the
> kernel panic in the middle of writing metadata buffers without
> journaling (this would occur in the `mount -o errors=panic' case.)
>
> Which integrity or latest state should we choose?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-24 11:53:29

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Jan Kara wrote:

> On Mon 23-06-08 20:14:54, Hidehiro Kawai wrote:
>
>>Hi,
>>
>>I noticed a problem of this patch. Please see below.
>>
>>Jan Kara wrote:
>>
>>
>>>On Tue 03-06-08 13:40:25, Hidehiro Kawai wrote:
>>>
>>>
>>>>Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
>>>>
>>>>When a checkpointing IO fails, current JBD code doesn't check the
>>>>error and continue journaling. This means latest metadata can be
>>>>lost from both the journal and filesystem.
>>>>
>>>>This patch leaves the failed metadata blocks in the journal space
>>>>and aborts journaling in the case of log_do_checkpoint().
>>>>To achieve this, we need to do:
>>>>
>>>>1. don't remove the failed buffer from the checkpoint list where in
>>>> the case of __try_to_free_cp_buf() because it may be released or
>>>> overwritten by a later transaction
>>>>2. log_do_checkpoint() is the last chance, remove the failed buffer
>>>> from the checkpoint list and abort the journal
>>>>3. when checkpointing fails, don't update the journal super block to
>>>> prevent the journaled contents from being cleaned. For safety,
>>>> don't update j_tail and j_tail_sequence either
>>
>>3. is implemented as described below.
>> (1) if log_do_checkpoint() detects an I/O error during
>> checkpointing, it calls journal_abort() to abort the journal
>> (2) if the journal has aborted, don't update s_start and s_sequence
>> in the on-disk journal superblock
>>
>>So, if the journal aborts, journaled data will be replayed on the
>>next mount.
>>
>>Now, please remember that some dirty metadata buffers are written
>>back to the filesystem without journaling if the journal aborted.
>>We are happy if all dirty metadata buffers are written to the disk,
>>the integrity of the filesystem will be kept. However, replaying
>>the journaled data can overwrite the latest on-disk metadata blocks
>>partly with old data. It would break the filesystem.
>
> Yes, it would. But how do you think it can happen that a metadata buffer
> will be written back to the filesystem when it is a part of running
> transaction? Note that checkpointing code specifically checks whether the
> buffer being written back is part of a running transaction and if so, it
> waits for commit before writing back the buffer. So I don't think this can
> happen but maybe I miss something...

The checkpointing code checks it and may call log_wait_commit(), but this
problem is caused by transactions which have not started checkpointing.

For example, suppose the tail transaction has an old update for block_B and
the running transaction has a new update for block_B. Then the
committing transaction fails to write the commit record, the journal is
aborted, and the new block_B is written back to the filesystem without
journaling. Because this patch doesn't distinguish between a normal abort
and a checkpointing-related abort, the tail transaction is left in the
journal space. So by replaying the tail transaction, the new block_B is
overwritten with the old one.

It can also happen in the case of a checkpointing-related abort.
For example, assume the tail transaction has an update for block_A,
the next transaction has an old update for block_B, and the running
transaction has a new update for block_B.
Now the running transaction needs more log space and calls
log_do_checkpoint(), but that aborts the journal because it detects a
write error on block_A. In this case, the new block_B will be
overwritten when the old block_B in the second transaction from the tail
is replayed.

Does this answer your question?

>>My idea to resolve this problem is that we don't write out metadata
>>buffers which belong to uncommitted transactions if journal has
>>aborted. Although the latest filesystem updates will be lost,
>>we can ensure the integrity. It will also be effective for the
>>kernel panic in the middle of writing metadata buffers without
>>journaling (this would occur in the `mount -o errors=panic' case.)
>>
>>Which integrity or latest state should we choose?

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-24 13:33:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

On Tue 24-06-08 20:52:59, Hidehiro Kawai wrote:
> >>>>Subject: [PATCH 4/5] jbd: fix error handling for checkpoint io
> >>>>
> >>>>When a checkpointing IO fails, current JBD code doesn't check the
> >>>>error and continue journaling. This means latest metadata can be
> >>>>lost from both the journal and filesystem.
> >>>>
> >>>>This patch leaves the failed metadata blocks in the journal space
> >>>>and aborts journaling in the case of log_do_checkpoint().
> >>>>To achieve this, we need to do:
> >>>>
> >>>>1. don't remove the failed buffer from the checkpoint list where in
> >>>> the case of __try_to_free_cp_buf() because it may be released or
> >>>> overwritten by a later transaction
> >>>>2. log_do_checkpoint() is the last chance, remove the failed buffer
> >>>> from the checkpoint list and abort the journal
> >>>>3. when checkpointing fails, don't update the journal super block to
> >>>> prevent the journaled contents from being cleaned. For safety,
> >>>> don't update j_tail and j_tail_sequence either
> >>
> >>3. is implemented as described below.
> >> (1) if log_do_checkpoint() detects an I/O error during
> >> checkpointing, it calls journal_abort() to abort the journal
> >> (2) if the journal has aborted, don't update s_start and s_sequence
> >> in the on-disk journal superblock
> >>
> >>So, if the journal aborts, journaled data will be replayed on the
> >>next mount.
> >>
> >>Now, please remember that some dirty metadata buffers are written
> >>back to the filesystem without journaling if the journal aborted.
> >>We are happy if all dirty metadata buffers are written to the disk,
> >>the integrity of the filesystem will be kept. However, replaying
> >>the journaled data can overwrite the latest on-disk metadata blocks
> >>partly with old data. It would break the filesystem.
> >
> > Yes, it would. But how do you think it can happen that a metadata buffer
> > will be written back to the filesystem when it is a part of running
> > transaction? Note that checkpointing code specifically checks whether the
> > buffer being written back is part of a running transaction and if so, it
> > waits for commit before writing back the buffer. So I don't think this can
> > happen but maybe I miss something...
>
> Checkpointing code checks it and may call log_wait_commit(), but this
> problem is caused by transactions which have not started checkpointing.
>
> For example, the tail transaction has an old update for block_B and
> the running transaction has a new update for block_B. Then, the
> committing transaction fails to write the commit record, it aborts the
> journal, and new block_B will be written back to the file system without
> journaling. Because this patch doesn't separate between normal abort
> and checkpointing related abort, the tail transaction is left in the
> journal space. So by replaying the tail transaction, new block_B is
> overwritten with old one.
Yes, and this is expected and correct. When we cannot properly finish a
transaction, we have to discard everything in it. A bug would be (and I
think it could currently happen) if we had already checkpointed the previous
transaction and then wrote over block_B new data from the uncommitted
transaction. I think we have to avoid that - i.e., in case we abort the
journal we should not mark buffers dirty when processing the forget loop.
But this is not too serious since fsck has to be run anyway and it will
fix the problems.

> It can happen in the case of the checkpointing related abort.
> For example, assuming the tail transaction has an update for block_A,
> the next transaction has an old update for block_B, and the running
> transaction has a new update for block_B.
> Now, the running transaction needs more log space, and it calls
> log_do_checkpoint(). But it aborts the journal because it detected
> write error on block_A. In this case, new block_B will be
> overwritten when the old block_B in the second transaction to the tail
> is replayed.
Well, the scenario has to be a bit different (if we need more space than
there is in the journal, we commit the running transaction, do checkpoint
and start a new transaction) but something like what you describe could
happen. But again I think that this is a correct behavior - i.e., discard
all the data in the running transaction when the journal is aborted before
the transaction is properly committed.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-27 08:08:52

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Hi Jan,
Thank you for your reply.

Jan Kara wrote:

> On Tue 24-06-08 20:52:59, Hidehiro Kawai wrote:

>>>>3. is implemented as described below.
>>>> (1) if log_do_checkpoint() detects an I/O error during
>>>> checkpointing, it calls journal_abort() to abort the journal
>>>> (2) if the journal has aborted, don't update s_start and s_sequence
>>>> in the on-disk journal superblock
>>>>
>>>>So, if the journal aborts, journaled data will be replayed on the
>>>>next mount.
>>>>
>>>>Now, please remember that some dirty metadata buffers are written
>>>>back to the filesystem without journaling if the journal aborted.
>>>>We are happy if all dirty metadata buffers are written to the disk,
>>>>the integrity of the filesystem will be kept. However, replaying
>>>>the journaled data can overwrite the latest on-disk metadata blocks
>>>>partly with old data. It would break the filesystem.
>>>
>>> Yes, it would. But how do you think it can happen that a metadata buffer
>>>will be written back to the filesystem when it is a part of running
>>>transaction? Note that checkpointing code specifically checks whether the
>>>buffer being written back is part of a running transaction and if so, it
>>>waits for commit before writing back the buffer. So I don't think this can
>>>happen but maybe I miss something...
>>
>>Checkpointing code checks it and may call log_wait_commit(), but this
>>problem is caused by transactions which have not started checkpointing.
>>
>>For example, the tail transaction has an old update for block_B and
>>the running transaction has a new update for block_B. Then, the
>>committing transaction fails to write the commit record, it aborts the
>>journal, and new block_B will be written back to the file system without
>>journaling. Because this patch doesn't separate between normal abort
>>and checkpointing related abort, the tail transaction is left in the
>>journal space. So by replaying the tail transaction, new block_B is
>>overwritten with old one.
>
> Yes, and this is expected an correct. When we cannot properly finish a
> transaction, we have to discard everything in it. A bug would be (and I
> think it could currently happen) if we already checkpointed the previous
> transaction and then written over block_B new data from the uncommitted
> transaction. I think we have to avoid that - i.e., in case we abort the
> journal we should not mark buffers dirty when processing the forget loop.

Yes.

> But this is not too serious since fsck has to be run anyway and it will
> fix the problems.

Yes. The filesystem should be marked with an error, so fsck will check
and recover the filesystem on boot. But this means the filesystem loses
some of the latest updates even if it was cleanly unmounted (although some
file data has been lost anyway.) I'm a bit afraid that some people would
see this as a regression caused by this PATCH 4/5. At least, to avoid an
undesirable replay, we had better keep the journaled data only when the
journal has been aborted for a checkpointing-related reason.

>>It can happen in the case of the checkpointing related abort.
>>For example, assuming the tail transaction has an update for block_A,
>>the next transaction has an old update for block_B, and the running
>>transaction has a new update for block_B.
>>Now, the running transaction needs more log space, and it calls
>>log_do_checkpoint(). But it aborts the journal because it detected
>>write error on block_A. In this case, new block_B will be
>>overwritten when the old block_B in the second transaction to the tail
>>is replayed.
>
> Well, the scenario has to be a bit different (if we need more space than
> there is in the journal, we commit the running transaction, do checkpoint
> and start a new transaction) but something like what you describe could
> happen. But again I think that this is a correct behavior - i.e., discard
> all the data in the running transaction when the journal is aborted before
> the transaction is properly committed.

I think it is correct behavior to discard _all_ metadata updates
in the running transaction on abort. But we don't want metadata
updates to be discarded more often than necessary, do we?

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-06-27 10:24:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Hi,

On Fri 27-06-08 17:06:56, Hidehiro Kawai wrote:
> Jan Kara wrote:
>
> > On Tue 24-06-08 20:52:59, Hidehiro Kawai wrote:
>
> >>>>3. is implemented as described below.
> >>>> (1) if log_do_checkpoint() detects an I/O error during
> >>>> checkpointing, it calls journal_abort() to abort the journal
> >>>> (2) if the journal has aborted, don't update s_start and s_sequence
> >>>> in the on-disk journal superblock
> >>>>
> >>>>So, if the journal aborts, journaled data will be replayed on the
> >>>>next mount.
> >>>>
> >>>>Now, please remember that some dirty metadata buffers are written
> >>>>back to the filesystem without journaling if the journal aborted.
> >>>>We are happy if all dirty metadata buffers are written to the disk,
> >>>>the integrity of the filesystem will be kept. However, replaying
> >>>>the journaled data can overwrite the latest on-disk metadata blocks
> >>>>partly with old data. It would break the filesystem.
> >>>
> >>> Yes, it would. But how do you think it can happen that a metadata buffer
> >>>will be written back to the filesystem when it is a part of running
> >>>transaction? Note that checkpointing code specifically checks whether the
> >>>buffer being written back is part of a running transaction and if so, it
> >>>waits for commit before writing back the buffer. So I don't think this can
> >>>happen but maybe I miss something...
> >>
> >>Checkpointing code checks it and may call log_wait_commit(), but this
> >>problem is caused by transactions which have not started checkpointing.
> >>
> >>For example, the tail transaction has an old update for block_B and
> >>the running transaction has a new update for block_B. Then, the
> >>committing transaction fails to write the commit record, it aborts the
> >>journal, and new block_B will be written back to the file system without
> >>journaling. Because this patch doesn't separate between normal abort
> >>and checkpointing related abort, the tail transaction is left in the
> >>journal space. So by replaying the tail transaction, new block_B is
> >>overwritten with old one.
> >
> > Yes, and this is expected an correct. When we cannot properly finish a
> > transaction, we have to discard everything in it. A bug would be (and I
> > think it could currently happen) if we already checkpointed the previous
> > transaction and then written over block_B new data from the uncommitted
> > transaction. I think we have to avoid that - i.e., in case we abort the
> > journal we should not mark buffers dirty when processing the forget loop.
>
> Yes.
>
> > But this is not too serious since fsck has to be run anyway and it will
> > fix the problems.
>
> Yes. The filesystem should be marked with an error, so fsck will check
> and recover the filesystem on boot. But this means the filesystem loses
> some latest updates even if it was cleanly unmounted (although some file
> data has been lost.) I'm a bit afraid that some people would think of
> this as a regression due to this PATCH 4/5. At least, to avoid
> undesirable replay, we had better keep journaled data only when the
> journal has been aborted for checkpointing related reason.
I don't think this makes any difference. Look: we have transaction A,
modifying block B, fully committed to the journal. Now there is a running
(or committing, it does not really matter) transaction R also modifying block
B. Until R gets fully committed, no block modified by R is checkpointed
to the device - the checkpointing code takes care of that, and it must be so
to satisfy the journaling guarantees.
So if we abort the journal (for whatever reason) before R is fully committed,
no change in R will be seen on the filesystem regardless of whether you
clean up the journal or not.
And you cannot do much differently - the principle of journaling is that
either all changes in the transaction get to disk or none of them. So until
the transaction is properly committed, you have only one way to satisfy
this condition - take the "none of them" choice.
Now fsck could of course be clever and try to use updates in a not fully
committed transaction, but personally I don't think it's worth the effort.
Please correct me if I have just misunderstood your point...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-06-30 05:09:45

by Hidehiro Kawai

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Hi,

Jan Kara wrote:

> On Fri 27-06-08 17:06:56, Hidehiro Kawai wrote:
>
>>Jan Kara wrote:
>>
>>
>>>On Tue 24-06-08 20:52:59, Hidehiro Kawai wrote:
>>
>>>>>>3. is implemented as described below.
>>>>>>(1) if log_do_checkpoint() detects an I/O error during
>>>>>> checkpointing, it calls journal_abort() to abort the journal
>>>>>>(2) if the journal has aborted, don't update s_start and s_sequence
>>>>>> in the on-disk journal superblock
>>>>>>
>>>>>>So, if the journal aborts, journaled data will be replayed on the
>>>>>>next mount.
>>>>>>
>>>>>>Now, please remember that some dirty metadata buffers are written
>>>>>>back to the filesystem without journaling if the journal aborted.
>>>>>>We are happy if all dirty metadata buffers are written to the disk,
>>>>>>the integrity of the filesystem will be kept. However, replaying
>>>>>>the journaled data can overwrite the latest on-disk metadata blocks
>>>>>>partly with old data. It would break the filesystem.
>>>>>
>>>>> Yes, it would. But how do you think it can happen that a metadata buffer
>>>>>will be written back to the filesystem when it is a part of running
>>>>>transaction? Note that checkpointing code specifically checks whether the
>>>>>buffer being written back is part of a running transaction and if so, it
>>>>>waits for commit before writing back the buffer. So I don't think this can
>>>>>happen but maybe I miss something...
>>>>
>>>>Checkpointing code checks it and may call log_wait_commit(), but this
>>>>problem is caused by transactions which have not started checkpointing.
>>>>
>>>>For example, the tail transaction has an old update for block_B and
>>>>the running transaction has a new update for block_B. Then, the
>>>>committing transaction fails to write the commit record, it aborts the
>>>>journal, and new block_B will be written back to the file system without
>>>>journaling. Because this patch doesn't distinguish between a normal abort
>>>>and a checkpointing-related abort, the tail transaction is left in the
>>>>journal space. So by replaying the tail transaction, new block_B is
>>>>overwritten with the old one.
>>>
>>> Yes, and this is expected and correct. When we cannot properly finish a
>>>transaction, we have to discard everything in it. A bug would be (and I
>>>think it could currently happen) if we already checkpointed the previous
>>>transaction and then wrote new data from the uncommitted transaction over
>>>block_B. I think we have to avoid that - i.e., in case we abort the
>>>journal we should not mark buffers dirty when processing the forget loop.
>>
>>Yes.
>>
>>
>>>But this is not too serious since fsck has to be run anyway and it will
>>>fix the problems.
>>
>>Yes. The filesystem should be marked with an error, so fsck will check
>>and recover the filesystem on boot. But this means the filesystem loses
>>some of the latest updates even if it was cleanly unmounted (although
>>some file data has been lost anyway). I'm a bit afraid that some people
>>would think of this as a regression due to this PATCH 4/5. At least, to
>>avoid undesirable replay, we had better keep journaled data only when the
>>journal has been aborted for a checkpointing-related reason.
>
> I don't think this makes any difference. Look: We have transaction A
> modifying block B fully committed to the journal. Now there is a running
> (or committing, it does not really matter) transaction R also modifying block
> B. Until R gets fully committed, no block modified by R is checkpointed
> to the device - checkpointing code takes care of that and it must be so
> to satisfy journaling guarantees.
> So if we abort the journal (for whatever reason) before R is fully committed,
> no change in R will be seen on the filesystem regardless of whether you
> clean up the journal or not.

No, changes in R will be seen on the filesystem.
The metadata buffer for block B is marked as dirty when it is unfiled,
whether the journal has aborted or not. Eventually the buffer will be
written back to the filesystem by pdflush. Actually, I have confirmed
this behavior by using SystemTap. So if both a journal abort and a
system crash happen at the same time, the filesystem would end up in an
inconsistent state. As I stated, replaying the journaled block B from
transaction A may also corrupt the filesystem, because the changes in
transaction R are only partially reflected. That is why I sent a patch to
prevent metadata buffers from being dirtied on abort.
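
For reference, a minimal sketch of the kind of check such a patch adds
(illustrative only, not the actual patch; it assumes the redirtying happens
at the point where a jbddirty buffer is unfiled and exposed to the VM):

	/*
	 * Illustrative sketch: once the journal has aborted, do not hand the
	 * buffer back to the VM as dirty, so pdflush cannot write part of an
	 * uncommitted transaction back to the filesystem.
	 */
	if (test_clear_buffer_jbddirty(bh)) {
		if (!is_journal_aborted(journal))
			mark_buffer_dirty(bh);	/* expose it to the VM */
	}
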

> And you cannot do much differently - the principle of journaling is that
> either all changes in the transaction get to disk or none of them. So until
> the transaction is properly committed, the only way to satisfy this
> condition is to take the "none of them" choice.
> Now fsck could of course be clever and try to use updates from a not fully
> committed transaction, but personally I don't think it's worth the effort.
> Please correct me if I just misunderstood your point...

Thanks,
--
Hidehiro Kawai
Hitachi, Systems Development Laboratory
Linux Technology Center

2008-07-07 10:07:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 4/5] jbd: fix error handling for checkpoint io

Hello,

On Mon 30-06-08 14:09:09, Hidehiro Kawai wrote:
> Jan Kara wrote:
>
> > On Fri 27-06-08 17:06:56, Hidehiro Kawai wrote:
> >
> >>Jan Kara wrote:
> >>
> >>
> >>>On Tue 24-06-08 20:52:59, Hidehiro Kawai wrote:
> >>
> >>>>>>3. is implemented as described below.
> >>>>>>(1) if log_do_checkpoint() detects an I/O error during
> >>>>>> checkpointing, it calls journal_abort() to abort the journal
> >>>>>>(2) if the journal has aborted, don't update s_start and s_sequence
> >>>>>> in the on-disk journal superblock
> >>>>>>
> >>>>>>So, if the journal aborts, journaled data will be replayed on the
> >>>>>>next mount.
> >>>>>>
> >>>>>>Now, please remember that some dirty metadata buffers are written
> >>>>>>back to the filesystem without journaling if the journal aborted.
> >>>>>>We are happy if all dirty metadata buffers are written to the disk;
> >>>>>>then the integrity of the filesystem will be kept. However, replaying
> >>>>>>the journaled data can overwrite the latest on-disk metadata blocks
> >>>>>>partly with old data. It would break the filesystem.
> >>>>>
> >>>>> Yes, it would. But how do you think it can happen that a metadata buffer
> >>>>>will be written back to the filesystem when it is part of a running
> >>>>>transaction? Note that checkpointing code specifically checks whether the
> >>>>>buffer being written back is part of a running transaction and if so, it
> >>>>>waits for commit before writing back the buffer. So I don't think this can
> >>>>>happen but maybe I miss something...
> >>>>
> >>>>Checkpointing code checks it and may call log_wait_commit(), but this
> >>>>problem is caused by transactions which have not started checkpointing.
> >>>>
> >>>>For example, the tail transaction has an old update for block_B and
> >>>>the running transaction has a new update for block_B. Then, the
> >>>>committing transaction fails to write the commit record, it aborts the
> >>>>journal, and new block_B will be written back to the file system without
> >>>>journaling. Because this patch doesn't distinguish between a normal abort
> >>>>and a checkpointing-related abort, the tail transaction is left in the
> >>>>journal space. So by replaying the tail transaction, new block_B is
> >>>>overwritten with the old one.
> >>>
> >>> Yes, and this is expected and correct. When we cannot properly finish a
> >>>transaction, we have to discard everything in it. A bug would be (and I
> >>>think it could currently happen) if we already checkpointed the previous
> >>>transaction and then wrote new data from the uncommitted transaction over
> >>>block_B. I think we have to avoid that - i.e., in case we abort the
> >>>journal we should not mark buffers dirty when processing the forget loop.
> >>
> >>Yes.
> >>
> >>
> >>>But this is not too serious since fsck has to be run anyway and it will
> >>>fix the problems.
> >>
> >>Yes. The filesystem should be marked with an error, so fsck will check
> >>and recover the filesystem on boot. But this means the filesystem loses
> >>some of the latest updates even if it was cleanly unmounted (although
> >>some file data has been lost anyway). I'm a bit afraid that some people
> >>would think of this as a regression due to this PATCH 4/5. At least, to
> >>avoid undesirable replay, we had better keep journaled data only when the
> >>journal has been aborted for a checkpointing-related reason.
> >
> > I don't think this makes any difference. Look: We have transaction A
> > modifying block B fully committed to the journal. Now there is a running
> > (or committing, it does not really matter) transaction R also modifying block
> > B. Until R gets fully committed, no block modified by R is checkpointed
> > to the device - checkpointing code takes care of that and it must be so
> > to satisfy journaling guarantees.
> > So if we abort the journal (for whatever reason) before R is fully committed,
> > no change in R will be seen on the filesystem regardless of whether you
> > clean up the journal or not.
>
> No, changes in R will be seen on the filesystem.
> The metadata buffer for block B is marked as dirty when it is unfiled,
> whether the journal has aborted or not. Eventually the buffer will be
> written back to the filesystem by pdflush. Actually, I have confirmed
> this behavior by using SystemTap. So if both a journal abort and a
> system crash happen at the same time, the filesystem would end up in an
> inconsistent state. As I stated, replaying the journaled block B from
> transaction A may also corrupt the filesystem, because the changes in
> transaction R are only partially reflected. That is why I sent a patch to
> prevent metadata buffers from being dirtied on abort.
Ah, OK. Thanks for the explanation. Yes, we should not mark buffers dirty
when the journal is aborted and we unfile them. That should fix the whole
problem.
Bye
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR