2008-03-06 02:05:59

by Duane Griffin

Subject: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

At present, as discussed in this LKML thread,
http://marc.info/?l=linux-kernel&m=117607695406580, when a dirty ext3
filesystem is mounted read-only it writes to the disk while replaying the
journal log and cleaning up the orphan list. This behaviour may surprise users
and can potentially cause data corruption/loss (e.g. if a system is suspended,
booted into a different OS, then resumed).

This patch series attempts to address this by using a block translation table
instead of replaying the journal on a read-only filesystem.
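
In outline: while scanning the log, record which journal block holds the newest
copy of each modified filesystem block, then redirect reads of those blocks to
the journal instead of replaying them in place. The user-space sketch below is
purely illustrative (the names and structure here are made up; the real code is
in patches 5 and 6, where the table lives in j_bt_hash):

#include <stdlib.h>

/* Illustrative only: map a filesystem block to the journal block that
 * holds its newest, unreplayed contents. */
struct bt_entry {
	unsigned long original;		/* filesystem block number */
	unsigned long target;		/* journal block with its newest data */
	struct bt_entry *next;
};

#define BT_BUCKETS 256
static struct bt_entry *bt_hash[BT_BUCKETS];

/* Called for each tag of each descriptor block during the replay pass;
 * a later transaction simply overwrites an earlier mapping. */
static int bt_record(unsigned long original, unsigned long target)
{
	unsigned int bucket = original % BT_BUCKETS;
	struct bt_entry *e;

	for (e = bt_hash[bucket]; e; e = e->next) {
		if (e->original == original) {
			e->target = target;
			return 0;
		}
	}

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->original = original;
	e->target = target;
	e->next = bt_hash[bucket];
	bt_hash[bucket] = e;
	return 0;
}

Patch 5 builds essentially this structure during recovery; patch 6 makes the
ext3 block accessors consult it.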

Patches 1-3 are independent cleanups/bug-fixes for things I came across while
working on this. They could be submitted separately and are not required for
following patches.

Patch 4 is a refactoring change that simplifies the code prior to later
substantive changes.

Patch 5 introduces the translation table and support for a truly read-only
journal into jbd.

Patch 6 uses the facility introduced in patch 5 to add support for true
read-only ext3.

For testing I've been using qemu VMs to create and mount dirtied filesystems. I
have a set of scripts that fully automates creating a dirty filesystem and then
checking that mounting it read-only and read-write produces consistent results.
On my system it gets through around 30 iterations overnight. If anyone is
interested in the scripts please let me know. Any suggestions for additional
tests or enhancements that could be made to the scripts would be gratefully
received.

TODO:
* Add R/W remount support
* Port to ext4

Cheers,
Duane Griffin.

> git diff --stat origin
fs/ext3/balloc.c | 2 +-
fs/ext3/ialloc.c | 2 +-
fs/ext3/inode.c | 8 +-
fs/ext3/resize.c | 2 +-
fs/ext3/super.c | 123 ++++++++++-----
fs/ext3/xattr.c | 8 +-
fs/jbd/checkpoint.c | 2 +-
fs/jbd/commit.c | 2 +-
fs/jbd/journal.c | 68 +++++---
fs/jbd/recovery.c | 402 +++++++++++++++++++++++++++++++++--------------
fs/jbd/revoke.c | 133 ++++++----------
fs/ocfs2/journal.c | 4 +-
include/linux/ext3_fs.h | 7 +
include/linux/jbd.h | 41 +++++-
14 files changed, 516 insertions(+), 288 deletions(-)



2008-03-06 02:06:01

by Duane Griffin

Subject: [RFC, PATCH 1/6] jbd: eliminate duplicated code in revocation table init/destroy functions

The revocation table initialisation/destruction code is repeated for each of
the two revocation tables stored in the journal. Refactoring the duplicated
code into functions is tidier, simplifies the logic in initialisation in
particular, and slightly reduces the code size.

There should not be any functional change.

Signed-off-by: Duane Griffin <[email protected]>
---

This change is an independent cleanup which is not required by the following
patches in this series. Also, perhaps it should be two separate patches?

fs/jbd/revoke.c | 126 +++++++++++++++++++++++--------------------------------
1 files changed, 52 insertions(+), 74 deletions(-)

diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index ad2eacf..5f64df4 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -195,108 +195,86 @@ void journal_destroy_revoke_caches(void)
revoke_table_cache = NULL;
}

-/* Initialise the revoke table for a given journal to a given size. */
-
-int journal_init_revoke(journal_t *journal, int hash_size)
+static int journal_init_revoke_table(struct jbd_revoke_table_s *table, int size)
{
- int shift, tmp;
-
- J_ASSERT (journal->j_revoke_table[0] == NULL);
-
- shift = 0;
- tmp = hash_size;
+ int shift = 0;
+ int tmp = size;
while((tmp >>= 1UL) != 0UL)
shift++;

- journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
- if (!journal->j_revoke_table[0])
- return -ENOMEM;
- journal->j_revoke = journal->j_revoke_table[0];
-
- /* Check that the hash_size is a power of two */
- J_ASSERT(is_power_of_2(hash_size));
-
- journal->j_revoke->hash_size = hash_size;
-
- journal->j_revoke->hash_shift = shift;
-
- journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
- if (!journal->j_revoke->hash_table) {
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
- journal->j_revoke = NULL;
+ table->hash_size = size;
+ table->hash_shift = shift;
+ table->hash_table = kmalloc(
+ size * sizeof(struct list_head), GFP_KERNEL);
+ if (!table->hash_table)
return -ENOMEM;
- }
-
- for (tmp = 0; tmp < hash_size; tmp++)
- INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);

- journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
- if (!journal->j_revoke_table[1]) {
- kfree(journal->j_revoke_table[0]->hash_table);
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
- return -ENOMEM;
- }
+ for (tmp = 0; tmp < size; tmp++)
+ INIT_LIST_HEAD(&table->hash_table[tmp]);

- journal->j_revoke = journal->j_revoke_table[1];
+ return 0;
+}

- /* Check that the hash_size is a power of two */
+/* Initialise the revoke table for a given journal to a given size. */
+int journal_init_revoke(journal_t *journal, int hash_size)
+{
+ J_ASSERT(journal->j_revoke_table[0] == NULL);
J_ASSERT(is_power_of_2(hash_size));

- journal->j_revoke->hash_size = hash_size;
+ journal->j_revoke_table[0] = kmem_cache_alloc(
+ revoke_table_cache, GFP_KERNEL);
+ if (!journal->j_revoke_table[0])
+ goto failed_alloc1;
+ if (journal_init_revoke_table(journal->j_revoke_table[0], hash_size))
+ goto failed_init1;

- journal->j_revoke->hash_shift = shift;
+ journal->j_revoke_table[1] = kmem_cache_alloc(
+ revoke_table_cache, GFP_KERNEL);
+ if (!journal->j_revoke_table[1])
+ goto failed_alloc2;
+ if (journal_init_revoke_table(journal->j_revoke_table[1], hash_size))
+ goto failed_init2;

- journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
- if (!journal->j_revoke->hash_table) {
- kfree(journal->j_revoke_table[0]->hash_table);
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
- journal->j_revoke = NULL;
- return -ENOMEM;
- }
-
- for (tmp = 0; tmp < hash_size; tmp++)
- INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
+ journal->j_revoke = journal->j_revoke_table[1];

spin_lock_init(&journal->j_revoke_lock);

return 0;
-}

-/* Destoy a journal's revoke table. The table must already be empty! */
+failed_init2:
+ kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
+failed_alloc2:
+ kfree(journal->j_revoke_table[0]->hash_table);
+failed_init1:
+ kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
+failed_alloc1:
+ return -ENOMEM;
+}

-void journal_destroy_revoke(journal_t *journal)
+static void journal_destroy_revoke_table(struct jbd_revoke_table_s *table)
{
- struct jbd_revoke_table_s *table;
- struct list_head *hash_list;
int i;
-
- table = journal->j_revoke_table[0];
- if (!table)
- return;
-
- for (i=0; i<table->hash_size; i++) {
+ struct list_head *hash_list;
+ for (i = 0; i < table->hash_size; i++) {
hash_list = &table->hash_table[i];
- J_ASSERT (list_empty(hash_list));
+ J_ASSERT(list_empty(hash_list));
}

kfree(table->hash_table);
kmem_cache_free(revoke_table_cache, table);
- journal->j_revoke = NULL;
+}

- table = journal->j_revoke_table[1];
- if (!table)
+/* Destroy a journal's revoke table. The table must already be empty! */
+void journal_destroy_revoke(journal_t *journal)
+{
+ if (!journal->j_revoke_table[0])
return;
+ journal_destroy_revoke_table(journal->j_revoke_table[0]);
+ journal->j_revoke = NULL;

- for (i=0; i<table->hash_size; i++) {
- hash_list = &table->hash_table[i];
- J_ASSERT (list_empty(hash_list));
- }
-
- kfree(table->hash_table);
- kmem_cache_free(revoke_table_cache, table);
+ if (!journal->j_revoke_table[1])
+ return;
+ journal_destroy_revoke_table(journal->j_revoke_table[1]);
journal->j_revoke = NULL;
}

--
1.5.3.7

2008-03-06 02:06:04

by Duane Griffin

Subject: [RFC, PATCH 2/6] jbd: replace potentially false assertion with if block

If an error occurs during jbd cache initialisation it is possible for the
journal_head_cache to be NULL when journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.

Note that even with this fix things will break badly if ext3 is statically
compiled in and cache initialisation fails.

Signed-off-by: Duane Griffin <[email protected]>
---

At the moment the cache destruction methods are inconsistent in whether they
set the cache pointers to NULL after destroying them. To be precise,
journal_destroy_handle_cache doesn't; the others do. I haven't changed that,
although I was sorely tempted. I think it would be better to be consistent and
that NULLing the pointers is preferable. Any objections?

This patch is an independent bug fix and following patches in this series do
not depend on it.

fs/jbd/journal.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 3943a89..3f334a3 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1635,9 +1635,10 @@ static int journal_init_journal_head_cache(void)

static void journal_destroy_journal_head_cache(void)
{
- J_ASSERT(journal_head_cache != NULL);
- kmem_cache_destroy(journal_head_cache);
- journal_head_cache = NULL;
+ if (journal_head_cache) {
+ kmem_cache_destroy(journal_head_cache);
+ journal_head_cache = NULL;
+ }
}

/*
--
1.5.3.7


2008-03-06 02:06:04

by Duane Griffin

Subject: [RFC, PATCH 3/6] jbd: only create debugfs entries if cache initialisation is successful

JBD debugfs entries should only be created if cache initialisation is
successful. At the moment they are created unconditionally, which will
leave them dangling if cache (and hence module) initialisation fails.

Signed-off-by: Duane Griffin <[email protected]>
---

This patch is an independent bug fix and following patches in this series do
not depend on it.

fs/jbd/journal.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 3f334a3..a868277 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1942,9 +1942,10 @@ static int __init journal_init(void)
BUILD_BUG_ON(sizeof(struct journal_superblock_s) != 1024);

ret = journal_init_caches();
- if (ret != 0)
+ if (ret == 0)
+ jbd_create_debugfs_entry();
+ else
journal_destroy_caches();
- jbd_create_debugfs_entry();
return ret;
}

--
1.5.3.7


2008-03-06 02:06:06

by Duane Griffin

Subject: [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

The do_one_pass function iterates through the jbd log, operating on blocks
according to the pass. During the replay pass, descriptor blocks are processed
by an inner loop which replays them. The nesting makes the code difficult to
follow or to modify. Splitting the inner loop and the replay code into separate
functions simplifies matters.

The refactored code no longer unnecessarily reads revoked data blocks, so
behaviour will differ in the case where reading such a block would have failed.
Aside from that, there should be no functional effect.

Signed-off-by: Duane Griffin <[email protected]>
---

Note the TODO before the block read in the outer loop below. We could possibly
attempt to continue after a failed read, as we do when replaying descriptor
blocks. However, if it was a descriptor block we'd likely bomb out on the next
iteration anyway, so I'm not sure it would be worth it. Thoughts?

fs/jbd/recovery.c | 243 ++++++++++++++++++++++++++++-------------------------
1 files changed, 127 insertions(+), 116 deletions(-)

diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index 2b8edf4..e2ac78f 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -307,6 +307,101 @@ int journal_skip_recovery(journal_t *journal)
return err;
}

+static int replay_data_block(
+ journal_t *journal, struct buffer_head *obh, char *data,
+ int flags, unsigned long blocknr)
+{
+ struct buffer_head *nbh;
+
+ /* Find a buffer for the new data being restored */
+ nbh = __getblk(journal->j_fs_dev, blocknr, journal->j_blocksize);
+ if (nbh == NULL)
+ return -ENOMEM;
+
+ lock_buffer(nbh);
+ memcpy(nbh->b_data, obh->b_data, journal->j_blocksize);
+ if (flags & JFS_FLAG_ESCAPE)
+ *((__be32 *)data) = cpu_to_be32(JFS_MAGIC_NUMBER);
+
+ BUFFER_TRACE(nbh, "marking dirty");
+ set_buffer_uptodate(nbh);
+ mark_buffer_dirty(nbh);
+ BUFFER_TRACE(nbh, "marking uptodate");
+ /* ll_rw_block(WRITE, 1, &nbh); */
+ unlock_buffer(nbh);
+ brelse(nbh);
+
+ return 0;
+}
+
+/* Replay a descriptor block: write all of the data blocks. */
+static int replay_descriptor_block(
+ journal_t *journal, char *data,
+ unsigned int next_commit_ID,
+ unsigned long *next_log_block,
+ struct recovery_info *info)
+{
+ int err;
+ int success = 0;
+ char *tagp;
+ unsigned long io_block;
+ struct buffer_head *obh;
+ unsigned long next = *next_log_block;
+
+ tagp = &data[sizeof(journal_header_t)];
+ while ((tagp - data + sizeof(journal_block_tag_t)) <=
+ journal->j_blocksize) {
+ unsigned long blocknr;
+ journal_block_tag_t *tag = (journal_block_tag_t *) tagp;
+ int flags = be32_to_cpu(tag->t_flags);
+
+ io_block = next++;
+ wrap(journal, next);
+ blocknr = be32_to_cpu(tag->t_blocknr);
+
+ /* If the block has been revoked, then we're all done here. */
+ if (journal_test_revoke(journal, blocknr, next_commit_ID)) {
+ ++info->nr_revoke_hits;
+ goto skip_write;
+ }
+
+ err = jread(&obh, journal, io_block);
+ if (err) {
+
+ /* Recover what we can, but
+ * report failure at the end. */
+ success = err;
+ printk(KERN_ERR "JBD: IO error %d recovering "
+ "block %ld in log\n", err, io_block);
+ continue;
+ }
+ J_ASSERT(obh != NULL);
+
+ err = replay_data_block(journal, obh, data, flags, blocknr);
+ if (err) {
+ brelse(obh);
+ printk(KERN_ERR "JBD: Out of memory during recovery\n");
+ success = err;
+ goto failed;
+ }
+
+ ++info->nr_replays;
+ brelse(obh);
+
+skip_write:
+ tagp += sizeof(journal_block_tag_t);
+ if (!(flags & JFS_FLAG_SAME_UUID))
+ tagp += 16;
+
+ if (flags & JFS_FLAG_LAST_TAG)
+ break;
+ }
+
+ *next_log_block = next;
+failed:
+ return success;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -316,8 +411,6 @@ static int do_one_pass(journal_t *journal,
journal_superblock_t * sb;
journal_header_t * tmp;
struct buffer_head * bh;
- unsigned int sequence;
- int blocktype;

/* Precompute the maximum metadata descriptors in a descriptor block */
int MAX_BLOCKS_PER_DESC;
@@ -348,11 +441,8 @@ static int do_one_pass(journal_t *journal,
*/

while (1) {
- int flags;
- char * tagp;
- journal_block_tag_t * tag;
- struct buffer_head * obh;
- struct buffer_head * nbh;
+ int blocktype;
+ unsigned int sequence;

cond_resched();

@@ -371,10 +461,13 @@ static int do_one_pass(journal_t *journal,
* either the next descriptor block or the final commit
* record. */

+ /* TODO: Should we continue on error? */
jbd_debug(3, "JBD: checking block %ld\n", next_log_block);
err = jread(&bh, journal, next_log_block);
- if (err)
+ if (err) {
+ success = err;
goto failed;
+ }

next_log_block++;
wrap(journal, next_log_block);
@@ -406,137 +499,57 @@ static int do_one_pass(journal_t *journal,
* all of the sequence number checks. What are we going
* to do with it? That depends on the pass... */

- switch(blocktype) {
+ switch (blocktype) {
case JFS_DESCRIPTOR_BLOCK:
/* If it is a valid descriptor block, replay it
* in pass REPLAY; otherwise, just skip over the
* blocks it describes. */
- if (pass != PASS_REPLAY) {
+ if (pass == PASS_REPLAY) {
+ err = replay_descriptor_block(
+ journal,
+ bh->b_data,
+ next_commit_ID,
+ &next_log_block,
+ info);
+ } else {
next_log_block +=
count_tags(bh, journal->j_blocksize);
wrap(journal, next_log_block);
- brelse(bh);
- continue;
}
-
- /* A descriptor block: we can now write all of
- * the data blocks. Yay, useful work is finally
- * getting done here! */
-
- tagp = &bh->b_data[sizeof(journal_header_t)];
- while ((tagp - bh->b_data +sizeof(journal_block_tag_t))
- <= journal->j_blocksize) {
- unsigned long io_block;
-
- tag = (journal_block_tag_t *) tagp;
- flags = be32_to_cpu(tag->t_flags);
-
- io_block = next_log_block++;
- wrap(journal, next_log_block);
- err = jread(&obh, journal, io_block);
- if (err) {
- /* Recover what we can, but
- * report failure at the end. */
- success = err;
- printk (KERN_ERR
- "JBD: IO error %d recovering "
- "block %ld in log\n",
- err, io_block);
- } else {
- unsigned long blocknr;
-
- J_ASSERT(obh != NULL);
- blocknr = be32_to_cpu(tag->t_blocknr);
-
- /* If the block has been
- * revoked, then we're all done
- * here. */
- if (journal_test_revoke
- (journal, blocknr,
- next_commit_ID)) {
- brelse(obh);
- ++info->nr_revoke_hits;
- goto skip_write;
- }
-
- /* Find a buffer for the new
- * data being restored */
- nbh = __getblk(journal->j_fs_dev,
- blocknr,
- journal->j_blocksize);
- if (nbh == NULL) {
- printk(KERN_ERR
- "JBD: Out of memory "
- "during recovery.\n");
- err = -ENOMEM;
- brelse(bh);
- brelse(obh);
- goto failed;
- }
-
- lock_buffer(nbh);
- memcpy(nbh->b_data, obh->b_data,
- journal->j_blocksize);
- if (flags & JFS_FLAG_ESCAPE) {
- *((__be32 *)bh->b_data) =
- cpu_to_be32(JFS_MAGIC_NUMBER);
- }
-
- BUFFER_TRACE(nbh, "marking dirty");
- set_buffer_uptodate(nbh);
- mark_buffer_dirty(nbh);
- BUFFER_TRACE(nbh, "marking uptodate");
- ++info->nr_replays;
- /* ll_rw_block(WRITE, 1, &nbh); */
- unlock_buffer(nbh);
- brelse(obh);
- brelse(nbh);
- }
-
- skip_write:
- tagp += sizeof(journal_block_tag_t);
- if (!(flags & JFS_FLAG_SAME_UUID))
- tagp += 16;
-
- if (flags & JFS_FLAG_LAST_TAG)
- break;
- }
-
- brelse(bh);
- continue;
+ break;

case JFS_COMMIT_BLOCK:
/* Found an expected commit block: not much to
* do other than move on to the next sequence
* number. */
- brelse(bh);
next_commit_ID++;
- continue;
+ break;

case JFS_REVOKE_BLOCK:
/* If we aren't in the REVOKE pass, then we can
* just skip over this block. */
- if (pass != PASS_REVOKE) {
- brelse(bh);
- continue;
+ if (pass == PASS_REVOKE) {
+ err = scan_revoke_records(
+ journal, bh, next_commit_ID, info);
}
-
- err = scan_revoke_records(journal, bh,
- next_commit_ID, info);
- brelse(bh);
- if (err)
- goto failed;
- continue;
+ break;

default:
jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
blocktype);
- brelse(bh);
- goto done;
+ break;
}
+
+ brelse(bh);
+
+ /* Immediately fail on OOM; on other errors try to recover
+ * as much as we can, but report the failure at the end. */
+ if (err)
+ success = err;
+ if (err == -ENOMEM)
+ goto failed;
}

- done:
/*
* We broke out of the log scan loop: either we came to the
* known end of the log or we found an unexpected block in the
@@ -558,10 +571,8 @@ static int do_one_pass(journal_t *journal,
}
}

- return success;

2008-03-06 02:06:11

by Duane Griffin

Subject: [RFC, PATCH 5/6] jbd: add support for read-only log recovery

Read-only log recovery allows a dirty journalled filesystem to be mounted, and
to provide a consistent view of its data, without writing to the disk. Instead
of replaying the log, a mapping is constructed between modified filesystem
blocks and the journal blocks containing their data.

The mapping is stored in a hashtable that is created and populated during
journal recovery. The hash function used is the same one used by the journal
revocation code. The size of the hashtable is currently set arbitrarily to 256
entries. Given that we know the number of block mappings when we create the
table, it may be worth dynamically calculating an appropriate size instead of
hard-coding it.
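
For example, the size could be picked from the mapping count along these lines
(a sketch only; the helper name and bounds are invented, and the posted patch
just uses a fixed JOURNAL_DEFAULT_BT_HASHBITS of 8):

/* Illustrative only: choose hash bits for an expected number of mappings. */
static int bt_hash_bits(unsigned int expected_mappings)
{
	int bits = 4;				/* arbitrary floor: 16 buckets */

	while ((1U << bits) < expected_mappings && bits < 16)
		bits++;				/* arbitrary ceiling: 64K buckets */
	return bits;
}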

Signed-off-by: Duane Griffin <[email protected]>
---

I'm tempted to add warnings on attempts to modify a read-only journal; does
that sound useful? I'd also be grateful to anyone with suggestions for better
member names. :)

fs/ext3/super.c | 2 +-
fs/jbd/checkpoint.c | 2 +-
fs/jbd/commit.c | 2 +-
fs/jbd/journal.c | 56 ++++++++++------
fs/jbd/recovery.c | 187 +++++++++++++++++++++++++++++++++++++++++++++------
fs/jbd/revoke.c | 7 +--
fs/ocfs2/journal.c | 4 +-
include/linux/jbd.h | 41 ++++++++++-
8 files changed, 246 insertions(+), 55 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18769cc..4b9ff65 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2168,7 +2168,7 @@ static int ext3_load_journal(struct super_block *sb,
if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
err = journal_wipe(journal, !really_read_only);
if (!err)
- err = journal_load(journal);
+ err = journal_load(journal, !really_read_only);

if (err) {
printk(KERN_ERR "EXT3-fs: error loading journal.\n");
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index a5432bb..84d4579 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -454,7 +454,7 @@ int cleanup_journal_tail(journal_t *journal)
journal->j_tail = blocknr;
spin_unlock(&journal->j_state_lock);
if (!(journal->j_flags & JFS_ABORT))
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, 1, 1);
return 0;
}

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index a38c718..c63598b 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -310,7 +310,7 @@ void journal_commit_transaction(journal_t *journal)
/* Do we need to erase the effects of a prior journal_flush? */
if (journal->j_flags & JFS_FLUSHED) {
jbd_debug(3, "super block updated\n");
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, 1, 1);
} else {
jbd_debug(3, "superblock not updated\n");
}
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index a868277..b2723a0 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -822,7 +822,7 @@ static void journal_fail_superblock (journal_t *journal)
* subsequent use.
*/

-static int journal_reset(journal_t *journal)
+static int journal_reset(journal_t *journal, int write)
{
journal_superblock_t *sb = journal->j_superblock;
unsigned long first, last;
@@ -843,8 +843,8 @@ static int journal_reset(journal_t *journal)

journal->j_max_transaction_buffers = journal->j_maxlen / 4;

- /* Add the dynamic fields and write it to disk. */
- journal_update_superblock(journal, 1);
+ /* Add the dynamic fields and write it to disk if not read-only. */
+ journal_update_superblock(journal, write, 1);
return journal_start_thread(journal);
}

@@ -916,21 +916,21 @@ int journal_create(journal_t *journal)
journal->j_flags &= ~JFS_ABORT;
journal->j_format_version = 2;

- return journal_reset(journal);
+ return journal_reset(journal, 1);
}

/**
- * void journal_update_superblock() - Update journal sb on disk.
+ * void journal_update_superblock() - Update journal sb in memory and on disk.
* @journal: The journal to update.
+ * @write: Set to '0' if you don't want to write to the disk.
* @wait: Set to '0' if you don't want to wait for IO completion.
*
* Update a journal's dynamic superblock fields and write it to disk,
* optionally waiting for the IO to complete.
*/
-void journal_update_superblock(journal_t *journal, int wait)
+void journal_update_superblock(journal_t *journal, int write, int wait)
{
journal_superblock_t *sb = journal->j_superblock;
- struct buffer_head *bh = journal->j_sb_buffer;

/*
* As a special case, if the on-disk copy is already marked as needing
@@ -957,12 +957,16 @@ void journal_update_superblock(journal_t *journal, int wait)
sb->s_errno = cpu_to_be32(journal->j_errno);
spin_unlock(&journal->j_state_lock);

- BUFFER_TRACE(bh, "marking dirty");
- mark_buffer_dirty(bh);
- if (wait)
- sync_dirty_buffer(bh);
- else
- ll_rw_block(SWRITE, 1, &bh);
+ if (write) {
+ struct buffer_head *bh = journal->j_sb_buffer;
+
+ BUFFER_TRACE(bh, "marking dirty");
+ mark_buffer_dirty(bh);
+ if (wait)
+ sync_dirty_buffer(bh);
+ else
+ ll_rw_block(SWRITE, 1, &bh);
+ }

out:
/* If we have just flushed the log (by marking s_start==0), then
@@ -1066,12 +1070,13 @@ static int load_superblock(journal_t *journal)
/**
* int journal_load() - Read journal from disk.
* @journal: Journal to act on.
+ * @write: Whether the journal may be modified on-disk.
*
* Given a journal_t structure which tells us which disk blocks contain
* a journal, read the journal from disk to initialise the in-memory
* structures.
*/
-int journal_load(journal_t *journal)
+int journal_load(journal_t *journal, int write)
{
int err;
journal_superblock_t *sb;
@@ -1097,13 +1102,13 @@ int journal_load(journal_t *journal)

/* Let the recovery code check whether it needs to recover any
* data from the journal. */
- if (journal_recover(journal))
+ if (journal_recover(journal, write))
goto recovery_error;

/* OK, we've finished with the dynamic journal bits:
* reinitialise the dynamic contents of the superblock in memory
* and reset them on disk. */
- if (journal_reset(journal))
+ if (journal_reset(journal, write))
goto recovery_error;

journal->j_flags &= ~JFS_ABORT;
@@ -1150,7 +1155,12 @@ void journal_destroy(journal_t *journal)
journal->j_tail = 0;
journal->j_tail_sequence = ++journal->j_transaction_sequence;
if (journal->j_sb_buffer) {
- journal_update_superblock(journal, 1);
+
+ /*
+ * Ugly, but what to do?
+ * We don't have a superblock to test for read-only.
+ */
+ journal_update_superblock(journal, !journal->j_bt_hash, 1);
brelse(journal->j_sb_buffer);
}

@@ -1158,6 +1168,8 @@ void journal_destroy(journal_t *journal)
iput(journal->j_inode);
if (journal->j_revoke)
journal_destroy_revoke(journal);
+ if (journal->j_bt_hash)
+ journal_destroy_translations(journal);
kfree(journal->j_wbuf);
kfree(journal);
}
@@ -1374,7 +1386,7 @@ int journal_flush(journal_t *journal)
old_tail = journal->j_tail;
journal->j_tail = 0;
spin_unlock(&journal->j_state_lock);
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, 1, 1);
spin_lock(&journal->j_state_lock);
journal->j_tail = old_tail;

@@ -1420,8 +1432,7 @@ int journal_wipe(journal_t *journal, int write)
write ? "Clearing" : "Ignoring");

err = journal_skip_recovery(journal);
- if (write)
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, write, 1);

no_recovery:
return err;
@@ -1489,7 +1500,7 @@ static void __journal_abort_soft (journal_t *journal, int errno)
__journal_abort_hard(journal);

if (errno)
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, !journal->j_bt_hash, 1);
}

/**
@@ -1922,6 +1933,8 @@ static int __init journal_init_caches(void)

ret = journal_init_revoke_caches();
if (ret == 0)
+ ret = journal_init_recover_cache();
+ if (ret == 0)
ret = journal_init_journal_head_cache();
if (ret == 0)
ret = journal_init_handle_cache();
@@ -1931,6 +1944,7 @@ static int __init journal_init_caches(void)
static void journal_destroy_caches(void)
{
journal_destroy_revoke_caches();
+ journal_destroy_recover_cache();
journal_destroy_journal_head_cache();
journal_destroy_handle_cache();
}
diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index e2ac78f..a73b062 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -17,12 +17,15 @@
#include "jfs_user.h"
#else
#include <linux/time.h>
+#include <linux/cache.h>
#include <linux/fs.h>
#include <linux/jbd.h>
#include <linux/errno.h>
#include <linux/slab.h>
#endif

+static struct kmem_cache *bt_record_cache;
+
/*
* Maintain information about the progress of the recovery job, so that
* the different passes can carry information between them.
@@ -123,6 +126,71 @@ failed:

#endif /* __KERNEL__ */

+int __init journal_init_recover_cache(void)
+{
+ bt_record_cache = kmem_cache_create(
+ "block_translation_record",
+ sizeof(struct journal_block_translation),
+ 0, 0, NULL);
+ if (bt_record_cache == NULL) {
+ printk(KERN_EMERG "JBD: failed to create recover cache\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void journal_destroy_recover_cache(void)
+{
+ if (bt_record_cache) {
+ kmem_cache_destroy(bt_record_cache);
+ bt_record_cache = NULL;
+ }
+}
+
+bool journal_translate_block(journal_t *journal, sector_t blocknr,
+ sector_t *retp)
+{
+ int hash;
+ struct journal_block_translation *jbt;
+ struct hlist_node *node;
+
+ if (!journal || !journal->j_bt_hash)
+ goto unmapped;
+
+ hash = jbd_hash(blocknr, journal->j_bt_hash_bits,
+ 1 << journal->j_bt_hash_bits);
+ hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
+ if (jbt->original == blocknr) {
+ *retp = jbt->target;
+ return true;
+ }
+ }
+
+unmapped:
+ *retp = blocknr;
+ return false;
+}
+
+void journal_destroy_translations(journal_t *journal)
+{
+ int ii;
+ struct journal_block_translation *jbt;
+ struct hlist_node *tmp;
+ struct hlist_node *node;
+
+ if (!journal->j_bt_hash)
+ return;
+
+ for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++) {
+ hlist_for_each_entry_safe(jbt, node, tmp,
+ journal->j_bt_hash + ii, jbt_list) {
+ kmem_cache_free(bt_record_cache, jbt);
+ }
+ }
+
+ kfree(journal->j_bt_hash);
+ journal->j_bt_hash = NULL;
+}

/*
* Read a block from the journal
@@ -173,6 +241,45 @@ static int jread(struct buffer_head **bhp, journal_t *journal,
return 0;
}

+static int journal_record_bt(journal_t *journal, sector_t original,
+ sector_t target)
+{
+ int err;
+ int hash;
+ struct hlist_node *node;
+ struct journal_block_translation *jbt;
+ sector_t blocknr;
+
+ /* Store the physical block to avoid repeated lookups. */
+ err = journal_bmap(journal, target, &blocknr);
+ if (err)
+ return err;
+
+ /*
+ * If the same block has been translated multiple times,
+ * just use the last.
+ */
+ hash = jbd_hash(original, journal->j_bt_hash_bits,
+ 1 << journal->j_bt_hash_bits);
+ hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
+ if (jbt->original == original) {
+ jbt->target = blocknr;
+ return 0;
+ }
+ }
+
+ jbt = kmem_cache_alloc(bt_record_cache, GFP_NOFS);
+ if (!jbt)
+ return -ENOMEM;
+
+ INIT_HLIST_NODE(&jbt->jbt_list);
+ jbt->original = original;
+ jbt->target = blocknr;
+
+ hlist_add_head(&jbt->jbt_list, journal->j_bt_hash + hash);
+
+ return 0;
+}

/*
* Count the number of in-use tags in a journal descriptor block.
@@ -201,6 +308,24 @@ static int count_tags(struct buffer_head *bh, int size)
return nr;
}

+#define JOURNAL_DEFAULT_BT_HASHBITS 8
+static int bt_hash_init(journal_t *journal)
+{
+ int ii;
+ BUG_ON(journal->j_bt_hash);
+
+ /* Vary size depending on the number of mappings/device size? */
+ journal->j_bt_hash_bits = JOURNAL_DEFAULT_BT_HASHBITS;
+ journal->j_bt_hash = kmalloc(sizeof(struct hlist_head) *
+ 1 << journal->j_bt_hash_bits, GFP_NOFS);
+ if (!journal->j_bt_hash)
+ return -ENOMEM;
+
+ for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++)
+ INIT_HLIST_HEAD(journal->j_bt_hash + ii);
+
+ return 0;
+}

/* Make sure we wrap around the log correctly! */
#define wrap(journal, var) \
@@ -212,6 +337,7 @@ do { \
/**
* journal_recover - recovers a on-disk journal
* @journal: the journal to recover
+ * @write: whether to write to disk
*
* The primary function for recovering the log contents when mounting a
* journaled device.
@@ -220,8 +346,11 @@ do { \
* end of the log. In the second, we assemble the list of revoke
* blocks. In the third and final pass, we replay any un-revoked blocks
* in the log.
+ *
+ * If the journal is read-only a translation table will be built up instead
+ * of the journal being replayed on disk.
*/
-int journal_recover(journal_t *journal)
+int journal_recover(journal_t *journal, int write)
{
int err;
journal_superblock_t * sb;
@@ -244,6 +373,16 @@ int journal_recover(journal_t *journal)
return 0;
}

+ if (!write) {
+ err = bt_hash_init(journal);
+ if (err) {
+ printk(KERN_ERR
+ "JBD: out of memory allocating"
+ " journal block translation hash\n");
+ return err;
+ }
+ }
+
err = do_one_pass(journal, &info, PASS_SCAN);
if (!err)
err = do_one_pass(journal, &info, PASS_REVOKE);
@@ -256,12 +395,15 @@ int journal_recover(journal_t *journal)
jbd_debug(1, "JBD: Replayed %d and revoked %d/%d blocks\n",
info.nr_replays, info.nr_revoke_hits, info.nr_revokes);

- /* Restart the log at the next transaction ID, thus invalidating
- * any existing commit records in the log. */
- journal->j_transaction_sequence = ++info.end_transaction;
+ if (write) {
+
+ /* Restart the log at the next transaction ID, thus invalidating
+ * any existing commit records in the log. */
+ journal->j_transaction_sequence = ++info.end_transaction;
+ journal_clear_revoke(journal);
+ sync_blockdev(journal->j_fs_dev);
+ }

- journal_clear_revoke(journal);
- sync_blockdev(journal->j_fs_dev);
return err;
}

@@ -345,7 +487,6 @@ static int replay_descriptor_block(
int success = 0;
char *tagp;
unsigned long io_block;
- struct buffer_head *obh;
unsigned long next = *next_log_block;

tagp = &data[sizeof(journal_header_t)];
@@ -365,28 +506,34 @@ static int replay_descriptor_block(
goto skip_write;
}

- err = jread(&obh, journal, io_block);
- if (err) {
+ if (journal->j_bt_hash) {
+ err = journal_record_bt(journal, blocknr, io_block);
+ } else {
+ struct buffer_head *obh;

- /* Recover what we can, but
- * report failure at the end. */
- success = err;
- printk(KERN_ERR "JBD: IO error %d recovering "
- "block %ld in log\n", err, io_block);
- continue;
- }
- J_ASSERT(obh != NULL);
+ err = jread(&obh, journal, io_block);
+ if (err) {

- err = replay_data_block(journal, obh, data, flags, blocknr);
- if (err) {
+ /* Recover what we can, but
+ * report failure at the end. */
+ success = err;
+ printk(KERN_ERR "JBD: IO error %d recovering "
+ "block %ld in log\n", err, io_block);
+ continue;
+ }
+ J_ASSERT(obh != NULL);
+
+ err = replay_data_block(
+ journal, obh, data, flags, blocknr);
brelse(obh);
+ }
+ if (err) {
printk(KERN_ERR "JBD: Out of memory during recovery\n");
success = err;
goto failed;
}

++info->nr_replays;
- brelse(obh);

skip_write:
tagp += sizeof(journal_block_tag_t);
diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index 5f64df4..7f57d03 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -105,15 +105,10 @@ static void flush_descriptor(journal_t *, struct journal_head *, int);

/* Utility functions to maintain the revoke table */

-/* Borrowed from buffer.c: this is a tried and tested block hash function */
static inline int hash(journal_t *journal, unsigned long block)
{
struct jbd_revoke_table_s *table = journal->j_revoke;
- int hash_shift = table->hash_shift;
-
- return ((block << (hash_shift - 6)) ^
- (block >> 13) ^
- (block << (hash_shift - 12))) & (table->hash_size - 1);
+ return jbd_hash(block, table->hash_shift, table->hash_size);
}

static int insert_revoke_hash(journal_t *journal, unsigned long blocknr,
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index f31c7e8..cbd53aa 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -591,7 +591,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local)

osb = journal->j_osb;

- status = journal_load(journal->j_journal);
+ status = journal_load(journal->j_journal, true);
if (status < 0) {
mlog(ML_ERROR, "Failed to load journal!\n");
goto done;
@@ -1014,7 +1014,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
goto done;
}

- status = journal_load(journal);
+ status = journal_load(journal, 1);
if (status < 0) {
mlog_errno(status);
if (!igrab(inode))
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index b18fd3b..6154082 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -558,6 +558,18 @@ struct transaction_s
};

/**
+ * struct journal_block_translation - hash bucket recording block translations.
+ * @jbt_list: list of translated blocks in this bucket.
+ * @original: the (logical?) filesystem block being translated.
+ * @target: the physical journal block holding the original's data.
+ */
+struct journal_block_translation {
+ struct hlist_node jbt_list;
+ sector_t original;
+ sector_t target;
+};
+
+/**
* struct journal_s - The journal_s type is the concrete type associated with
* journal_t.
* @j_flags: General journaling state flags
@@ -618,6 +630,8 @@ struct transaction_s
* @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
* number that will fit in j_blocksize
* @j_last_sync_writer: most recent pid which did a synchronous write
+ * @j_bt_hash: hash table mapping uncommitted blocks to their targets
+ * @j_bt_hash_bits: number of bits in block translation hash table
* @j_private: An opaque pointer to fs-private information.
*/

@@ -810,6 +824,12 @@ struct journal_s
pid_t j_last_sync_writer;

/*
+ * Remap uncommitted transactions in a read-only filesystem.
+ */
+ struct hlist_head *j_bt_hash;
+ int j_bt_hash_bits;
+
+ /*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here
*/
@@ -916,12 +936,12 @@ extern int journal_check_available_features
extern int journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int journal_create (journal_t *);
-extern int journal_load (journal_t *journal);
+extern int journal_load (journal_t *journal, int);
extern void journal_destroy (journal_t *);
-extern int journal_recover (journal_t *journal);
+extern int journal_recover (journal_t *journal, int);
extern int journal_wipe (journal_t *, int);
extern int journal_skip_recovery (journal_t *);
-extern void journal_update_superblock (journal_t *, int);
+extern void journal_update_superblock (journal_t *, int, int);
extern void journal_abort (journal_t *, int);
extern int journal_errno (journal_t *);
extern void journal_ack_err (journal_t *);
@@ -970,6 +990,12 @@ extern int journal_test_revoke(journal_t *, unsigned long, tid_t);
extern void journal_clear_revoke(journal_t *);
extern void journal_switch_revoke_table(journal_t *journal);

+extern bool journal_translate_block(journal_t *journal, sector_t blocknr,
+ sector_t *retp);
+extern void journal_destroy_translations(journal_t *journal);
+extern int journal_init_recover_cache(void);
+extern void journal_destroy_recover_cache(void);
+
/*
* The log thread user interface:
*
@@ -989,6 +1015,15 @@ void __log_wait_for_space(journal_t *journal);
extern void __journal_drop_transaction(journal_t *, transaction_t *);
extern int cleanup_journal_tail(journal_t *);

+/* Borrowed from buffer.c: this is a tried and tested block hash function */
+static inline int jbd_hash(unsigned long block, int shift, int size)
+{
+ return ((block << (shift - 6)) ^
+ (block >> 13) ^
+ (block << (shift - 12))) &
+ (size - 1);
+}
+
/* Debugging code only: */

#define jbd_ENOSYS() \
--
1.5.3.7


2008-03-06 02:06:14

by Duane Griffin

Subject: [RFC, PATCH 6/6] ext3: do not write to the disk when mounting a dirty read-only filesystem

If a filesystem is being mounted read-only then load the journal read-only
(i.e. do not replay the log) and do not process the orphan list.

At present, when a dirty ext3 filesystem is mounted read-only, it writes to the
disk while replaying the journal log and cleaning up the orphan list. This
behaviour may surprise users and can potentially cause data corruption/loss
(e.g. if a system is suspended, booted into a different OS, then resumed).

Introduce block accessor wrapper functions that check for blocks requiring
translation and access them from the journal as needed. Convert the ext3 code
to use the wrappers anywhere that may be called on a read-only filesystem. Code
that is only called when writing to the filesystem is not converted.
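
The wrappers all share the same shape: ask the journal for a translation and,
if one exists, read the block from the journal device; otherwise fall through
to the usual sb_bread()/sb_getblk()/map_bh() path. A stand-alone illustration
only (the block numbers and names below are made up):

#include <stdio.h>

/* Illustrative only: redirect a read if the block has a newer copy in the
 * journal, otherwise read it from the filesystem as usual. */
struct translation { unsigned long original, target; };

static const struct translation map[] = {
	{ 1234, 8003 },	/* fs block 1234 -> journal block 8003 */
	{ 5678, 8010 },
};

static int translate(unsigned long block, unsigned long *out)
{
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (map[i].original == block) {
			*out = map[i].target;
			return 1;	/* read from the journal device */
		}
	}
	*out = block;
	return 0;			/* read from the filesystem device */
}

int main(void)
{
	unsigned long where;
	int from_journal = translate(1234, &where);

	printf("block 1234: read block %lu from the %s\n",
	       where, from_journal ? "journal" : "filesystem");
	return 0;
}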

For a discussion of the need for this feature, see:
http://marc.info/?l=linux-kernel&m=117607695406580

Tested by creating dirty filesystems, mounting a copy read-write, taking the
md5sum of all its files, mounting a different copy read-only, confirming the
md5sums match, unmounting it, and then confirming that the block device mounted
read-only has not been modified. Testing has been done with both internal and
external journals.

NOTE: For now I'm simply preventing filesystems requiring recovery from being
remounted read-write. This breaks booting when the root filesystem was not
cleanly unmounted!

Signed-off-by: Duane Griffin <[email protected]>
---

I went back and forth on converting all the code to use the accessors. In the
end it felt silly to shunt write code through a wrapper that is only useful on
a read-only filesystem. On the other hand, the inconsistency bothers me and
presumably increases the risk of future changes operating directly on
filesystem blocks when they should go through the wrappers. I'd appreciate
second opinions on this point, in particular.

Also, would it perhaps be better to split this into separate patches to
introduce the accessors and add the no-replay code?

fs/ext3/balloc.c | 2 +-
fs/ext3/ialloc.c | 2 +-
fs/ext3/inode.c | 8 ++--
fs/ext3/resize.c | 2 +-
fs/ext3/super.c | 123 +++++++++++++++++++++++++++++++----------------
fs/ext3/xattr.c | 8 ++--
include/linux/ext3_fs.h | 7 +++
7 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index da0cb2c..1b42154 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -145,7 +145,7 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
if (!desc)
return NULL;
bitmap_blk = le32_to_cpu(desc->bg_block_bitmap);
- bh = sb_getblk(sb, bitmap_blk);
+ bh = ext3_sb_getblk(sb, bitmap_blk);
if (unlikely(!bh)) {
ext3_error(sb, __FUNCTION__,
"Cannot read block bitmap - "
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4f4020c..7de791c 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -60,7 +60,7 @@ read_inode_bitmap(struct super_block * sb, unsigned long block_group)
if (!desc)
goto error_out;

- bh = sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
+ bh = ext3_sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
if (!bh)
ext3_error(sb, "read_inode_bitmap",
"Cannot read inode bitmap - "
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb95670..75fcf9e 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -364,7 +364,7 @@ static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
if (!p->key)
goto no_block;
while (--depth) {
- bh = sb_bread(sb, le32_to_cpu(p->key));
+ bh = ext3_sb_bread(sb, le32_to_cpu(p->key));
if (!bh)
goto failure;
/* Reader: pointers */
@@ -1009,7 +1009,7 @@ struct buffer_head *ext3_getblk(handle_t *handle, struct inode *inode,
*errp = err;
if (!err && buffer_mapped(&dummy)) {
struct buffer_head *bh;
- bh = sb_getblk(inode->i_sb, dummy.b_blocknr);
+ bh = ext3_sb_getblk(inode->i_sb, dummy.b_blocknr);
if (!bh) {
*errp = -EIO;
goto err;
@@ -2514,7 +2514,7 @@ static int __ext3_get_inode_loc(struct inode *inode,
if (!block)
return -EIO;

- bh = sb_getblk(inode->i_sb, block);
+ bh = ext3_sb_getblk(inode->i_sb, block);
if (!bh) {
ext3_error (inode->i_sb, "ext3_get_inode_loc",
"unable to read inode block - "
@@ -2557,7 +2557,7 @@ static int __ext3_get_inode_loc(struct inode *inode,
if (!desc)
goto make_io;

- bitmap_bh = sb_getblk(inode->i_sb,
+ bitmap_bh = ext3_sb_getblk(inode->i_sb,
le32_to_cpu(desc->bg_inode_bitmap));
if (!bitmap_bh)
goto make_io;
diff --git a/fs/ext3/resize.c b/fs/ext3/resize.c
index 9397d77..2f9799b 100644
--- a/fs/ext3/resize.c
+++ b/fs/ext3/resize.c
@@ -236,7 +236,7 @@ static int setup_new_group_blocks(struct super_block *sb,
if (err)
goto exit_bh;

- gdb = sb_getblk(sb, block);
+ gdb = ext3_sb_getblk(sb, block);
if (!gdb) {
err = -EIO;
goto exit_bh;
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 4b9ff65..c333dac 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -64,6 +64,40 @@ static void ext3_unlockfs(struct super_block *sb);
static void ext3_write_super (struct super_block * sb);
static void ext3_write_super_lockfs(struct super_block *sb);

+struct buffer_head *ext3_sb_bread(struct super_block *sb, sector_t block)
+{
+ sector_t blocknr;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ if (journal_translate_block(journal, block, &blocknr))
+ return __bread(journal->j_dev, blocknr, journal->j_blocksize);
+ else
+ return sb_bread(sb, block);
+}
+
+struct buffer_head *ext3_sb_getblk(struct super_block *sb, sector_t block)
+{
+ sector_t blocknr;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ if (journal_translate_block(journal, block, &blocknr))
+ return __getblk(journal->j_dev, blocknr, journal->j_blocksize);
+ else
+ return sb_getblk(sb, block);
+}
+
+void ext3_map_bh(struct buffer_head *bh, struct super_block *sb, sector_t block)
+{
+ sector_t blocknr;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ if (journal_translate_block(journal, block, &blocknr)) {
+ set_buffer_mapped(bh);
+ bh->b_bdev = journal->j_dev;
+ bh->b_blocknr = blocknr;
+ bh->b_size = journal->j_blocksize;
+ } else {
+ map_bh(bh, sb, block);
+ }
+}
+
/*
* Wrappers for journal_start/end.
*
@@ -1328,7 +1362,6 @@ static int ext3_check_descriptors(struct super_block *sb)
static void ext3_orphan_cleanup (struct super_block * sb,
struct ext3_super_block * es)
{
- unsigned int s_flags = sb->s_flags;
int nr_orphans = 0, nr_truncates = 0;
#ifdef CONFIG_QUOTA
int i;
@@ -1338,9 +1371,9 @@ static void ext3_orphan_cleanup (struct super_block * sb,
return;
}

- if (bdev_read_only(sb->s_bdev)) {
- printk(KERN_ERR "EXT3-fs: write access "
- "unavailable, skipping orphan cleanup.\n");
+ if (sb->s_flags & MS_RDONLY) {
+ printk(KERN_INFO "EXT3-fs: %s: skipping orphan cleanup on "
+ "readonly fs\n", sb->s_id);
return;
}

@@ -1353,11 +1386,6 @@ static void ext3_orphan_cleanup (struct super_block * sb,
return;
}

- if (s_flags & MS_RDONLY) {
- printk(KERN_INFO "EXT3-fs: %s: orphan cleanup on readonly fs\n",
- sb->s_id);
- sb->s_flags &= ~MS_RDONLY;
- }
#ifdef CONFIG_QUOTA
/* Needed for iput() to work correctly and not trash data */
sb->s_flags |= MS_ACTIVE;
@@ -1418,7 +1446,6 @@ static void ext3_orphan_cleanup (struct super_block * sb,
vfs_quota_off(sb, i);
}
#endif
- sb->s_flags = s_flags; /* Restore MS_RDONLY status */
}

/*
@@ -1838,8 +1865,8 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
goto failed_mount3;
}

- /* We have now updated the journal if required, so we can
- * validate the data journaling mode. */
+ /* We have now updated the journal or mapped unrecovered blocks as
+ * required, so we can validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
case 0:
/* No mode set, assume a default based on the journal
@@ -1872,8 +1899,8 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
}
}
/*
- * The journal_load will have done any necessary log recovery,
- * so we can safely mount the rest of the filesystem now.
+ * The journal_load will have done any necessary log recovery or block
+ * remapping, so we can safely mount the rest of the filesystem now.
*/

root = ext3_iget(sb, EXT3_ROOT_INO);
@@ -1907,9 +1934,16 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
EXT3_SB(sb)->s_mount_state |= EXT3_ORPHAN_FS;
ext3_orphan_cleanup(sb, es);
EXT3_SB(sb)->s_mount_state &= ~EXT3_ORPHAN_FS;
- if (needs_recovery)
- printk (KERN_INFO "EXT3-fs: recovery complete.\n");
- ext3_mark_recovery_complete(sb, es);
+
+ if (needs_recovery) {
+ if (sb->s_flags & MS_RDONLY) {
+ printk(KERN_INFO "EXT3-fs: recovery skipped on "
+ "read-only filesystem.\n");
+ } else {
+ ext3_mark_recovery_complete(sb, es);
+ printk(KERN_INFO "EXT3-fs: recovery complete.\n");
+ }
+ }
printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
@@ -2110,7 +2144,6 @@ static int ext3_load_journal(struct super_block *sb,
unsigned int journal_inum = le32_to_cpu(es->s_journal_inum);
dev_t journal_dev;
int err = 0;
- int really_read_only;

if (journal_devnum &&
journal_devnum != le32_to_cpu(es->s_journal_dev)) {
@@ -2120,26 +2153,16 @@ static int ext3_load_journal(struct super_block *sb,
} else
journal_dev = new_decode_dev(le32_to_cpu(es->s_journal_dev));

- really_read_only = bdev_read_only(sb->s_bdev);
-
/*
* Are we loading a blank journal or performing recovery after a
* crash? For recovery, we need to check in advance whether we
* can get read-write access to the device.
*/

- if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER)) {
- if (sb->s_flags & MS_RDONLY) {
- printk(KERN_INFO "EXT3-fs: INFO: recovery "
- "required on readonly filesystem.\n");
- if (really_read_only) {
- printk(KERN_ERR "EXT3-fs: write access "
- "unavailable, cannot proceed.\n");
- return -EROFS;
- }
- printk (KERN_INFO "EXT3-fs: write access will "
- "be enabled during recovery.\n");
- }
+ if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER) &&
+ sb->s_flags & MS_RDONLY) {
+ printk(KERN_INFO "EXT3-fs: INFO: skipping recovery "
+ "on readonly filesystem.\n");
}

if (journal_inum && journal_dev) {
@@ -2156,7 +2179,7 @@ static int ext3_load_journal(struct super_block *sb,
return -EINVAL;
}

- if (!really_read_only && test_opt(sb, UPDATE_JOURNAL)) {
+ if (!(sb->s_flags & MS_RDONLY) && test_opt(sb, UPDATE_JOURNAL)) {
err = journal_update_format(journal);
if (err) {
printk(KERN_ERR "EXT3-fs: error updating journal.\n");
@@ -2166,9 +2189,9 @@ static int ext3_load_journal(struct super_block *sb,
}

if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
- err = journal_wipe(journal, !really_read_only);
+ err = journal_wipe(journal, !(sb->s_flags & MS_RDONLY));
if (!err)
- err = journal_load(journal, !really_read_only);
+ err = journal_load(journal, !(sb->s_flags & MS_RDONLY));

if (err) {
printk(KERN_ERR "EXT3-fs: error loading journal.\n");
@@ -2177,15 +2200,17 @@ static int ext3_load_journal(struct super_block *sb,
}

EXT3_SB(sb)->s_journal = journal;
- ext3_clear_journal_err(sb, es);
+ if (!(sb->s_flags & MS_RDONLY)) {
+ ext3_clear_journal_err(sb, es);

- if (journal_devnum &&
- journal_devnum != le32_to_cpu(es->s_journal_dev)) {
- es->s_journal_dev = cpu_to_le32(journal_devnum);
- sb->s_dirt = 1;
+ if (journal_devnum &&
+ journal_devnum != le32_to_cpu(es->s_journal_dev)) {
+ es->s_journal_dev = cpu_to_le32(journal_devnum);
+ sb->s_dirt = 1;

- /* Make sure we flush the recovery flag to disk. */
- ext3_commit_super(sb, es, 1);
+ /* Make sure we flush the recovery flag to disk. */
+ ext3_commit_super(sb, es, 1);
+ }
}

return 0;
@@ -2494,6 +2519,20 @@ static int ext3_remount (struct super_block * sb, int * flags, char * data)
}

/*
+ * If we have unrecovered journal transactions hanging
+ * around require a full umount/remount for now.
+ */
+ if (sbi->s_journal->j_bt_hash) {
+ printk(KERN_WARNING "EXT3-fs: %s: couldn't "
+ "remount RDWR because of unrecovered "
+ "journal transactions. Please "
+ "umount/remount instead.\n",
+ sb->s_id);
+ err = -EINVAL;
+ goto restore_opts;
+ }
+
+ /*
* Mounting a RDONLY partition read-write, so reread
* and store the current valid flag. (It may have
* been changed by e2fsck since we originally mounted
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index fb89c29..6d0ef6f 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -226,7 +226,7 @@ ext3_xattr_block_get(struct inode *inode, int name_index, const char *name,
if (!EXT3_I(inode)->i_file_acl)
goto cleanup;
ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
- bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
+ bh = ext3_sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
if (!bh)
goto cleanup;
ea_bdebug(bh, "b_count=%d, refcount=%d",
@@ -367,7 +367,7 @@ ext3_xattr_block_list(struct inode *inode, char *buffer, size_t buffer_size)
if (!EXT3_I(inode)->i_file_acl)
goto cleanup;
ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
- bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
+ bh = ext3_sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
error = -EIO;
if (!bh)
goto cleanup;
@@ -641,7 +641,7 @@ ext3_xattr_block_find(struct inode *inode, struct ext3_xattr_info *i,

if (EXT3_I(inode)->i_file_acl) {
/* The inode already has an extended attribute block. */
- bs->bh = sb_bread(sb, EXT3_I(inode)->i_file_acl);
+ bs->bh = ext3_sb_bread(sb, EXT3_I(inode)->i_file_acl);
error = -EIO;
if (!bs->bh)
goto cleanup;
@@ -1213,7 +1213,7 @@ again:
goto again;
break;
}
- bh = sb_bread(inode->i_sb, ce->e_block);
+ bh = ext3_sb_bread(inode->i_sb, ce->e_block);
if (!bh) {
ext3_error(inode->i_sb, __FUNCTION__,
"inode %lu: block %lu read error",
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 36c5403..c5732e1 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -864,6 +864,13 @@ extern void ext3_abort (struct super_block *, const char *, const char *, ...)
extern void ext3_warning (struct super_block *, const char *, const char *, ...)
__attribute__ ((format (printf, 3, 4)));
extern void ext3_update_dynamic_rev (struct super_block *sb);
+extern struct buffer_head *ext3_sb_bread(struct super_block *sb,
+ sector_t block);
+extern struct buffer_head *ext3_sb_getblk(struct super_block *sb,
+ sector_t block);
+extern void ext3_map_bh(struct buffer_head *bh,
+ struct super_block *sb, sector_t block);
+

#define ext3_std_error(sb, errno) \
do { \
--
1.5.3.7


2008-03-06 01:59:09

by Duane Griffin

[permalink] [raw]
Subject: [RFC, PATCH 1/6] jbd: eliminate duplicated code in revocation table init/destroy functions

The revocation table initialisation/destruction code is repeated for each of
the two revocation tables stored in the journal. Refactoring the duplicated
code into functions is tidier, simplifies the logic in initialisation in
particular, and slightly reduces the code size.

There should not be any functional change.

Signed-off-by: Duane Griffin <[email protected]>
---

This change is an independent cleanup which is not required by following
patches in this series. Also it should perhaps be two separate patches?

fs/jbd/revoke.c | 126 +++++++++++++++++++++++--------------------------------
1 files changed, 52 insertions(+), 74 deletions(-)

diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index ad2eacf..5f64df4 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -195,108 +195,86 @@ void journal_destroy_revoke_caches(void)
revoke_table_cache = NULL;
}

-/* Initialise the revoke table for a given journal to a given size. */
-
-int journal_init_revoke(journal_t *journal, int hash_size)
+static int journal_init_revoke_table(struct jbd_revoke_table_s *table, int size)
{
- int shift, tmp;
-
- J_ASSERT (journal->j_revoke_table[0] == NULL);
-
- shift = 0;
- tmp = hash_size;
+ int shift = 0;
+ int tmp = size;
while((tmp >>= 1UL) != 0UL)
shift++;

- journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
- if (!journal->j_revoke_table[0])
- return -ENOMEM;
- journal->j_revoke = journal->j_revoke_table[0];
-
- /* Check that the hash_size is a power of two */
- J_ASSERT(is_power_of_2(hash_size));
-
- journal->j_revoke->hash_size = hash_size;
-
- journal->j_revoke->hash_shift = shift;
-
- journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
- if (!journal->j_revoke->hash_table) {
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
- journal->j_revoke = NULL;
+ table->hash_size = size;
+ table->hash_shift = shift;
+ table->hash_table = kmalloc(
+ size * sizeof(struct list_head), GFP_KERNEL);
+ if (!table->hash_table)
return -ENOMEM;
- }
-
- for (tmp = 0; tmp < hash_size; tmp++)
- INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);

- journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL);
- if (!journal->j_revoke_table[1]) {
- kfree(journal->j_revoke_table[0]->hash_table);
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
- return -ENOMEM;
- }
+ for (tmp = 0; tmp < size; tmp++)
+ INIT_LIST_HEAD(&table->hash_table[tmp]);

- journal->j_revoke = journal->j_revoke_table[1];
+ return 0;
+}

- /* Check that the hash_size is a power of two */
+/* Initialise the revoke table for a given journal to a given size. */
+int journal_init_revoke(journal_t *journal, int hash_size)
+{
+ J_ASSERT(journal->j_revoke_table[0] == NULL);
J_ASSERT(is_power_of_2(hash_size));

- journal->j_revoke->hash_size = hash_size;
+ journal->j_revoke_table[0] = kmem_cache_alloc(
+ revoke_table_cache, GFP_KERNEL);
+ if (!journal->j_revoke_table[0])
+ goto failed_alloc1;
+ if (journal_init_revoke_table(journal->j_revoke_table[0], hash_size))
+ goto failed_init1;

- journal->j_revoke->hash_shift = shift;
+ journal->j_revoke_table[1] = kmem_cache_alloc(
+ revoke_table_cache, GFP_KERNEL);
+ if (!journal->j_revoke_table[1])
+ goto failed_alloc2;
+ if (journal_init_revoke_table(journal->j_revoke_table[1], hash_size))
+ goto failed_init2;

- journal->j_revoke->hash_table =
- kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
- if (!journal->j_revoke->hash_table) {
- kfree(journal->j_revoke_table[0]->hash_table);
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
- kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
- journal->j_revoke = NULL;
- return -ENOMEM;
- }
-
- for (tmp = 0; tmp < hash_size; tmp++)
- INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]);
+ journal->j_revoke = journal->j_revoke_table[1];

spin_lock_init(&journal->j_revoke_lock);

return 0;
-}

-/* Destoy a journal's revoke table. The table must already be empty! */
+failed_init2:
+ kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]);
+failed_alloc2:
+ kfree(journal->j_revoke_table[0]->hash_table);
+failed_init1:
+ kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]);
+failed_alloc1:
+ return -ENOMEM;
+}

-void journal_destroy_revoke(journal_t *journal)
+static void journal_destroy_revoke_table(struct jbd_revoke_table_s *table)
{
- struct jbd_revoke_table_s *table;
- struct list_head *hash_list;
int i;
-
- table = journal->j_revoke_table[0];
- if (!table)
- return;
-
- for (i=0; i<table->hash_size; i++) {
+ struct list_head *hash_list;
+ for (i = 0; i < table->hash_size; i++) {
hash_list = &table->hash_table[i];
- J_ASSERT (list_empty(hash_list));
+ J_ASSERT(list_empty(hash_list));
}

kfree(table->hash_table);
kmem_cache_free(revoke_table_cache, table);
- journal->j_revoke = NULL;
+}

- table = journal->j_revoke_table[1];
- if (!table)
+/* Destroy a journal's revoke table. The table must already be empty! */
+void journal_destroy_revoke(journal_t *journal)
+{
+ if (!journal->j_revoke_table[0])
return;
+ journal_destroy_revoke_table(journal->j_revoke_table[0]);
+ journal->j_revoke = NULL;

- for (i=0; i<table->hash_size; i++) {
- hash_list = &table->hash_table[i];
- J_ASSERT (list_empty(hash_list));
- }
-
- kfree(table->hash_table);
- kmem_cache_free(revoke_table_cache, table);
+ if (!journal->j_revoke_table[1])
+ return;
+ journal_destroy_revoke_table(journal->j_revoke_table[1]);
journal->j_revoke = NULL;
}

--
1.5.3.7

2008-03-06 01:59:10

by Duane Griffin

[permalink] [raw]
Subject: [RFC, PATCH 2/6] jbd: replace potentially false assertion with if block

If an error occurs during jbd cache initialisation it is possible for the
journal_head_cache to be NULL when journal_destroy_journal_head_cache is
called. Replace the J_ASSERT with an if block to handle the situation
correctly.

Note that even with this fix things will break badly if ext3 is statically
compiled in and cache initialisation fails.

Signed-off-by: Duane Griffin <[email protected]>
---

At the moment the cache destruction methods are inconsistent in whether they
set the cache pointers to NULL after destroying them. To be precise,
journal_destroy_handle_cache doesn't; the others do. I haven't changed that,
although I was sorely tempted. I think it would be better to be consistent,
and that NULLing the pointers is preferable. Any objections?
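
For illustration, the consistent form argued for above would look something
like this (a sketch only, not part of this patch):

static void journal_destroy_handle_cache(void)
{
	/* Destroy the cache and NULL the pointer, matching the other
	 * destruction helpers. */
	if (jbd_handle_cache) {
		kmem_cache_destroy(jbd_handle_cache);
		jbd_handle_cache = NULL;
	}
}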

This patch is an independent bug fix and following patches in this series do
not depend on it.

fs/jbd/journal.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 3943a89..3f334a3 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1635,9 +1635,10 @@ static int journal_init_journal_head_cache(void)

static void journal_destroy_journal_head_cache(void)
{
- J_ASSERT(journal_head_cache != NULL);
- kmem_cache_destroy(journal_head_cache);
- journal_head_cache = NULL;
+ if (journal_head_cache) {
+ kmem_cache_destroy(journal_head_cache);
+ journal_head_cache = NULL;
+ }
}

/*
--
1.5.3.7

2008-03-06 01:59:11

by Duane Griffin

[permalink] [raw]
Subject: [RFC, PATCH 3/6] jbd: only create debugfs entries if cache initialisation is successful

JBD debugfs entries should only be created if cache initialisation is
successful. At the moment they are being created unconditionally, which will
leave them dangling if cache (and hence module) initialisation fails.

Signed-off-by: Duane Griffin <[email protected]>
---

This patch is an independent bug fix and following patches in this series do
not depend on it.

fs/jbd/journal.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 3f334a3..a868277 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -1942,9 +1942,10 @@ static int __init journal_init(void)
BUILD_BUG_ON(sizeof(struct journal_superblock_s) != 1024);

ret = journal_init_caches();
- if (ret != 0)
+ if (ret == 0)
+ jbd_create_debugfs_entry();
+ else
journal_destroy_caches();
- jbd_create_debugfs_entry();
return ret;
}

--
1.5.3.7

2008-03-06 01:59:12

by Duane Griffin

[permalink] [raw]
Subject: [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

The do_one_pass function iterates through the jbd log, operating on blocks
according to the pass. During the replay pass, descriptor blocks are processed
by an inner loop which replays them. The nesting makes the code difficult to
follow or to modify. Splitting the inner loop and replay code into separate
functions simplifies matters.

The refactored code no longer unnecessarily reads revoked data blocks, so
potential read errors from such blocks will cause differing behaviour. Aside
from that there should be no functional effect.

Signed-off-by: Duane Griffin <[email protected]>
---

Note the TODO before the block read in the outer loop below. We could possibly
attempt to continue after a failed read, as we do when replaying descriptor
blocks. However, if it was a descriptor block we'd likely bomb out on the next
iteration anyway, so I'm not sure it would be worth it. Thoughts?
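
For concreteness, the continue-on-error alternative in the outer loop would
look roughly like this (a sketch only, not part of the patch):

		err = jread(&bh, journal, next_log_block);
		if (err) {
			/* Hypothetical: note the failure and keep scanning
			 * rather than aborting the pass. */
			printk(KERN_ERR "JBD: IO error %d checking "
				"block %ld in log\n", err, next_log_block);
			success = err;
			next_log_block++;
			wrap(journal, next_log_block);
			continue;
		}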

fs/jbd/recovery.c | 243 ++++++++++++++++++++++++++++-------------------------
1 files changed, 127 insertions(+), 116 deletions(-)

diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index 2b8edf4..e2ac78f 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -307,6 +307,101 @@ int journal_skip_recovery(journal_t *journal)
return err;
}

+static int replay_data_block(
+ journal_t *journal, struct buffer_head *obh, char *data,
+ int flags, unsigned long blocknr)
+{
+ struct buffer_head *nbh;
+
+ /* Find a buffer for the new data being restored */
+ nbh = __getblk(journal->j_fs_dev, blocknr, journal->j_blocksize);
+ if (nbh == NULL)
+ return -ENOMEM;
+
+ lock_buffer(nbh);
+ memcpy(nbh->b_data, obh->b_data, journal->j_blocksize);
+ if (flags & JFS_FLAG_ESCAPE)
+ *((__be32 *)data) = cpu_to_be32(JFS_MAGIC_NUMBER);
+
+ BUFFER_TRACE(nbh, "marking dirty");
+ set_buffer_uptodate(nbh);
+ mark_buffer_dirty(nbh);
+ BUFFER_TRACE(nbh, "marking uptodate");
+ /* ll_rw_block(WRITE, 1, &nbh); */
+ unlock_buffer(nbh);
+ brelse(nbh);
+
+ return 0;
+}
+
+/* Replay a descriptor block: write all of the data blocks. */
+static int replay_descriptor_block(
+ journal_t *journal, char *data,
+ unsigned int next_commit_ID,
+ unsigned long *next_log_block,
+ struct recovery_info *info)
+{
+ int err;
+ int success = 0;
+ char *tagp;
+ unsigned long io_block;
+ struct buffer_head *obh;
+ unsigned long next = *next_log_block;
+
+ tagp = &data[sizeof(journal_header_t)];
+ while ((tagp - data + sizeof(journal_block_tag_t)) <=
+ journal->j_blocksize) {
+ unsigned long blocknr;
+ journal_block_tag_t *tag = (journal_block_tag_t *) tagp;
+ int flags = be32_to_cpu(tag->t_flags);
+
+ io_block = next++;
+ wrap(journal, next);
+ blocknr = be32_to_cpu(tag->t_blocknr);
+
+ /* If the block has been revoked, then we're all done here. */
+ if (journal_test_revoke(journal, blocknr, next_commit_ID)) {
+ ++info->nr_revoke_hits;
+ goto skip_write;
+ }
+
+ err = jread(&obh, journal, io_block);
+ if (err) {
+
+ /* Recover what we can, but
+ * report failure at the end. */
+ success = err;
+ printk(KERN_ERR "JBD: IO error %d recovering "
+ "block %ld in log\n", err, io_block);
+ continue;
+ }
+ J_ASSERT(obh != NULL);
+
+ err = replay_data_block(journal, obh, data, flags, blocknr);
+ if (err) {
+ brelse(obh);
+ printk(KERN_ERR "JBD: Out of memory during recovery\n");
+ success = err;
+ goto failed;
+ }
+
+ ++info->nr_replays;
+ brelse(obh);
+
+skip_write:
+ tagp += sizeof(journal_block_tag_t);
+ if (!(flags & JFS_FLAG_SAME_UUID))
+ tagp += 16;
+
+ if (flags & JFS_FLAG_LAST_TAG)
+ break;
+ }
+
+ *next_log_block = next;
+failed:
+ return success;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -316,8 +411,6 @@ static int do_one_pass(journal_t *journal,
journal_superblock_t * sb;
journal_header_t * tmp;
struct buffer_head * bh;
- unsigned int sequence;
- int blocktype;

/* Precompute the maximum metadata descriptors in a descriptor block */
int MAX_BLOCKS_PER_DESC;
@@ -348,11 +441,8 @@ static int do_one_pass(journal_t *journal,
*/

while (1) {
- int flags;
- char * tagp;
- journal_block_tag_t * tag;
- struct buffer_head * obh;
- struct buffer_head * nbh;
+ int blocktype;
+ unsigned int sequence;

cond_resched();

@@ -371,10 +461,13 @@ static int do_one_pass(journal_t *journal,
* either the next descriptor block or the final commit
* record. */

+ /* TODO: Should we continue on error? */
jbd_debug(3, "JBD: checking block %ld\n", next_log_block);
err = jread(&bh, journal, next_log_block);
- if (err)
+ if (err) {
+ success = err;
goto failed;
+ }

next_log_block++;
wrap(journal, next_log_block);
@@ -406,137 +499,57 @@ static int do_one_pass(journal_t *journal,
* all of the sequence number checks. What are we going
* to do with it? That depends on the pass... */

- switch(blocktype) {
+ switch (blocktype) {
case JFS_DESCRIPTOR_BLOCK:
/* If it is a valid descriptor block, replay it
* in pass REPLAY; otherwise, just skip over the
* blocks it describes. */
- if (pass != PASS_REPLAY) {
+ if (pass == PASS_REPLAY) {
+ err = replay_descriptor_block(
+ journal,
+ bh->b_data,
+ next_commit_ID,
+ &next_log_block,
+ info);
+ } else {
next_log_block +=
count_tags(bh, journal->j_blocksize);
wrap(journal, next_log_block);
- brelse(bh);
- continue;
}
-
- /* A descriptor block: we can now write all of
- * the data blocks. Yay, useful work is finally
- * getting done here! */
-
- tagp = &bh->b_data[sizeof(journal_header_t)];
- while ((tagp - bh->b_data +sizeof(journal_block_tag_t))
- <= journal->j_blocksize) {
- unsigned long io_block;
-
- tag = (journal_block_tag_t *) tagp;
- flags = be32_to_cpu(tag->t_flags);
-
- io_block = next_log_block++;
- wrap(journal, next_log_block);
- err = jread(&obh, journal, io_block);
- if (err) {
- /* Recover what we can, but
- * report failure at the end. */
- success = err;
- printk (KERN_ERR
- "JBD: IO error %d recovering "
- "block %ld in log\n",
- err, io_block);
- } else {
- unsigned long blocknr;
-
- J_ASSERT(obh != NULL);
- blocknr = be32_to_cpu(tag->t_blocknr);
-
- /* If the block has been
- * revoked, then we're all done
- * here. */
- if (journal_test_revoke
- (journal, blocknr,
- next_commit_ID)) {
- brelse(obh);
- ++info->nr_revoke_hits;
- goto skip_write;
- }
-
- /* Find a buffer for the new
- * data being restored */
- nbh = __getblk(journal->j_fs_dev,
- blocknr,
- journal->j_blocksize);
- if (nbh == NULL) {
- printk(KERN_ERR
- "JBD: Out of memory "
- "during recovery.\n");
- err = -ENOMEM;
- brelse(bh);
- brelse(obh);
- goto failed;
- }
-
- lock_buffer(nbh);
- memcpy(nbh->b_data, obh->b_data,
- journal->j_blocksize);
- if (flags & JFS_FLAG_ESCAPE) {
- *((__be32 *)bh->b_data) =
- cpu_to_be32(JFS_MAGIC_NUMBER);
- }
-
- BUFFER_TRACE(nbh, "marking dirty");
- set_buffer_uptodate(nbh);
- mark_buffer_dirty(nbh);
- BUFFER_TRACE(nbh, "marking uptodate");
- ++info->nr_replays;
- /* ll_rw_block(WRITE, 1, &nbh); */
- unlock_buffer(nbh);
- brelse(obh);
- brelse(nbh);
- }
-
- skip_write:
- tagp += sizeof(journal_block_tag_t);
- if (!(flags & JFS_FLAG_SAME_UUID))
- tagp += 16;
-
- if (flags & JFS_FLAG_LAST_TAG)
- break;
- }
-
- brelse(bh);
- continue;
+ break;

case JFS_COMMIT_BLOCK:
/* Found an expected commit block: not much to
* do other than move on to the next sequence
* number. */
- brelse(bh);
next_commit_ID++;
- continue;
+ break;

case JFS_REVOKE_BLOCK:
/* If we aren't in the REVOKE pass, then we can
* just skip over this block. */
- if (pass != PASS_REVOKE) {
- brelse(bh);
- continue;
+ if (pass == PASS_REVOKE) {
+ err = scan_revoke_records(
+ journal, bh, next_commit_ID, info);
}
-
- err = scan_revoke_records(journal, bh,
- next_commit_ID, info);
- brelse(bh);
- if (err)
- goto failed;
- continue;
+ break;

default:
jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
blocktype);
- brelse(bh);
- goto done;
+ break;
}
+
+ brelse(bh);
+
+ /* Immediately fail on OOM; on other errors try to recover
+ * as much as we can, but report the failure at the end. */
+ if (err)
+ success = err;
+ if (err == -ENOMEM)
+ goto failed;
}

- done:
/*
* We broke out of the log scan loop: either we came to the
* known end of the log or we found an unexpected block in the
@@ -558,10 +571,8 @@ static int do_one_pass(journal_t *journal,
}
}

- return success;
-
failed:
- return err;
+ return success;
}


--
1.5.3.7

2008-03-06 01:59:13

by Duane Griffin

[permalink] [raw]
Subject: [RFC, PATCH 5/6] jbd: add support for read-only log recovery

Read-only log recovery allows a dirty journalled filesystem to be mounted and
to provide a consistent view of its data without writing to the disk. Instead
of replaying the log, a mapping is constructed between modified filesystem
blocks and the journal blocks containing their data.

The mapping is stored in a hashtable that is created and populated during
journal recovery. The hash function is the same one used by the journal
revocation code. The size of the hashtable is currently hard-coded, somewhat
arbitrarily, to 256 entries. Given that we know the number of block mappings
when we create the table, it may be worth calculating an appropriate size
dynamically instead.
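
As a rough illustration of the dynamic sizing idea (a sketch only;
bt_hash_bits_for is a hypothetical helper, not something in this patch):

static int bt_hash_bits_for(int nr_mappings)
{
	int bits = 4;		/* at least 16 buckets */

	/* Scale the table with the expected number of mappings, capped
	 * at 4096 buckets. */
	while ((1 << bits) < nr_mappings && bits < 12)
		bits++;
	return bits;
}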

Signed-off-by: Duane Griffin <[email protected]>
---

I'm tempted to add warnings on attempts to modify a read-only journal; does
that sound useful? I'd also be grateful to anyone with suggestions for better
member names. :)

fs/ext3/super.c | 2 +-
fs/jbd/checkpoint.c | 2 +-
fs/jbd/commit.c | 2 +-
fs/jbd/journal.c | 56 ++++++++++------
fs/jbd/recovery.c | 187 +++++++++++++++++++++++++++++++++++++++++++++------
fs/jbd/revoke.c | 7 +--
fs/ocfs2/journal.c | 4 +-
include/linux/jbd.h | 41 ++++++++++-
8 files changed, 246 insertions(+), 55 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 18769cc..4b9ff65 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2168,7 +2168,7 @@ static int ext3_load_journal(struct super_block *sb,
if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
err = journal_wipe(journal, !really_read_only);
if (!err)
- err = journal_load(journal);
+ err = journal_load(journal, !really_read_only);

if (err) {
printk(KERN_ERR "EXT3-fs: error loading journal.\n");
diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
index a5432bb..84d4579 100644
--- a/fs/jbd/checkpoint.c
+++ b/fs/jbd/checkpoint.c
@@ -454,7 +454,7 @@ int cleanup_journal_tail(journal_t *journal)
journal->j_tail = blocknr;
spin_unlock(&journal->j_state_lock);
if (!(journal->j_flags & JFS_ABORT))
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, 1, 1);
return 0;
}

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index a38c718..c63598b 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -310,7 +310,7 @@ void journal_commit_transaction(journal_t *journal)
/* Do we need to erase the effects of a prior journal_flush? */
if (journal->j_flags & JFS_FLUSHED) {
jbd_debug(3, "super block updated\n");
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, 1, 1);
} else {
jbd_debug(3, "superblock not updated\n");
}
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index a868277..b2723a0 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -822,7 +822,7 @@ static void journal_fail_superblock (journal_t *journal)
* subsequent use.
*/

-static int journal_reset(journal_t *journal)
+static int journal_reset(journal_t *journal, int write)
{
journal_superblock_t *sb = journal->j_superblock;
unsigned long first, last;
@@ -843,8 +843,8 @@ static int journal_reset(journal_t *journal)

journal->j_max_transaction_buffers = journal->j_maxlen / 4;

- /* Add the dynamic fields and write it to disk. */
- journal_update_superblock(journal, 1);
+ /* Add the dynamic fields and write it to disk if not read-only. */
+ journal_update_superblock(journal, write, 1);
return journal_start_thread(journal);
}

@@ -916,21 +916,21 @@ int journal_create(journal_t *journal)
journal->j_flags &= ~JFS_ABORT;
journal->j_format_version = 2;

- return journal_reset(journal);
+ return journal_reset(journal, 1);
}

/**
- * void journal_update_superblock() - Update journal sb on disk.
+ * void journal_update_superblock() - Update journal sb in memory and on disk.
* @journal: The journal to update.
+ * @write: Set to '0' if you don't want to write to the disk.
* @wait: Set to '0' if you don't want to wait for IO completion.
*
* Update a journal's dynamic superblock fields and write it to disk,
* optionally waiting for the IO to complete.
*/
-void journal_update_superblock(journal_t *journal, int wait)
+void journal_update_superblock(journal_t *journal, int write, int wait)
{
journal_superblock_t *sb = journal->j_superblock;
- struct buffer_head *bh = journal->j_sb_buffer;

/*
* As a special case, if the on-disk copy is already marked as needing
@@ -957,12 +957,16 @@ void journal_update_superblock(journal_t *journal, int wait)
sb->s_errno = cpu_to_be32(journal->j_errno);
spin_unlock(&journal->j_state_lock);

- BUFFER_TRACE(bh, "marking dirty");
- mark_buffer_dirty(bh);
- if (wait)
- sync_dirty_buffer(bh);
- else
- ll_rw_block(SWRITE, 1, &bh);
+ if (write) {
+ struct buffer_head *bh = journal->j_sb_buffer;
+
+ BUFFER_TRACE(bh, "marking dirty");
+ mark_buffer_dirty(bh);
+ if (wait)
+ sync_dirty_buffer(bh);
+ else
+ ll_rw_block(SWRITE, 1, &bh);
+ }

out:
/* If we have just flushed the log (by marking s_start==0), then
@@ -1066,12 +1070,13 @@ static int load_superblock(journal_t *journal)
/**
* int journal_load() - Read journal from disk.
* @journal: Journal to act on.
+ * @write: Whether the journal may be modified on-disk.
*
* Given a journal_t structure which tells us which disk blocks contain
* a journal, read the journal from disk to initialise the in-memory
* structures.
*/
-int journal_load(journal_t *journal)
+int journal_load(journal_t *journal, int write)
{
int err;
journal_superblock_t *sb;
@@ -1097,13 +1102,13 @@ int journal_load(journal_t *journal)

/* Let the recovery code check whether it needs to recover any
* data from the journal. */
- if (journal_recover(journal))
+ if (journal_recover(journal, write))
goto recovery_error;

/* OK, we've finished with the dynamic journal bits:
* reinitialise the dynamic contents of the superblock in memory
* and reset them on disk. */
- if (journal_reset(journal))
+ if (journal_reset(journal, write))
goto recovery_error;

journal->j_flags &= ~JFS_ABORT;
@@ -1150,7 +1155,12 @@ void journal_destroy(journal_t *journal)
journal->j_tail = 0;
journal->j_tail_sequence = ++journal->j_transaction_sequence;
if (journal->j_sb_buffer) {
- journal_update_superblock(journal, 1);
+
+ /*
+ * Ugly, but what to do?
+ * We don't have a superblock to test for read-only.
+ */
+ journal_update_superblock(journal, !journal->j_bt_hash, 1);
brelse(journal->j_sb_buffer);
}

@@ -1158,6 +1168,8 @@ void journal_destroy(journal_t *journal)
iput(journal->j_inode);
if (journal->j_revoke)
journal_destroy_revoke(journal);
+ if (journal->j_bt_hash)
+ journal_destroy_translations(journal);
kfree(journal->j_wbuf);
kfree(journal);
}
@@ -1374,7 +1386,7 @@ int journal_flush(journal_t *journal)
old_tail = journal->j_tail;
journal->j_tail = 0;
spin_unlock(&journal->j_state_lock);
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, 1, 1);
spin_lock(&journal->j_state_lock);
journal->j_tail = old_tail;

@@ -1420,8 +1432,7 @@ int journal_wipe(journal_t *journal, int write)
write ? "Clearing" : "Ignoring");

err = journal_skip_recovery(journal);
- if (write)
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, write, 1);

no_recovery:
return err;
@@ -1489,7 +1500,7 @@ static void __journal_abort_soft (journal_t *journal, int errno)
__journal_abort_hard(journal);

if (errno)
- journal_update_superblock(journal, 1);
+ journal_update_superblock(journal, !journal->j_bt_hash, 1);
}

/**
@@ -1922,6 +1933,8 @@ static int __init journal_init_caches(void)

ret = journal_init_revoke_caches();
if (ret == 0)
+ ret = journal_init_recover_cache();
+ if (ret == 0)
ret = journal_init_journal_head_cache();
if (ret == 0)
ret = journal_init_handle_cache();
@@ -1931,6 +1944,7 @@ static int __init journal_init_caches(void)
static void journal_destroy_caches(void)
{
journal_destroy_revoke_caches();
+ journal_destroy_recover_cache();
journal_destroy_journal_head_cache();
journal_destroy_handle_cache();
}
diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index e2ac78f..a73b062 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -17,12 +17,15 @@
#include "jfs_user.h"
#else
#include <linux/time.h>
+#include <linux/cache.h>
#include <linux/fs.h>
#include <linux/jbd.h>
#include <linux/errno.h>
#include <linux/slab.h>
#endif

+static struct kmem_cache *bt_record_cache;
+
/*
* Maintain information about the progress of the recovery job, so that
* the different passes can carry information between them.
@@ -123,6 +126,71 @@ failed:

#endif /* __KERNEL__ */

+int __init journal_init_recover_cache(void)
+{
+ bt_record_cache = kmem_cache_create(
+ "block_translation_record",
+ sizeof(struct journal_block_translation),
+ 0, 0, NULL);
+ if (bt_record_cache == NULL) {
+ printk(KERN_EMERG "JBD: failed to create recover cache\n");
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+void journal_destroy_recover_cache(void)
+{
+ if (bt_record_cache) {
+ kmem_cache_destroy(bt_record_cache);
+ bt_record_cache = NULL;
+ }
+}
+
+bool journal_translate_block(journal_t *journal, sector_t blocknr,
+ sector_t *retp)
+{
+ int hash;
+ struct journal_block_translation *jbt;
+ struct hlist_node *node;
+
+ if (!journal || !journal->j_bt_hash)
+ goto unmapped;
+
+ hash = jbd_hash(blocknr, journal->j_bt_hash_bits,
+ 1 << journal->j_bt_hash_bits);
+ hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
+ if (jbt->original == blocknr) {
+ *retp = jbt->target;
+ return true;
+ }
+ }
+
+unmapped:
+ *retp = blocknr;
+ return false;
+}
+
+void journal_destroy_translations(journal_t *journal)
+{
+ int ii;
+ struct journal_block_translation *jbt;
+ struct hlist_node *tmp;
+ struct hlist_node *node;
+
+ if (!journal->j_bt_hash)
+ return;
+
+ for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++) {
+ hlist_for_each_entry_safe(jbt, node, tmp,
+ journal->j_bt_hash + ii, jbt_list) {
+ kmem_cache_free(bt_record_cache, jbt);
+ }
+ }
+
+ kfree(journal->j_bt_hash);
+ journal->j_bt_hash = NULL;
+}

/*
* Read a block from the journal
@@ -173,6 +241,45 @@ static int jread(struct buffer_head **bhp, journal_t *journal,
return 0;
}

+static int journal_record_bt(journal_t *journal, sector_t original,
+ sector_t target)
+{
+ int err;
+ int hash;
+ struct hlist_node *node;
+ struct journal_block_translation *jbt;
+ sector_t blocknr;
+
+ /* Store the physical block to avoid repeated lookups. */
+ err = journal_bmap(journal, target, &blocknr);
+ if (err)
+ return err;
+
+ /*
+ * If the same block has been translated multiple times,
+ * just use the last.
+ */
+ hash = jbd_hash(original, journal->j_bt_hash_bits,
+ 1 << journal->j_bt_hash_bits);
+ hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
+ if (jbt->original == original) {
+ jbt->target = blocknr;
+ return 0;
+ }
+ }
+
+ jbt = kmem_cache_alloc(bt_record_cache, GFP_NOFS);
+ if (!jbt)
+ return -ENOMEM;
+
+ INIT_HLIST_NODE(&jbt->jbt_list);
+ jbt->original = original;
+ jbt->target = blocknr;
+
+ hlist_add_head(&jbt->jbt_list, journal->j_bt_hash + hash);
+
+ return 0;
+}

/*
* Count the number of in-use tags in a journal descriptor block.
@@ -201,6 +308,24 @@ static int count_tags(struct buffer_head *bh, int size)
return nr;
}

+#define JOURNAL_DEFAULT_BT_HASHBITS 8
+static int bt_hash_init(journal_t *journal)
+{
+ int ii;
+ BUG_ON(journal->j_bt_hash);
+
+ /* Vary size depending on the number of mappings/device size? */
+ journal->j_bt_hash_bits = JOURNAL_DEFAULT_BT_HASHBITS;
+ journal->j_bt_hash = kmalloc(sizeof(struct hlist_head) *
+ 1 << journal->j_bt_hash_bits, GFP_NOFS);
+ if (!journal->j_bt_hash)
+ return -ENOMEM;
+
+ for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++)
+ INIT_HLIST_HEAD(journal->j_bt_hash + ii);
+
+ return 0;
+}

/* Make sure we wrap around the log correctly! */
#define wrap(journal, var) \
@@ -212,6 +337,7 @@ do { \
/**
* journal_recover - recovers a on-disk journal
* @journal: the journal to recover
+ * @write: whether to write to disk
*
* The primary function for recovering the log contents when mounting a
* journaled device.
@@ -220,8 +346,11 @@ do { \
* end of the log. In the second, we assemble the list of revoke
* blocks. In the third and final pass, we replay any un-revoked blocks
* in the log.
+ *
+ * If the journal is read-only a translation table will be built up instead
+ * of the journal being replayed on disk.
*/
-int journal_recover(journal_t *journal)
+int journal_recover(journal_t *journal, int write)
{
int err;
journal_superblock_t * sb;
@@ -244,6 +373,16 @@ int journal_recover(journal_t *journal)
return 0;
}

+ if (!write) {
+ err = bt_hash_init(journal);
+ if (err) {
+ printk(KERN_ERR
+ "JBD: out of memory allocating"
+ " journal block translation hash\n");
+ return err;
+ }
+ }
+
err = do_one_pass(journal, &info, PASS_SCAN);
if (!err)
err = do_one_pass(journal, &info, PASS_REVOKE);
@@ -256,12 +395,15 @@ int journal_recover(journal_t *journal)
jbd_debug(1, "JBD: Replayed %d and revoked %d/%d blocks\n",
info.nr_replays, info.nr_revoke_hits, info.nr_revokes);

- /* Restart the log at the next transaction ID, thus invalidating
- * any existing commit records in the log. */
- journal->j_transaction_sequence = ++info.end_transaction;
+ if (write) {
+
+ /* Restart the log at the next transaction ID, thus invalidating
+ * any existing commit records in the log. */
+ journal->j_transaction_sequence = ++info.end_transaction;
+ journal_clear_revoke(journal);
+ sync_blockdev(journal->j_fs_dev);
+ }

- journal_clear_revoke(journal);
- sync_blockdev(journal->j_fs_dev);
return err;
}

@@ -345,7 +487,6 @@ static int replay_descriptor_block(
int success = 0;
char *tagp;
unsigned long io_block;
- struct buffer_head *obh;
unsigned long next = *next_log_block;

tagp = &data[sizeof(journal_header_t)];
@@ -365,28 +506,34 @@ static int replay_descriptor_block(
goto skip_write;
}

- err = jread(&obh, journal, io_block);
- if (err) {
+ if (journal->j_bt_hash) {
+ err = journal_record_bt(journal, blocknr, io_block);
+ } else {
+ struct buffer_head *obh;

- /* Recover what we can, but
- * report failure at the end. */
- success = err;
- printk(KERN_ERR "JBD: IO error %d recovering "
- "block %ld in log\n", err, io_block);
- continue;
- }
- J_ASSERT(obh != NULL);
+ err = jread(&obh, journal, io_block);
+ if (err) {

- err = replay_data_block(journal, obh, data, flags, blocknr);
- if (err) {
+ /* Recover what we can, but
+ * report failure at the end. */
+ success = err;
+ printk(KERN_ERR "JBD: IO error %d recovering "
+ "block %ld in log\n", err, io_block);
+ continue;
+ }
+ J_ASSERT(obh != NULL);
+
+ err = replay_data_block(
+ journal, obh, data, flags, blocknr);
brelse(obh);
+ }
+ if (err) {
printk(KERN_ERR "JBD: Out of memory during recovery\n");
success = err;
goto failed;
}

++info->nr_replays;
- brelse(obh);

skip_write:
tagp += sizeof(journal_block_tag_t);
diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index 5f64df4..7f57d03 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -105,15 +105,10 @@ static void flush_descriptor(journal_t *, struct journal_head *, int);

/* Utility functions to maintain the revoke table */

-/* Borrowed from buffer.c: this is a tried and tested block hash function */
static inline int hash(journal_t *journal, unsigned long block)
{
struct jbd_revoke_table_s *table = journal->j_revoke;
- int hash_shift = table->hash_shift;
-
- return ((block << (hash_shift - 6)) ^
- (block >> 13) ^
- (block << (hash_shift - 12))) & (table->hash_size - 1);
+ return jbd_hash(block, table->hash_shift, table->hash_size);
}

static int insert_revoke_hash(journal_t *journal, unsigned long blocknr,
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index f31c7e8..cbd53aa 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -591,7 +591,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local)

osb = journal->j_osb;

- status = journal_load(journal->j_journal);
+ status = journal_load(journal->j_journal, true);
if (status < 0) {
mlog(ML_ERROR, "Failed to load journal!\n");
goto done;
@@ -1014,7 +1014,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
goto done;
}

- status = journal_load(journal);
+ status = journal_load(journal, 1);
if (status < 0) {
mlog_errno(status);
if (!igrab(inode))
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index b18fd3b..6154082 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -558,6 +558,18 @@ struct transaction_s
};

/**
+ * struct journal_block_translation - hash bucket recording block translations.
+ * @jbt_list: list of translated blocks in this bucket.
+ * @original: the (logical?) filesystem block being translated.
+ * @target: the physical journal block holding the original's data.
+ */
+struct journal_block_translation {
+ struct hlist_node jbt_list;
+ sector_t original;
+ sector_t target;
+};
+
+/**
* struct journal_s - The journal_s type is the concrete type associated with
* journal_t.
* @j_flags: General journaling state flags
@@ -618,6 +630,8 @@ struct transaction_s
* @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
* number that will fit in j_blocksize
* @j_last_sync_writer: most recent pid which did a synchronous write
+ * @j_bt_hash: hash table mapping uncommitted blocks to their targets
+ * @j_bt_hash_bits: number of bits in block translation hash table
* @j_private: An opaque pointer to fs-private information.
*/

@@ -810,6 +824,12 @@ struct journal_s
pid_t j_last_sync_writer;

/*
+ * Remap uncommitted transactions in a read-only filesystem.
+ */
+ struct hlist_head *j_bt_hash;
+ int j_bt_hash_bits;
+
+ /*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here
*/
@@ -916,12 +936,12 @@ extern int journal_check_available_features
extern int journal_set_features
(journal_t *, unsigned long, unsigned long, unsigned long);
extern int journal_create (journal_t *);
-extern int journal_load (journal_t *journal);
+extern int journal_load (journal_t *journal, int);
extern void journal_destroy (journal_t *);
-extern int journal_recover (journal_t *journal);
+extern int journal_recover (journal_t *journal, int);
extern int journal_wipe (journal_t *, int);
extern int journal_skip_recovery (journal_t *);
-extern void journal_update_superblock (journal_t *, int);
+extern void journal_update_superblock (journal_t *, int, int);
extern void journal_abort (journal_t *, int);
extern int journal_errno (journal_t *);
extern void journal_ack_err (journal_t *);
@@ -970,6 +990,12 @@ extern int journal_test_revoke(journal_t *, unsigned long, tid_t);
extern void journal_clear_revoke(journal_t *);
extern void journal_switch_revoke_table(journal_t *journal);

+extern bool journal_translate_block(journal_t *journal, sector_t blocknr,
+ sector_t *retp);
+extern void journal_destroy_translations(journal_t *journal);
+extern int journal_init_recover_cache(void);
+extern void journal_destroy_recover_cache(void);
+
/*
* The log thread user interface:
*
@@ -989,6 +1015,15 @@ void __log_wait_for_space(journal_t *journal);
extern void __journal_drop_transaction(journal_t *, transaction_t *);
extern int cleanup_journal_tail(journal_t *);

+/* Borrowed from buffer.c: this is a tried and tested block hash function */
+static inline int jbd_hash(unsigned long block, int shift, int size)
+{
+ return ((block << (shift - 6)) ^
+ (block >> 13) ^
+ (block << (shift - 12))) &
+ (size - 1);
+}
+
/* Debugging code only: */

#define jbd_ENOSYS() \
--
1.5.3.7

2008-03-06 01:59:14

by Duane Griffin

[permalink] [raw]
Subject: [RFC, PATCH 6/6] ext3: do not write to the disk when mounting a dirty read-only filesystem

If a filesystem is being mounted read-only then load the journal read-only
(i.e. do not replay the log) and do not process the orphan list.

At present, when a dirty ext3 filesystem is mounted read-only, it writes to the
disk while replaying the journal log and cleaning up the orphan list. This
behaviour may surprise users and can potentially cause data corruption/loss
(e.g. if a system is suspended, booted into a different OS, then resumed).

Introduce block accessor wrapper functions that check for blocks requiring
translation and access them from the journal as needed. Convert the ext3 code
to use the wrappers anywhere that may be called on a read-only filesystem. Code
that is only called when writing to the filesystem is not converted.
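
As a concrete example of the conversion, a typical call site changes like this
(an illustrative before/after pair; see the hunks below for the real ones):

	/* Before: always read the block from the filesystem device. */
	bh = sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));

	/* After: read the unreplayed copy from the journal when the block
	 * has a translation, otherwise fall back to the device. */
	bh = ext3_sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));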

For a discussion of the need for this feature, see:
http://marc.info/?l=linux-kernel&m=117607695406580

Tested by creating dirty filesystems, mounting a copy read-write, taking the
md5sum of all its files, mounting a different copy read-only, confirming the
md5sums match, unmounting it, then confirming the block device mounted
read-only has not been modified. Testing has been done with both internal and
external journals.

NOTE: For now I'm simply preventing filesystems requiring recovery from being
remounted read-write. This breaks booting with an uncleanly unmounted root
filesystem!

Signed-off-by: Duane Griffin <[email protected]>
---

I went back and forth on converting all code to use the accessors. In the end
it felt silly to shunt write code through a wrapper that is only useful on a
read-only filesystem. On the other hand, the inconsistency bothers me and
presumably increases the risk of future changes operating directly on
filesystem blocks when they need to go through the wrappers. I'd appreciate
second opinions on this point in particular.

Also, would it perhaps be better to split this into separate patches to
introduce the accessors and add the no-replay code?

fs/ext3/balloc.c | 2 +-
fs/ext3/ialloc.c | 2 +-
fs/ext3/inode.c | 8 ++--
fs/ext3/resize.c | 2 +-
fs/ext3/super.c | 123 +++++++++++++++++++++++++++++++----------------
fs/ext3/xattr.c | 8 ++--
include/linux/ext3_fs.h | 7 +++
7 files changed, 99 insertions(+), 53 deletions(-)

diff --git a/fs/ext3/balloc.c b/fs/ext3/balloc.c
index da0cb2c..1b42154 100644
--- a/fs/ext3/balloc.c
+++ b/fs/ext3/balloc.c
@@ -145,7 +145,7 @@ read_block_bitmap(struct super_block *sb, unsigned int block_group)
if (!desc)
return NULL;
bitmap_blk = le32_to_cpu(desc->bg_block_bitmap);
- bh = sb_getblk(sb, bitmap_blk);
+ bh = ext3_sb_getblk(sb, bitmap_blk);
if (unlikely(!bh)) {
ext3_error(sb, __FUNCTION__,
"Cannot read block bitmap - "
diff --git a/fs/ext3/ialloc.c b/fs/ext3/ialloc.c
index 4f4020c..7de791c 100644
--- a/fs/ext3/ialloc.c
+++ b/fs/ext3/ialloc.c
@@ -60,7 +60,7 @@ read_inode_bitmap(struct super_block * sb, unsigned long block_group)
if (!desc)
goto error_out;

- bh = sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
+ bh = ext3_sb_bread(sb, le32_to_cpu(desc->bg_inode_bitmap));
if (!bh)
ext3_error(sb, "read_inode_bitmap",
"Cannot read inode bitmap - "
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index eb95670..75fcf9e 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -364,7 +364,7 @@ static Indirect *ext3_get_branch(struct inode *inode, int depth, int *offsets,
if (!p->key)
goto no_block;
while (--depth) {
- bh = sb_bread(sb, le32_to_cpu(p->key));
+ bh = ext3_sb_bread(sb, le32_to_cpu(p->key));
if (!bh)
goto failure;
/* Reader: pointers */
@@ -1009,7 +1009,7 @@ struct buffer_head *ext3_getblk(handle_t *handle, struct inode *inode,
*errp = err;
if (!err && buffer_mapped(&dummy)) {
struct buffer_head *bh;
- bh = sb_getblk(inode->i_sb, dummy.b_blocknr);
+ bh = ext3_sb_getblk(inode->i_sb, dummy.b_blocknr);
if (!bh) {
*errp = -EIO;
goto err;
@@ -2514,7 +2514,7 @@ static int __ext3_get_inode_loc(struct inode *inode,
if (!block)
return -EIO;

- bh = sb_getblk(inode->i_sb, block);
+ bh = ext3_sb_getblk(inode->i_sb, block);
if (!bh) {
ext3_error (inode->i_sb, "ext3_get_inode_loc",
"unable to read inode block - "
@@ -2557,7 +2557,7 @@ static int __ext3_get_inode_loc(struct inode *inode,
if (!desc)
goto make_io;

- bitmap_bh = sb_getblk(inode->i_sb,
+ bitmap_bh = ext3_sb_getblk(inode->i_sb,
le32_to_cpu(desc->bg_inode_bitmap));
if (!bitmap_bh)
goto make_io;
diff --git a/fs/ext3/resize.c b/fs/ext3/resize.c
index 9397d77..2f9799b 100644
--- a/fs/ext3/resize.c
+++ b/fs/ext3/resize.c
@@ -236,7 +236,7 @@ static int setup_new_group_blocks(struct super_block *sb,
if (err)
goto exit_bh;

- gdb = sb_getblk(sb, block);
+ gdb = ext3_sb_getblk(sb, block);
if (!gdb) {
err = -EIO;
goto exit_bh;
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 4b9ff65..c333dac 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -64,6 +64,40 @@ static void ext3_unlockfs(struct super_block *sb);
static void ext3_write_super (struct super_block * sb);
static void ext3_write_super_lockfs(struct super_block *sb);

+struct buffer_head *ext3_sb_bread(struct super_block *sb, sector_t block)
+{
+ sector_t blocknr;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ if (journal_translate_block(journal, block, &blocknr))
+ return __bread(journal->j_dev, blocknr, journal->j_blocksize);
+ else
+ return sb_bread(sb, block);
+}
+
+struct buffer_head *ext3_sb_getblk(struct super_block *sb, sector_t block)
+{
+ sector_t blocknr;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ if (journal_translate_block(journal, block, &blocknr))
+ return __getblk(journal->j_dev, blocknr, journal->j_blocksize);
+ else
+ return sb_getblk(sb, block);
+}
+
+void ext3_map_bh(struct buffer_head *bh, struct super_block *sb, sector_t block)
+{
+ sector_t blocknr;
+ journal_t *journal = EXT3_SB(sb)->s_journal;
+ if (journal_translate_block(journal, block, &blocknr)) {
+ set_buffer_mapped(bh);
+ bh->b_bdev = journal->j_dev;
+ bh->b_blocknr = blocknr;
+ bh->b_size = journal->j_blocksize;
+ } else {
+ map_bh(bh, sb, block);
+ }
+}
+
/*
* Wrappers for journal_start/end.
*
@@ -1328,7 +1362,6 @@ static int ext3_check_descriptors(struct super_block *sb)
static void ext3_orphan_cleanup (struct super_block * sb,
struct ext3_super_block * es)
{
- unsigned int s_flags = sb->s_flags;
int nr_orphans = 0, nr_truncates = 0;
#ifdef CONFIG_QUOTA
int i;
@@ -1338,9 +1371,9 @@ static void ext3_orphan_cleanup (struct super_block * sb,
return;
}

- if (bdev_read_only(sb->s_bdev)) {
- printk(KERN_ERR "EXT3-fs: write access "
- "unavailable, skipping orphan cleanup.\n");
+ if (sb->s_flags & MS_RDONLY) {
+ printk(KERN_INFO "EXT3-fs: %s: skipping orphan cleanup on "
+ "readonly fs\n", sb->s_id);
return;
}

@@ -1353,11 +1386,6 @@ static void ext3_orphan_cleanup (struct super_block * sb,
return;
}

- if (s_flags & MS_RDONLY) {
- printk(KERN_INFO "EXT3-fs: %s: orphan cleanup on readonly fs\n",
- sb->s_id);
- sb->s_flags &= ~MS_RDONLY;
- }
#ifdef CONFIG_QUOTA
/* Needed for iput() to work correctly and not trash data */
sb->s_flags |= MS_ACTIVE;
@@ -1418,7 +1446,6 @@ static void ext3_orphan_cleanup (struct super_block * sb,
vfs_quota_off(sb, i);
}
#endif
- sb->s_flags = s_flags; /* Restore MS_RDONLY status */
}

/*
@@ -1838,8 +1865,8 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
goto failed_mount3;
}

- /* We have now updated the journal if required, so we can
- * validate the data journaling mode. */
+ /* We have now updated the journal or mapped unrecovered blocks as
+ * required, so we can validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
case 0:
/* No mode set, assume a default based on the journal
@@ -1872,8 +1899,8 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
}
}
/*
- * The journal_load will have done any necessary log recovery,
- * so we can safely mount the rest of the filesystem now.
+ * The journal_load will have done any necessary log recovery or block
+ * remapping, so we can safely mount the rest of the filesystem now.
*/

root = ext3_iget(sb, EXT3_ROOT_INO);
@@ -1907,9 +1934,16 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
EXT3_SB(sb)->s_mount_state |= EXT3_ORPHAN_FS;
ext3_orphan_cleanup(sb, es);
EXT3_SB(sb)->s_mount_state &= ~EXT3_ORPHAN_FS;
- if (needs_recovery)
- printk (KERN_INFO "EXT3-fs: recovery complete.\n");
- ext3_mark_recovery_complete(sb, es);
+
+ if (needs_recovery) {
+ if (sb->s_flags & MS_RDONLY) {
+ printk(KERN_INFO "EXT3-fs: recovery skipped on "
+ "read-only filesystem.\n");
+ } else {
+ ext3_mark_recovery_complete(sb, es);
+ printk(KERN_INFO "EXT3-fs: recovery complete.\n");
+ }
+ }
printk (KERN_INFO "EXT3-fs: mounted filesystem with %s data mode.\n",
test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA ? "journal":
test_opt(sb,DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA ? "ordered":
@@ -2110,7 +2144,6 @@ static int ext3_load_journal(struct super_block *sb,
unsigned int journal_inum = le32_to_cpu(es->s_journal_inum);
dev_t journal_dev;
int err = 0;
- int really_read_only;

if (journal_devnum &&
journal_devnum != le32_to_cpu(es->s_journal_dev)) {
@@ -2120,26 +2153,16 @@ static int ext3_load_journal(struct super_block *sb,
} else
journal_dev = new_decode_dev(le32_to_cpu(es->s_journal_dev));

- really_read_only = bdev_read_only(sb->s_bdev);
-
/*
* Are we loading a blank journal or performing recovery after a
* crash? For recovery, we need to check in advance whether we
* can get read-write access to the device.
*/

- if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER)) {
- if (sb->s_flags & MS_RDONLY) {
- printk(KERN_INFO "EXT3-fs: INFO: recovery "
- "required on readonly filesystem.\n");
- if (really_read_only) {
- printk(KERN_ERR "EXT3-fs: write access "
- "unavailable, cannot proceed.\n");
- return -EROFS;
- }
- printk (KERN_INFO "EXT3-fs: write access will "
- "be enabled during recovery.\n");
- }
+ if (EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER) &&
+ sb->s_flags & MS_RDONLY) {
+ printk(KERN_INFO "EXT3-fs: INFO: skipping recovery "
+ "on readonly filesystem.\n");
}

if (journal_inum && journal_dev) {
@@ -2156,7 +2179,7 @@ static int ext3_load_journal(struct super_block *sb,
return -EINVAL;
}

- if (!really_read_only && test_opt(sb, UPDATE_JOURNAL)) {
+ if (!(sb->s_flags & MS_RDONLY) && test_opt(sb, UPDATE_JOURNAL)) {
err = journal_update_format(journal);
if (err) {
printk(KERN_ERR "EXT3-fs: error updating journal.\n");
@@ -2166,9 +2189,9 @@ static int ext3_load_journal(struct super_block *sb,
}

if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
- err = journal_wipe(journal, !really_read_only);
+ err = journal_wipe(journal, !(sb->s_flags & MS_RDONLY));
if (!err)
- err = journal_load(journal, !really_read_only);
+ err = journal_load(journal, !(sb->s_flags & MS_RDONLY));

if (err) {
printk(KERN_ERR "EXT3-fs: error loading journal.\n");
@@ -2177,15 +2200,17 @@ static int ext3_load_journal(struct super_block *sb,
}

EXT3_SB(sb)->s_journal = journal;
- ext3_clear_journal_err(sb, es);
+ if (!(sb->s_flags & MS_RDONLY)) {
+ ext3_clear_journal_err(sb, es);

- if (journal_devnum &&
- journal_devnum != le32_to_cpu(es->s_journal_dev)) {
- es->s_journal_dev = cpu_to_le32(journal_devnum);
- sb->s_dirt = 1;
+ if (journal_devnum &&
+ journal_devnum != le32_to_cpu(es->s_journal_dev)) {
+ es->s_journal_dev = cpu_to_le32(journal_devnum);
+ sb->s_dirt = 1;

- /* Make sure we flush the recovery flag to disk. */
- ext3_commit_super(sb, es, 1);
+ /* Make sure we flush the recovery flag to disk. */
+ ext3_commit_super(sb, es, 1);
+ }
}

return 0;
@@ -2494,6 +2519,20 @@ static int ext3_remount (struct super_block * sb, int * flags, char * data)
}

/*
+ * If we have unrecovered journal transactions hanging
+ * around require a full umount/remount for now.
+ */
+ if (sbi->s_journal->j_bt_hash) {
+ printk(KERN_WARNING "EXT3-fs: %s: couldn't "
+ "remount RDWR because of unrecovered "
+ "journal transactions. Please "
+ "umount/remount instead.\n",
+ sb->s_id);
+ err = -EINVAL;
+ goto restore_opts;
+ }
+
+ /*
* Mounting a RDONLY partition read-write, so reread
* and store the current valid flag. (It may have
* been changed by e2fsck since we originally mounted
diff --git a/fs/ext3/xattr.c b/fs/ext3/xattr.c
index fb89c29..6d0ef6f 100644
--- a/fs/ext3/xattr.c
+++ b/fs/ext3/xattr.c
@@ -226,7 +226,7 @@ ext3_xattr_block_get(struct inode *inode, int name_index, const char *name,
if (!EXT3_I(inode)->i_file_acl)
goto cleanup;
ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
- bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
+ bh = ext3_sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
if (!bh)
goto cleanup;
ea_bdebug(bh, "b_count=%d, refcount=%d",
@@ -367,7 +367,7 @@ ext3_xattr_block_list(struct inode *inode, char *buffer, size_t buffer_size)
if (!EXT3_I(inode)->i_file_acl)
goto cleanup;
ea_idebug(inode, "reading block %u", EXT3_I(inode)->i_file_acl);
- bh = sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
+ bh = ext3_sb_bread(inode->i_sb, EXT3_I(inode)->i_file_acl);
error = -EIO;
if (!bh)
goto cleanup;
@@ -641,7 +641,7 @@ ext3_xattr_block_find(struct inode *inode, struct ext3_xattr_info *i,

if (EXT3_I(inode)->i_file_acl) {
/* The inode already has an extended attribute block. */
- bs->bh = sb_bread(sb, EXT3_I(inode)->i_file_acl);
+ bs->bh = ext3_sb_bread(sb, EXT3_I(inode)->i_file_acl);
error = -EIO;
if (!bs->bh)
goto cleanup;
@@ -1213,7 +1213,7 @@ again:
goto again;
break;
}
- bh = sb_bread(inode->i_sb, ce->e_block);
+ bh = ext3_sb_bread(inode->i_sb, ce->e_block);
if (!bh) {
ext3_error(inode->i_sb, __FUNCTION__,
"inode %lu: block %lu read error",
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 36c5403..c5732e1 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -864,6 +864,13 @@ extern void ext3_abort (struct super_block *, const char *, const char *, ...)
extern void ext3_warning (struct super_block *, const char *, const char *, ...)
__attribute__ ((format (printf, 3, 4)));
extern void ext3_update_dynamic_rev (struct super_block *sb);
+extern struct buffer_head *ext3_sb_bread(struct super_block *sb,
+ sector_t block);
+extern struct buffer_head *ext3_sb_getblk(struct super_block *sb,
+ sector_t block);
+extern void ext3_map_bh(struct buffer_head *bh,
+ struct super_block *sb, sector_t block);
+

#define ext3_std_error(sb, errno) \
do { \
--
1.5.3.7

2008-03-06 03:43:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

On Thu, 6 Mar 2008 01:59:08 +0000 "Duane Griffin" <[email protected]> wrote:

> At present, as discussed in this LKML thread,
> http://marc.info/?l=linux-kernel&m=117607695406580, when a dirty ext3
> filesystem is mounted read-only it writes to the disk while replaying the
> journal log and cleaning up the orphan list. This behaviour may surprise users
> and can potentially cause data corruption/loss (e.g. if a system is suspended,
> booted into a different OS, then resumed).
>
> This patch series attempts to address this by using a block translation table
> instead of replaying the journal on a read-only filesystem.
>
> Patches 1-3 are independent cleanups/bug-fixes for things I came across while
> working on this. They could be submitted separately and are not required for
> following patches.
>
> Patch 4 is a refactoring change that simplifies the code prior to later
> substantive changes.
>
> Patch 5 introduces the translation table and support for a truly read-only
> journal into jbd.
>
> Patch 6 uses the facility introduced in patch 5 to add support for true
> read-only ext3.

I'll grab the first three for now, thanks.

Someone(tm) should do the jbd2 versions..

2008-03-06 11:20:41

by Duane Griffin

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

On 06/03/2008, Andrew Morton <[email protected]> wrote:
> I'll grab the first three for now, thanks.

Great, thanks!

> Someone(tm) should do the jbd2 versions..

I'll do them tonight, unless someone beats me to it.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2008-03-08 14:52:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC, PATCH 2/6] jbd: replace potentially false assertion with if block

On Thu, Mar 06, 2008 at 01:59:10AM +0000, Duane Griffin wrote:
> If an error occurs during jbd cache initialisation it is possible for the
> journal_head_cache to be NULL when journal_destroy_journal_head_cache is
> called. Replace the J_ASSERT with an if block to handle the situation
> correctly.
>
> Note that even with this fix things will break badly if ext3 is statically
> compiled in and cache initialisation fails.

Actually the whole code surrounding this is far too confusing. The
patch below converts it to much simpler code with proper goto unwinding
for initialization failures:

Index: linux-2.6/fs/jbd/journal.c
===================================================================
--- linux-2.6.orig/fs/jbd/journal.c 2008-03-08 15:43:41.000000000 +0100
+++ linux-2.6/fs/jbd/journal.c 2008-03-08 15:50:44.000000000 +0100
@@ -1615,31 +1615,6 @@ static struct kmem_cache *journal_head_c
static atomic_t nr_journal_heads = ATOMIC_INIT(0);
#endif

-static int journal_init_journal_head_cache(void)
-{
- int retval;
-
- J_ASSERT(journal_head_cache == 0);
- journal_head_cache = kmem_cache_create("journal_head",
- sizeof(struct journal_head),
- 0, /* offset */
- SLAB_TEMPORARY, /* flags */
- NULL); /* ctor */
- retval = 0;
- if (journal_head_cache == 0) {
- retval = -ENOMEM;
- printk(KERN_EMERG "JBD: no memory for journal_head cache\n");
- }
- return retval;
-}
-
-static void journal_destroy_journal_head_cache(void)
-{
- J_ASSERT(journal_head_cache != NULL);
- kmem_cache_destroy(journal_head_cache);
- journal_head_cache = NULL;
-}
-
/*
* journal_head splicing and dicing
*/
@@ -1891,59 +1866,45 @@ static inline void jbd_remove_debugfs_en

struct kmem_cache *jbd_handle_cache;

-static int __init journal_init_handle_cache(void)
-{
- jbd_handle_cache = kmem_cache_create("journal_handle",
- sizeof(handle_t),
- 0, /* offset */
- SLAB_TEMPORARY, /* flags */
- NULL); /* ctor */
- if (jbd_handle_cache == NULL) {
- printk(KERN_EMERG "JBD: failed to create handle cache\n");
- return -ENOMEM;
- }
- return 0;
-}
-
-static void journal_destroy_handle_cache(void)
-{
- if (jbd_handle_cache)
- kmem_cache_destroy(jbd_handle_cache);
-}
-
/*
* Module startup and shutdown
*/

-static int __init journal_init_caches(void)
+static int __init journal_init(void)
{
int ret;

- ret = journal_init_revoke_caches();
- if (ret == 0)
- ret = journal_init_journal_head_cache();
- if (ret == 0)
- ret = journal_init_handle_cache();
- return ret;
-}
+ BUILD_BUG_ON(sizeof(struct journal_superblock_s) != 1024);

-static void journal_destroy_caches(void)
-{
- journal_destroy_revoke_caches();
- journal_destroy_journal_head_cache();
- journal_destroy_handle_cache();
-}
+ ret = journal_init_revoke_caches();
+ if (ret)
+ goto out;

-static int __init journal_init(void)
-{
- int ret;
+ ret = -ENOMEM;
+ journal_head_cache = kmem_cache_create("journal_head",
+ sizeof(struct journal_head),
+ 0, /* offset */
+ SLAB_TEMPORARY, /* flags */
+ NULL); /* ctor */
+ if (!journal_head_cache)
+ goto out_destroy_revoke_caches;

- BUILD_BUG_ON(sizeof(struct journal_superblock_s) != 1024);
+ jbd_handle_cache = kmem_cache_create("journal_handle",
+ sizeof(handle_t),
+ 0, /* offset */
+ SLAB_TEMPORARY, /* flags */
+ NULL); /* ctor */
+ if (!jbd_handle_cache)
+ goto out_destroy_head_cache;

- ret = journal_init_caches();
- if (ret != 0)
- journal_destroy_caches();
jbd_create_debugfs_entry();
+ return 0;
+
+ out_destroy_head_cache:
+ kmem_cache_destroy(journal_head_cache);
+ out_destroy_revoke_caches:
+ journal_destroy_revoke_caches();
+ out:
return ret;
}

@@ -1955,7 +1916,10 @@ static void __exit journal_exit(void)
printk(KERN_EMERG "JBD: leaked %d journal_heads!\n", n);
#endif
jbd_remove_debugfs_entry();
- journal_destroy_caches();
+
+ kmem_cache_destroy(jbd_handle_cache);
+ kmem_cache_destroy(journal_head_cache);
+ journal_destroy_revoke_caches();
}

MODULE_LICENSE("GPL");

2008-03-08 14:53:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

On Thu, Mar 06, 2008 at 01:59:12AM +0000, Duane Griffin wrote:
> +static int replay_data_block(
> + journal_t *journal, struct buffer_head *obh, char *data,
> + int flags, unsigned long blocknr)

quite odd formatting, this should be more like:

static int replay_data_block(journal_t *journal, struct buffer_head *obh,
char *data, int flags, unsigned long blocknr)

(same for the other new helpers)


2008-03-08 18:40:31

by Duane Griffin

[permalink] [raw]
Subject: Re: [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

On 08/03/2008, Christoph Hellwig <[email protected]> wrote:
> On Thu, Mar 06, 2008 at 01:59:12AM +0000, Duane Griffin wrote:
> > +static int replay_data_block(
> > + journal_t *journal, struct buffer_head *obh, char *data,
> > + int flags, unsigned long blocknr)
>
> quite odd formatting, this should be more like:
>
> static int replay_data_block(journal_t *journal, struct buffer_head *obh,
> char *data, int flags, unsigned long blocknr)
>
> (same for the other new helpers)

OK, will change.

Thanks,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2008-03-11 14:35:51

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

> The do_one_pass function iterates through the jbd log operating on blocks
> depending on the pass. During the replay pass descriptor blocks are processed
> by an inner loop which replays them. The nesting makes the code difficult to
> follow or to modify. Splitting the inner loop and replay code into separate
> functions simplifies matters.
>
> The refactored code no longer unnecessarily reads revoked data blocks, so
> potential read errors from such blocks will cause differing behaviour. Aside
> from that there should be no functional effect.
Hmm, if I read your patch correctly, previously we aborted journal
replay when we found a block with an unknown block magic, but now we
continue replaying. Why have you made that change? And similarly when
an error happens while parsing a revoke record block...

Honza

>
> Signed-off-by: Duane Griffin <[email protected]>
> ---
>
> Note the TODO before the block read in the outer loop below. We could possibly
> attempt to continue after a failed read as we do when replaying descriptor
> blocks. However if it was a descriptor block we'd likely bomb out on the next
> iteration anyway, so I'm not sure it would be worth it. Thoughts?
>
> fs/jbd/recovery.c | 243 ++++++++++++++++++++++++++++-------------------------
> 1 files changed, 127 insertions(+), 116 deletions(-)
>
> diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
> index 2b8edf4..e2ac78f 100644
> --- a/fs/jbd/recovery.c
> +++ b/fs/jbd/recovery.c
> @@ -307,6 +307,101 @@ int journal_skip_recovery(journal_t *journal)
> return err;
> }
>
> +static int replay_data_block(
> + journal_t *journal, struct buffer_head *obh, char *data,
> + int flags, unsigned long blocknr)
> +{
> + struct buffer_head *nbh;
> +
> + /* Find a buffer for the new data being restored */
> + nbh = __getblk(journal->j_fs_dev, blocknr, journal->j_blocksize);
> + if (nbh == NULL)
> + return -ENOMEM;
> +
> + lock_buffer(nbh);
> + memcpy(nbh->b_data, obh->b_data, journal->j_blocksize);
> + if (flags & JFS_FLAG_ESCAPE)
> + *((__be32 *)data) = cpu_to_be32(JFS_MAGIC_NUMBER);
> +
> + BUFFER_TRACE(nbh, "marking dirty");
> + set_buffer_uptodate(nbh);
> + mark_buffer_dirty(nbh);
> + BUFFER_TRACE(nbh, "marking uptodate");
> + /* ll_rw_block(WRITE, 1, &nbh); */
> + unlock_buffer(nbh);
> + brelse(nbh);
> +
> + return 0;
> +}
> +
> +/* Replay a descriptor block: write all of the data blocks. */
> +static int replay_descriptor_block(
> + journal_t *journal, char *data,
> + unsigned int next_commit_ID,
> + unsigned long *next_log_block,
> + struct recovery_info *info)
> +{
> + int err;
> + int success = 0;
> + char *tagp;
> + unsigned long io_block;
> + struct buffer_head *obh;
> + unsigned long next = *next_log_block;
> +
> + tagp = &data[sizeof(journal_header_t)];
> + while ((tagp - data + sizeof(journal_block_tag_t)) <=
> + journal->j_blocksize) {
> + unsigned long blocknr;
> + journal_block_tag_t *tag = (journal_block_tag_t *) tagp;
> + int flags = be32_to_cpu(tag->t_flags);
> +
> + io_block = next++;
> + wrap(journal, next);
> + blocknr = be32_to_cpu(tag->t_blocknr);
> +
> + /* If the block has been revoked, then we're all done here. */
> + if (journal_test_revoke(journal, blocknr, next_commit_ID)) {
> + ++info->nr_revoke_hits;
> + goto skip_write;
> + }
> +
> + err = jread(&obh, journal, io_block);
> + if (err) {
> +
> + /* Recover what we can, but
> + * report failure at the end. */
> + success = err;
> + printk(KERN_ERR "JBD: IO error %d recovering "
> + "block %ld in log\n", err, io_block);
> + continue;
> + }
> + J_ASSERT(obh != NULL);
> +
> + err = replay_data_block(journal, obh, data, flags, blocknr);
> + if (err) {
> + brelse(obh);
> + printk(KERN_ERR "JBD: Out of memory during recovery\n");
> + success = err;
> + goto failed;
> + }
> +
> + ++info->nr_replays;
> + brelse(obh);
> +
> +skip_write:
> + tagp += sizeof(journal_block_tag_t);
> + if (!(flags & JFS_FLAG_SAME_UUID))
> + tagp += 16;
> +
> + if (flags & JFS_FLAG_LAST_TAG)
> + break;
> + }
> +
> + *next_log_block = next;
> +failed:
> + return success;
> +}
> +
> static int do_one_pass(journal_t *journal,
> struct recovery_info *info, enum passtype pass)
> {
> @@ -316,8 +411,6 @@ static int do_one_pass(journal_t *journal,
> journal_superblock_t * sb;
> journal_header_t * tmp;
> struct buffer_head * bh;
> - unsigned int sequence;
> - int blocktype;
>
> /* Precompute the maximum metadata descriptors in a descriptor block */
> int MAX_BLOCKS_PER_DESC;
> @@ -348,11 +441,8 @@ static int do_one_pass(journal_t *journal,
> */
>
> while (1) {
> - int flags;
> - char * tagp;
> - journal_block_tag_t * tag;
> - struct buffer_head * obh;
> - struct buffer_head * nbh;
> + int blocktype;
> + unsigned int sequence;
>
> cond_resched();
>
> @@ -371,10 +461,13 @@ static int do_one_pass(journal_t *journal,
> * either the next descriptor block or the final commit
> * record. */
>
> + /* TODO: Should we continue on error? */
> jbd_debug(3, "JBD: checking block %ld\n", next_log_block);
> err = jread(&bh, journal, next_log_block);
> - if (err)
> + if (err) {
> + success = err;
> goto failed;
> + }
>
> next_log_block++;
> wrap(journal, next_log_block);
> @@ -406,137 +499,57 @@ static int do_one_pass(journal_t *journal,
> * all of the sequence number checks. What are we going
> * to do with it? That depends on the pass... */
>
> - switch(blocktype) {
> + switch (blocktype) {
> case JFS_DESCRIPTOR_BLOCK:
> /* If it is a valid descriptor block, replay it
> * in pass REPLAY; otherwise, just skip over the
> * blocks it describes. */
> - if (pass != PASS_REPLAY) {
> + if (pass == PASS_REPLAY) {
> + err = replay_descriptor_block(
> + journal,
> + bh->b_data,
> + next_commit_ID,
> + &next_log_block,
> + info);
> + } else {
> next_log_block +=
> count_tags(bh, journal->j_blocksize);
> wrap(journal, next_log_block);
> - brelse(bh);
> - continue;
> }
> -
> - /* A descriptor block: we can now write all of
> - * the data blocks. Yay, useful work is finally
> - * getting done here! */
> -
> - tagp = &bh->b_data[sizeof(journal_header_t)];
> - while ((tagp - bh->b_data +sizeof(journal_block_tag_t))
> - <= journal->j_blocksize) {
> - unsigned long io_block;
> -
> - tag = (journal_block_tag_t *) tagp;
> - flags = be32_to_cpu(tag->t_flags);
> -
> - io_block = next_log_block++;
> - wrap(journal, next_log_block);
> - err = jread(&obh, journal, io_block);
> - if (err) {
> - /* Recover what we can, but
> - * report failure at the end. */
> - success = err;
> - printk (KERN_ERR
> - "JBD: IO error %d recovering "
> - "block %ld in log\n",
> - err, io_block);
> - } else {
> - unsigned long blocknr;
> -
> - J_ASSERT(obh != NULL);
> - blocknr = be32_to_cpu(tag->t_blocknr);
> -
> - /* If the block has been
> - * revoked, then we're all done
> - * here. */
> - if (journal_test_revoke
> - (journal, blocknr,
> - next_commit_ID)) {
> - brelse(obh);
> - ++info->nr_revoke_hits;
> - goto skip_write;
> - }
> -
> - /* Find a buffer for the new
> - * data being restored */
> - nbh = __getblk(journal->j_fs_dev,
> - blocknr,
> - journal->j_blocksize);
> - if (nbh == NULL) {
> - printk(KERN_ERR
> - "JBD: Out of memory "
> - "during recovery.\n");
> - err = -ENOMEM;
> - brelse(bh);
> - brelse(obh);
> - goto failed;
> - }
> -
> - lock_buffer(nbh);
> - memcpy(nbh->b_data, obh->b_data,
> - journal->j_blocksize);
> - if (flags & JFS_FLAG_ESCAPE) {
> - *((__be32 *)bh->b_data) =
> - cpu_to_be32(JFS_MAGIC_NUMBER);
> - }
> -
> - BUFFER_TRACE(nbh, "marking dirty");
> - set_buffer_uptodate(nbh);
> - mark_buffer_dirty(nbh);
> - BUFFER_TRACE(nbh, "marking uptodate");
> - ++info->nr_replays;
> - /* ll_rw_block(WRITE, 1, &nbh); */
> - unlock_buffer(nbh);
> - brelse(obh);
> - brelse(nbh);
> - }
> -
> - skip_write:
> - tagp += sizeof(journal_block_tag_t);
> - if (!(flags & JFS_FLAG_SAME_UUID))
> - tagp += 16;
> -
> - if (flags & JFS_FLAG_LAST_TAG)
> - break;
> - }
> -
> - brelse(bh);
> - continue;
> + break;
>
> case JFS_COMMIT_BLOCK:
> /* Found an expected commit block: not much to
> * do other than move on to the next sequence
> * number. */
> - brelse(bh);
> next_commit_ID++;
> - continue;
> + break;
>
> case JFS_REVOKE_BLOCK:
> /* If we aren't in the REVOKE pass, then we can
> * just skip over this block. */
> - if (pass != PASS_REVOKE) {
> - brelse(bh);
> - continue;
> + if (pass == PASS_REVOKE) {
> + err = scan_revoke_records(
> + journal, bh, next_commit_ID, info);
> }
> -
> - err = scan_revoke_records(journal, bh,
> - next_commit_ID, info);
> - brelse(bh);
> - if (err)
> - goto failed;
> - continue;
> + break;
>
> default:
> jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
> blocktype);
> - brelse(bh);
> - goto done;
> + break;
> }
> +
> + brelse(bh);
> +
> + /* Immediately fail on OOM; on other errors try to recover
> + * as much as we can, but report the failure at the end. */
> + if (err)
> + success = err;
> + if (err == -ENOMEM)
> + goto failed;
> }
>
> - done:
> /*
> * We broke out of the log scan loop: either we came to the
> * known end of the log or we found an unexpected block in the
> @@ -558,10 +571,8 @@ static int do_one_pass(journal_t *journal,
> }
> }
>
> - return success;
> -
> failed:
> - return err;
> + return success;
> }
>
>
> --
> 1.5.3.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SuSE CR Labs

2008-03-11 15:05:15

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC, PATCH 5/6] jbd: add support for read-only log recovery

> Read-only log recovery allows a dirty journalled filesystem to be mounted and
> to provide a consistent view of its data without writing to the disk. Instead
> of replaying the log, a mapping is constructed between modified filesystem
> blocks and the journal blocks containing their data.
>
> The mapping is stored in a hashtable that is created and populated during
> journal recovery. The hash function is the same one used by the journal
> revocation code. The size of the hashtable is currently set, somewhat
> arbitrarily, to 256 entries. Given that we know the number of block mappings
> when we create the table, it may be worth dynamically calculating an
> appropriate size instead of hard-coding it.
Hmm, how about setting a journal flag in memory and using that,
instead of passing a 'write' parameter through the functions? That
would also give you a nicer solution to your problem in
journal_destroy() and a few other functions.
Also, dynamic sizing of the hashtable would be useful, since you know
the needed number of entries anyway...
And you have one problem you don't seem to be aware of: JBD escapes
data in a block to avoid a magic number appearing in its first four
bytes (see the code replaying data from the journal). You have to undo
this escaping when reading blocks from the journal, so it's not as
easy as you have it now.

Honza

> Signed-off-by: Duane Griffin <[email protected]>
> ---
>
> I'm tempted to add warnings on attempts to modify a read-only journal; does
> that sound useful? I'd also be grateful to anyone with suggestions for better
> member names. :)
>
> fs/ext3/super.c | 2 +-
> fs/jbd/checkpoint.c | 2 +-
> fs/jbd/commit.c | 2 +-
> fs/jbd/journal.c | 56 ++++++++++------
> fs/jbd/recovery.c | 187 +++++++++++++++++++++++++++++++++++++++++++++------
> fs/jbd/revoke.c | 7 +--
> fs/ocfs2/journal.c | 4 +-
> include/linux/jbd.h | 41 ++++++++++-
> 8 files changed, 246 insertions(+), 55 deletions(-)
>
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 18769cc..4b9ff65 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -2168,7 +2168,7 @@ static int ext3_load_journal(struct super_block *sb,
> if (!EXT3_HAS_INCOMPAT_FEATURE(sb, EXT3_FEATURE_INCOMPAT_RECOVER))
> err = journal_wipe(journal, !really_read_only);
> if (!err)
> - err = journal_load(journal);
> + err = journal_load(journal, !really_read_only);
>
> if (err) {
> printk(KERN_ERR "EXT3-fs: error loading journal.\n");
> diff --git a/fs/jbd/checkpoint.c b/fs/jbd/checkpoint.c
> index a5432bb..84d4579 100644
> --- a/fs/jbd/checkpoint.c
> +++ b/fs/jbd/checkpoint.c
> @@ -454,7 +454,7 @@ int cleanup_journal_tail(journal_t *journal)
> journal->j_tail = blocknr;
> spin_unlock(&journal->j_state_lock);
> if (!(journal->j_flags & JFS_ABORT))
> - journal_update_superblock(journal, 1);
> + journal_update_superblock(journal, 1, 1);
> return 0;
> }
>
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index a38c718..c63598b 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -310,7 +310,7 @@ void journal_commit_transaction(journal_t *journal)
> /* Do we need to erase the effects of a prior journal_flush? */
> if (journal->j_flags & JFS_FLUSHED) {
> jbd_debug(3, "super block updated\n");
> - journal_update_superblock(journal, 1);
> + journal_update_superblock(journal, 1, 1);
> } else {
> jbd_debug(3, "superblock not updated\n");
> }
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index a868277..b2723a0 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -822,7 +822,7 @@ static void journal_fail_superblock (journal_t *journal)
> * subsequent use.
> */
>
> -static int journal_reset(journal_t *journal)
> +static int journal_reset(journal_t *journal, int write)
> {
> journal_superblock_t *sb = journal->j_superblock;
> unsigned long first, last;
> @@ -843,8 +843,8 @@ static int journal_reset(journal_t *journal)
>
> journal->j_max_transaction_buffers = journal->j_maxlen / 4;
>
> - /* Add the dynamic fields and write it to disk. */
> - journal_update_superblock(journal, 1);
> + /* Add the dynamic fields and write it to disk if not read-only. */
> + journal_update_superblock(journal, write, 1);
> return journal_start_thread(journal);
> }
>
> @@ -916,21 +916,21 @@ int journal_create(journal_t *journal)
> journal->j_flags &= ~JFS_ABORT;
> journal->j_format_version = 2;
>
> - return journal_reset(journal);
> + return journal_reset(journal, 1);
> }
>
> /**
> - * void journal_update_superblock() - Update journal sb on disk.
> + * void journal_update_superblock() - Update journal sb in memory and on disk.
> * @journal: The journal to update.
> + * @write: Set to '0' if you don't want to write to the disk.
> * @wait: Set to '0' if you don't want to wait for IO completion.
> *
> * Update a journal's dynamic superblock fields and write it to disk,
> * optionally waiting for the IO to complete.
> */
> -void journal_update_superblock(journal_t *journal, int wait)
> +void journal_update_superblock(journal_t *journal, int write, int wait)
> {
> journal_superblock_t *sb = journal->j_superblock;
> - struct buffer_head *bh = journal->j_sb_buffer;
>
> /*
> * As a special case, if the on-disk copy is already marked as needing
> @@ -957,12 +957,16 @@ void journal_update_superblock(journal_t *journal, int wait)
> sb->s_errno = cpu_to_be32(journal->j_errno);
> spin_unlock(&journal->j_state_lock);
>
> - BUFFER_TRACE(bh, "marking dirty");
> - mark_buffer_dirty(bh);
> - if (wait)
> - sync_dirty_buffer(bh);
> - else
> - ll_rw_block(SWRITE, 1, &bh);
> + if (write) {
> + struct buffer_head *bh = journal->j_sb_buffer;
> +
> + BUFFER_TRACE(bh, "marking dirty");
> + mark_buffer_dirty(bh);
> + if (wait)
> + sync_dirty_buffer(bh);
> + else
> + ll_rw_block(SWRITE, 1, &bh);
> + }
>
> out:
> /* If we have just flushed the log (by marking s_start==0), then
> @@ -1066,12 +1070,13 @@ static int load_superblock(journal_t *journal)
> /**
> * int journal_load() - Read journal from disk.
> * @journal: Journal to act on.
> + * @write: Whether the journal may be modified on-disk.
> *
> * Given a journal_t structure which tells us which disk blocks contain
> * a journal, read the journal from disk to initialise the in-memory
> * structures.
> */
> -int journal_load(journal_t *journal)
> +int journal_load(journal_t *journal, int write)
> {
> int err;
> journal_superblock_t *sb;
> @@ -1097,13 +1102,13 @@ int journal_load(journal_t *journal)
>
> /* Let the recovery code check whether it needs to recover any
> * data from the journal. */
> - if (journal_recover(journal))
> + if (journal_recover(journal, write))
> goto recovery_error;
>
> /* OK, we've finished with the dynamic journal bits:
> * reinitialise the dynamic contents of the superblock in memory
> * and reset them on disk. */
> - if (journal_reset(journal))
> + if (journal_reset(journal, write))
> goto recovery_error;
>
> journal->j_flags &= ~JFS_ABORT;
> @@ -1150,7 +1155,12 @@ void journal_destroy(journal_t *journal)
> journal->j_tail = 0;
> journal->j_tail_sequence = ++journal->j_transaction_sequence;
> if (journal->j_sb_buffer) {
> - journal_update_superblock(journal, 1);
> +
> + /*
> + * Ugly, but what to do?
> + * We don't have a superblock to test for read-only.
> + */
> + journal_update_superblock(journal, !journal->j_bt_hash, 1);
> brelse(journal->j_sb_buffer);
> }
>
> @@ -1158,6 +1168,8 @@ void journal_destroy(journal_t *journal)
> iput(journal->j_inode);
> if (journal->j_revoke)
> journal_destroy_revoke(journal);
> + if (journal->j_bt_hash)
> + journal_destroy_translations(journal);
> kfree(journal->j_wbuf);
> kfree(journal);
> }
> @@ -1374,7 +1386,7 @@ int journal_flush(journal_t *journal)
> old_tail = journal->j_tail;
> journal->j_tail = 0;
> spin_unlock(&journal->j_state_lock);
> - journal_update_superblock(journal, 1);
> + journal_update_superblock(journal, 1, 1);
> spin_lock(&journal->j_state_lock);
> journal->j_tail = old_tail;
>
> @@ -1420,8 +1432,7 @@ int journal_wipe(journal_t *journal, int write)
> write ? "Clearing" : "Ignoring");
>
> err = journal_skip_recovery(journal);
> - if (write)
> - journal_update_superblock(journal, 1);
> + journal_update_superblock(journal, write, 1);
>
> no_recovery:
> return err;
> @@ -1489,7 +1500,7 @@ static void __journal_abort_soft (journal_t *journal, int errno)
> __journal_abort_hard(journal);
>
> if (errno)
> - journal_update_superblock(journal, 1);
> + journal_update_superblock(journal, !journal->j_bt_hash, 1);
> }
>
> /**
> @@ -1922,6 +1933,8 @@ static int __init journal_init_caches(void)
>
> ret = journal_init_revoke_caches();
> if (ret == 0)
> + ret = journal_init_recover_cache();
> + if (ret == 0)
> ret = journal_init_journal_head_cache();
> if (ret == 0)
> ret = journal_init_handle_cache();
> @@ -1931,6 +1944,7 @@ static int __init journal_init_caches(void)
> static void journal_destroy_caches(void)
> {
> journal_destroy_revoke_caches();
> + journal_destroy_recover_cache();
> journal_destroy_journal_head_cache();
> journal_destroy_handle_cache();
> }
> diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
> index e2ac78f..a73b062 100644
> --- a/fs/jbd/recovery.c
> +++ b/fs/jbd/recovery.c
> @@ -17,12 +17,15 @@
> #include "jfs_user.h"
> #else
> #include <linux/time.h>
> +#include <linux/cache.h>
> #include <linux/fs.h>
> #include <linux/jbd.h>
> #include <linux/errno.h>
> #include <linux/slab.h>
> #endif
>
> +static struct kmem_cache *bt_record_cache;
> +
> /*
> * Maintain information about the progress of the recovery job, so that
> * the different passes can carry information between them.
> @@ -123,6 +126,71 @@ failed:
>
> #endif /* __KERNEL__ */
>
> +int __init journal_init_recover_cache(void)
> +{
> + bt_record_cache = kmem_cache_create(
> + "block_translation_record",
> + sizeof(struct journal_block_translation),
> + 0, 0, NULL);
> + if (bt_record_cache == NULL) {
> + printk(KERN_EMERG "JBD: failed to create recover cache\n");
> + return -ENOMEM;
> + }
> + return 0;
> +}
> +
> +void journal_destroy_recover_cache(void)
> +{
> + if (bt_record_cache) {
> + kmem_cache_destroy(bt_record_cache);
> + bt_record_cache = NULL;
> + }
> +}
> +
> +bool journal_translate_block(journal_t *journal, sector_t blocknr,
> + sector_t *retp)
> +{
> + int hash;
> + struct journal_block_translation *jbt;
> + struct hlist_node *node;
> +
> + if (!journal || !journal->j_bt_hash)
> + goto unmapped;
> +
> + hash = jbd_hash(blocknr, journal->j_bt_hash_bits,
> + 1 << journal->j_bt_hash_bits);
> + hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
> + if (jbt->original == blocknr) {
> + *retp = jbt->target;
> + return true;
> + }
> + }
> +
> +unmapped:
> + *retp = blocknr;
> + return false;
> +}
> +
> +void journal_destroy_translations(journal_t *journal)
> +{
> + int ii;
> + struct journal_block_translation *jbt;
> + struct hlist_node *tmp;
> + struct hlist_node *node;
> +
> + if (!journal->j_bt_hash)
> + return;
> +
> + for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++) {
Hmm, why are you using 'ii' and not just 'i'? :) It's below as well.

> + hlist_for_each_entry_safe(jbt, node, tmp,
> + journal->j_bt_hash + ii, jbt_list) {
> + kmem_cache_free(bt_record_cache, jbt);
> + }
> + }
> +
> + kfree(journal->j_bt_hash);
> + journal->j_bt_hash = NULL;
> +}
>
> /*
> * Read a block from the journal
> @@ -173,6 +241,45 @@ static int jread(struct buffer_head **bhp, journal_t *journal,
> return 0;
> }
>
> +static int journal_record_bt(journal_t *journal, sector_t original,
> + sector_t target)
> +{
> + int err;
> + int hash;
> + struct hlist_node *node;
> + struct journal_block_translation *jbt;
> + sector_t blocknr;
> +
> + /* Store the physical block to avoid repeated lookups. */
> + err = journal_bmap(journal, target, &blocknr);
> + if (err)
> + return err;
> +
> + /*
> + * If the same block has been translated multiple times,
> + * just use the last.
> + */
> + hash = jbd_hash(original, journal->j_bt_hash_bits,
> + 1 << journal->j_bt_hash_bits);
> + hlist_for_each_entry(jbt, node, journal->j_bt_hash + hash, jbt_list) {
> + if (jbt->original == original) {
> + jbt->target = blocknr;
> + return 0;
> + }
> + }
> +
> + jbt = kmem_cache_alloc(bt_record_cache, GFP_NOFS);
> + if (!jbt)
> + return -ENOMEM;
> +
> + INIT_HLIST_NODE(&jbt->jbt_list);
> + jbt->original = original;
> + jbt->target = blocknr;
> +
> + hlist_add_head(&jbt->jbt_list, journal->j_bt_hash + hash);
> +
> + return 0;
> +}
>
> /*
> * Count the number of in-use tags in a journal descriptor block.
> @@ -201,6 +308,24 @@ static int count_tags(struct buffer_head *bh, int size)
> return nr;
> }
>
> +#define JOURNAL_DEFAULT_BT_HASHBITS 8
> +static int bt_hash_init(journal_t *journal)
> +{
> + int ii;
> + BUG_ON(journal->j_bt_hash);
> +
> + /* Vary size depending on the number of mappings/device size? */
> + journal->j_bt_hash_bits = JOURNAL_DEFAULT_BT_HASHBITS;
> + journal->j_bt_hash = kmalloc(sizeof(struct hlist_head) *
> + 1 << journal->j_bt_hash_bits, GFP_NOFS);
> + if (!journal->j_bt_hash)
> + return -ENOMEM;
> +
> + for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++)
> + INIT_HLIST_HEAD(journal->j_bt_hash + ii);
> +
> + return 0;
> +}
>
> /* Make sure we wrap around the log correctly! */
> #define wrap(journal, var) \
> @@ -212,6 +337,7 @@ do { \
> /**
> * journal_recover - recovers a on-disk journal
> * @journal: the journal to recover
> + * @write: whether to write to disk
> *
> * The primary function for recovering the log contents when mounting a
> * journaled device.
> @@ -220,8 +346,11 @@ do { \
> * end of the log. In the second, we assemble the list of revoke
> * blocks. In the third and final pass, we replay any un-revoked blocks
> * in the log.
> + *
> + * If the journal is read-only a translation table will be built up instead
> + * of the journal being replayed on disk.
> */
> -int journal_recover(journal_t *journal)
> +int journal_recover(journal_t *journal, int write)
> {
> int err;
> journal_superblock_t * sb;
> @@ -244,6 +373,16 @@ int journal_recover(journal_t *journal)
> return 0;
> }
>
> + if (!write) {
> + err = bt_hash_init(journal);
> + if (err) {
> + printk(KERN_ERR
> + "JBD: out of memory allocating"
> + " journal block translation hash\n");
> + return err;
> + }
> + }
> +
> err = do_one_pass(journal, &info, PASS_SCAN);
> if (!err)
> err = do_one_pass(journal, &info, PASS_REVOKE);
> @@ -256,12 +395,15 @@ int journal_recover(journal_t *journal)
> jbd_debug(1, "JBD: Replayed %d and revoked %d/%d blocks\n",
> info.nr_replays, info.nr_revoke_hits, info.nr_revokes);
>
> - /* Restart the log at the next transaction ID, thus invalidating
> - * any existing commit records in the log. */
> - journal->j_transaction_sequence = ++info.end_transaction;
> + if (write) {
> +
> + /* Restart the log at the next transaction ID, thus invalidating
> + * any existing commit records in the log. */
> + journal->j_transaction_sequence = ++info.end_transaction;
> + journal_clear_revoke(journal);
> + sync_blockdev(journal->j_fs_dev);
> + }
>
> - journal_clear_revoke(journal);
> - sync_blockdev(journal->j_fs_dev);
> return err;
> }
>
> @@ -345,7 +487,6 @@ static int replay_descriptor_block(
> int success = 0;
> char *tagp;
> unsigned long io_block;
> - struct buffer_head *obh;
> unsigned long next = *next_log_block;
>
> tagp = &data[sizeof(journal_header_t)];
> @@ -365,28 +506,34 @@ static int replay_descriptor_block(
> goto skip_write;
> }
>
> - err = jread(&obh, journal, io_block);
> - if (err) {
> + if (journal->j_bt_hash) {
> + err = journal_record_bt(journal, blocknr, io_block);
> + } else {
> + struct buffer_head *obh;
>
> - /* Recover what we can, but
> - * report failure at the end. */
> - success = err;
> - printk(KERN_ERR "JBD: IO error %d recovering "
> - "block %ld in log\n", err, io_block);
> - continue;
> - }
> - J_ASSERT(obh != NULL);
> + err = jread(&obh, journal, io_block);
> + if (err) {
>
> - err = replay_data_block(journal, obh, data, flags, blocknr);
> - if (err) {
> + /* Recover what we can, but
> + * report failure at the end. */
> + success = err;
> + printk(KERN_ERR "JBD: IO error %d recovering "
> + "block %ld in log\n", err, io_block);
> + continue;
> + }
> + J_ASSERT(obh != NULL);
> +
> + err = replay_data_block(
> + journal, obh, data, flags, blocknr);
> brelse(obh);
> + }
> + if (err) {
> printk(KERN_ERR "JBD: Out of memory during recovery\n");
> success = err;
> goto failed;
> }
>
-> ++info->nr_replays;
> - brelse(obh);
>
> skip_write:
> tagp += sizeof(journal_block_tag_t);
> diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
> index 5f64df4..7f57d03 100644
> --- a/fs/jbd/revoke.c
> +++ b/fs/jbd/revoke.c
> @@ -105,15 +105,10 @@ static void flush_descriptor(journal_t *, struct journal_head *, int);
>
> /* Utility functions to maintain the revoke table */
>
> -/* Borrowed from buffer.c: this is a tried and tested block hash function */
> static inline int hash(journal_t *journal, unsigned long block)
> {
> struct jbd_revoke_table_s *table = journal->j_revoke;
> - int hash_shift = table->hash_shift;
> -
> - return ((block << (hash_shift - 6)) ^
> - (block >> 13) ^
> - (block << (hash_shift - 12))) & (table->hash_size - 1);
> + return jbd_hash(block, table->hash_shift, table->hash_size);
> }
>
> static int insert_revoke_hash(journal_t *journal, unsigned long blocknr,
> diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
> index f31c7e8..cbd53aa 100644
> --- a/fs/ocfs2/journal.c
> +++ b/fs/ocfs2/journal.c
> @@ -591,7 +591,7 @@ int ocfs2_journal_load(struct ocfs2_journal *journal, int local)
>
> osb = journal->j_osb;
>
> - status = journal_load(journal->j_journal);
> + status = journal_load(journal->j_journal, true);
> if (status < 0) {
> mlog(ML_ERROR, "Failed to load journal!\n");
> goto done;
> @@ -1014,7 +1014,7 @@ static int ocfs2_replay_journal(struct ocfs2_super *osb,
> goto done;
> }
>
> - status = journal_load(journal);
> + status = journal_load(journal, 1);
> if (status < 0) {
> mlog_errno(status);
> if (!igrab(inode))
> diff --git a/include/linux/jbd.h b/include/linux/jbd.h
> index b18fd3b..6154082 100644
> --- a/include/linux/jbd.h
> +++ b/include/linux/jbd.h
> @@ -558,6 +558,18 @@ struct transaction_s
> };
>
> /**
> + * struct journal_block_translation - hash bucket recording block translations.
> + * @jbt_list: list of translated blocks in this bucket.
> + * @original: the (logical?) filesystem block being translated.
> + * @target: the physical journal block holding the original's data.
> + */
> +struct journal_block_translation {
> + struct hlist_node jbt_list;
> + sector_t original;
> + sector_t target;
> +};
> +
> +/**
> * struct journal_s - The journal_s type is the concrete type associated with
> * journal_t.
> * @j_flags: General journaling state flags
> @@ -618,6 +630,8 @@ struct transaction_s
> * @j_wbufsize: maximum number of buffer_heads allowed in j_wbuf, the
> * number that will fit in j_blocksize
> * @j_last_sync_writer: most recent pid which did a synchronous write
> + * @j_bt_hash: hash table mapping uncommitted blocks to their targets
> + * @j_bt_hash_bits: number of bits in block translation hash table
> * @j_private: An opaque pointer to fs-private information.
> */
>
> @@ -810,6 +824,12 @@ struct journal_s
> pid_t j_last_sync_writer;
>
> /*
> + * Remap uncommitted transactions in a read-only filesystem.
> + */
> + struct hlist_head *j_bt_hash;
> + int j_bt_hash_bits;
> +
> + /*
> * An opaque pointer to fs-private information. ext3 puts its
> * superblock pointer here
> */
> @@ -916,12 +936,12 @@ extern int journal_check_available_features
> extern int journal_set_features
> (journal_t *, unsigned long, unsigned long, unsigned long);
> extern int journal_create (journal_t *);
> -extern int journal_load (journal_t *journal);
> +extern int journal_load (journal_t *journal, int);
> extern void journal_destroy (journal_t *);
> -extern int journal_recover (journal_t *journal);
> +extern int journal_recover (journal_t *journal, int);
> extern int journal_wipe (journal_t *, int);
> extern int journal_skip_recovery (journal_t *);
> -extern void journal_update_superblock (journal_t *, int);
> +extern void journal_update_superblock (journal_t *, int, int);
> extern void journal_abort (journal_t *, int);
> extern int journal_errno (journal_t *);
> extern void journal_ack_err (journal_t *);
> @@ -970,6 +990,12 @@ extern int journal_test_revoke(journal_t *, unsigned long, tid_t);
> extern void journal_clear_revoke(journal_t *);
> extern void journal_switch_revoke_table(journal_t *journal);
>
> +extern bool journal_translate_block(journal_t *journal, sector_t blocknr,
> + sector_t *retp);
> +extern void journal_destroy_translations(journal_t *journal);
> +extern int journal_init_recover_cache(void);
> +extern void journal_destroy_recover_cache(void);
> +
> /*
> * The log thread user interface:
> *
> @@ -989,6 +1015,15 @@ void __log_wait_for_space(journal_t *journal);
> extern void __journal_drop_transaction(journal_t *, transaction_t *);
> extern int cleanup_journal_tail(journal_t *);
>
> +/* Borrowed from buffer.c: this is a tried and tested block hash function */
> +static inline int jbd_hash(unsigned long block, int shift, int size)
> +{
> + return ((block << (shift - 6)) ^
> + (block >> 13) ^
> + (block << (shift - 12))) &
> + (size - 1);
> +}
> +
> /* Debugging code only: */
>
> #define jbd_ENOSYS() \
> --
> 1.5.3.7
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SuSE CR Labs

2008-03-12 01:02:13

by Duane Griffin

[permalink] [raw]
Subject: Re: [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

On Tue, Mar 11, 2008 at 03:35:50PM +0100, Jan Kara wrote:
> Hmm, if I read your patch correctly, we previously aborted journal
> replay when we found a block with an unknown magic number, but now we
> continue replaying. Why have you made that change? The same applies
> when an error occurs while parsing a revoke record block...

You are right, that was an error on my part. Thanks for your careful
review. Please find below an incremental patch. I'll roll this into the
main patch before posting the next version.

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
index 453c5fe..34db55a 100644
--- a/fs/jbd/recovery.c
+++ b/fs/jbd/recovery.c
@@ -673,16 +673,22 @@ static int do_one_pass(journal_t *journal,
case JFS_REVOKE_BLOCK:
/* If we aren't in the REVOKE pass, then we can
* just skip over this block. */
- if (pass == PASS_REVOKE) {
- err = scan_revoke_records(
- journal, bh, next_commit_ID, info);
+ if (pass != PASS_REVOKE)
+ break;
+
+ err = scan_revoke_records(journal, bh,
+ next_commit_ID, info);
+ if (err) {
+ brelse(bh);
+ goto failed;
}
break;

default:
jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
blocktype);
- break;
+ brelse(bh);
+ goto done;
}

brelse(bh);
@@ -695,6 +701,7 @@ static int do_one_pass(journal_t *journal,
goto failed;
}

+done:
/*
* We broke out of the log scan loop: either we came to the
* known end of the log or we found an unexpected block in the

2008-03-12 01:41:00

by Duane Griffin

[permalink] [raw]
Subject: Re: [RFC, PATCH 5/6] jbd: add support for read-only log recovery

On Tue, Mar 11, 2008 at 04:05:14PM +0100, Jan Kara wrote:
> Hmm, how about setting a journal flag in memory and using that,
> instead of passing a 'write' parameter through the functions? That
> would also give you a nicer solution to your problem in
> journal_destroy() and a few other functions.

That sounds promising. I'll rework it and see how it looks, thanks.
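
Something along these lines, perhaps? The JFS_NO_WRITE name and the
journal_writable() helper below are just placeholders for illustration,
not something from the current patches:

#include <linux/jbd.h>

/* Record the read-only state once, instead of threading a 'write'
 * argument through journal_load(), journal_recover() and
 * journal_update_superblock(). */
#define JFS_NO_WRITE	0x040	/* journal must not be modified on disk */

static inline int journal_writable(journal_t *journal)
{
	return !(journal->j_flags & JFS_NO_WRITE);
}

journal_update_superblock(), journal_destroy() and friends could then
check journal_writable() internally and keep their existing signatures.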

> Also, dynamic sizing of the hashtable would be useful, since you know
> the needed number of entries anyway...

Yes, indeed. I'll probably look at this after I get remount support
working.
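
A rough sketch of what I have in mind, assuming the number of mappings
can be counted during the scan pass (the helper below is illustrative
only and not in the current patches):

/* Pick the number of hash bits from the expected number of block
 * mappings, aiming for roughly one entry per bucket. */
static int bt_hash_bits_for(unsigned int nr_mappings)
{
	int bits = 3;				/* at least 8 buckets */

	while ((1U << bits) < nr_mappings && bits < 16)
		bits++;				/* cap at 64K buckets */

	return bits;
}

bt_hash_init() could then take the mapping count as a parameter instead
of always using JOURNAL_DEFAULT_BT_HASHBITS.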

> And you have one problem you don't seem to be aware of: JBD escapes
> data in a block to avoid a magic number appearing in its first four
> bytes (see the code replaying data from the journal). You have to undo
> this escaping when reading blocks from the journal, so it's not as
> easy as you have it now.

D'oh! I noticed that in the journal replay code, but it didn't quite
register. Thanks for pointing it out. I'll think about that and make
sure to add it to my test scripts so it doesn't get overlooked.
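
To make sure I understand the read side, something like this, assuming
the translation table grows a per-mapping escaped flag (the extended
journal_translate_block_escaped() lookup is hypothetical and doesn't
exist in the current patches):

#include <linux/jbd.h>
#include <linux/buffer_head.h>

static int read_translated_block(journal_t *journal, sector_t blocknr,
				 struct buffer_head **bhp)
{
	sector_t target = blocknr;
	int escaped = 0;
	struct buffer_head *bh;

	/* Redirect to the journal copy if this block was remapped. */
	journal_translate_block_escaped(journal, blocknr, &target, &escaped);

	bh = __bread(journal->j_fs_dev, target, journal->j_blocksize);
	if (!bh)
		return -EIO;

	/* Escaped blocks have their leading magic number zeroed in the
	 * journal; restore it before handing the data back. A real
	 * implementation would presumably copy into the caller's buffer
	 * rather than patching the cached journal block like this. */
	if (escaped)
		*((__be32 *)bh->b_data) = cpu_to_be32(JFS_MAGIC_NUMBER);

	*bhp = bh;
	return 0;
}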

> > +void journal_destroy_translations(journal_t *journal)
> > +{
> > + int ii;
> > + struct journal_block_translation *jbt;
> > + struct hlist_node *tmp;
> > + struct hlist_node *node;
> > +
> > + if (!journal->j_bt_hash)
> > + return;
> > +
> > + for (ii = 0; ii < (1 << journal->j_bt_hash_bits); ii++) {
> Hmm, why are you using 'ii' and not just 'i'? :) It's below as well.

Habit :)

I usually use ii, jj, etc. as they are easier to grep for. However, it's
not Linux coding style, I must admit. I'll change it.

> > @@ -365,28 +506,34 @@ static int replay_descriptor_block(
> > goto skip_write;
> > }
> >
> > - err = jread(&obh, journal, io_block);
> > - if (err) {
> > + if (journal->j_bt_hash) {
> > + err = journal_record_bt(journal, blocknr, io_block);
> > + } else {
> > + struct buffer_head *obh;
> >
> > - /* Recover what we can, but
> > - * report failure at the end. */
> > - success = err;
> > - printk(KERN_ERR "JBD: IO error %d recovering "
> > - "block %ld in log\n", err, io_block);
> > - continue;
> > - }
> > - J_ASSERT(obh != NULL);
> > + err = jread(&obh, journal, io_block);
> > + if (err) {
> >
> > - err = replay_data_block(journal, obh, data, flags, blocknr);
> > - if (err) {
> > + /* Recover what we can, but
> > + * report failure at the end. */
> > + success = err;
> > + printk(KERN_ERR "JBD: IO error %d recovering "
> > + "block %ld in log\n", err, io_block);
> > + continue;
> > + }
> > + J_ASSERT(obh != NULL);
> > +
> > + err = replay_data_block(
> > + journal, obh, data, flags, blocknr);
> > brelse(obh);
> > + }
> > + if (err) {
> > printk(KERN_ERR "JBD: Out of memory during recovery\n");
> > success = err;
> > goto failed;
> > }
> >
> -> ++info->nr_replays;

Sorry, not sure if you are pointing something out or if this is a mailer
glitch. If the former, I'm afraid I don't understand.

> Jan Kara <[email protected]>
> SuSE CR Labs

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan

2008-03-12 10:50:29

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC, PATCH 4/6] jbd: refactor nested journal log recovery loop into separate functions

On Wed 12-03-08 01:02:05, Duane Griffin wrote:
> On Tue, Mar 11, 2008 at 03:35:50PM +0100, Jan Kara wrote:
> > Hmm, if I read your patch correctly, we previously aborted journal
> > replay when we found a block with an unknown magic number, but now we
> > continue replaying. Why have you made that change? The same applies
> > when an error occurs while parsing a revoke record block...
>
> You are right, that was an error on my part. Thanks for your careful
> review. Please find below an incremental patch. I'll roll this into the
> main patch before posting the next version.
Yup, looks fine.

Honza
>
> Cheers,
> Duane.
>
> --
> "I never could learn to drink that blood and call it wine" - Bob Dylan
>
> diff --git a/fs/jbd/recovery.c b/fs/jbd/recovery.c
> index 453c5fe..34db55a 100644
> --- a/fs/jbd/recovery.c
> +++ b/fs/jbd/recovery.c
> @@ -673,16 +673,22 @@ static int do_one_pass(journal_t *journal,
> case JFS_REVOKE_BLOCK:
> /* If we aren't in the REVOKE pass, then we can
> * just skip over this block. */
> - if (pass == PASS_REVOKE) {
> - err = scan_revoke_records(
> - journal, bh, next_commit_ID, info);
> + if (pass != PASS_REVOKE)
> + break;
> +
> + err = scan_revoke_records(journal, bh,
> + next_commit_ID, info);
> + if (err) {
> + brelse(bh);
> + goto failed;
> }
> break;
>
> default:
> jbd_debug(3, "Unrecognised magic %d, end of scan.\n",
> blocktype);
> - break;
> + brelse(bh);
> + goto done;
> }
>
> brelse(bh);
> @@ -695,6 +701,7 @@ static int do_one_pass(journal_t *journal,
> goto failed;
> }
>
> +done:
> /*
> * We broke out of the log scan loop: either we came to the
> * known end of the log or we found an unexpected block in the
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-03-12 10:51:22

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC, PATCH 5/6] jbd: add support for read-only log recovery

On Wed 12-03-08 01:40:56, Duane Griffin wrote:
> > > + }
> > > + if (err) {
> > > printk(KERN_ERR "JBD: Out of memory during recovery\n");
> > > success = err;
> > > goto failed;
> > > }
> > >
> > -> ++info->nr_replays;
>
> Sorry, not sure if you are pointing something out or if this is a mailer
> glitch. If the former, I'm afraid I don't understand.
Just a glitch, ignore that.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2008-03-13 03:22:21

by Daniel Phillips

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

Hi Duane,

Thanks for doing this. Some perhaps not so obvious fallout from the bad
old way of doing things is that ddnap (zumastor) hits an issue in
replication. Since ddsnap allows journal replay on the downstream
server and also needs to have an unaltered snapshot to apply deltas
against, if we do not take special care, Ext3 will come along and
modify the downstream snapshot even when told not to. Our solution:
take two snapshots per replication cycle (pretty cheap) so that one can
be clean and the other can be stepped on at will by the journal replay.
Ugh.

With your hack, we can eventually drop the double snapshot, provided no
other filesystem is similarly badly behaved.

Re your page translation table: we already have a page translation
table, and it is called the page cache. If you could figure out which
file (or metadata) each journal block belongs to, you could just load
the page cache pages back in and presto, done. No need to replay the
journal at all; you are already back to journal+disk = consistent
state.

I have probably missed a detail or two since I haven't looked closely at
how orphan inodes, revokes and probably other things work, but that is
the basic idea. SCT, does my reasoning hold water? (In fact,
ddsnap "replays" its own journal in exactly this way. Cache state is
reconstructed and no actual journal flush is performed.)

Anyway, this is just a theoretical comment; it is in no way a suggestion
for a rewrite. The reason is that you do not have any
convenient way to map physical journal blocks back to files and
metadata. Maybe if we do implement reverse mapping for Ext3/4 later
(not just a pipe dream) we could revisit this and lose your extra
mapping. As it stands your solution seems well built, after a quick
readthrough. Nice looking code. I think you added about 250 lines
overall, so tight too. Thanks again.

Daniel

2008-03-13 12:35:54

by Duane Griffin

[permalink] [raw]
Subject: Re: [RFC, PATCH 0/6] ext3: do not modify data on-disk when mounting read-only filesystem

On 13/03/2008, Daniel Phillips <[email protected]> wrote:
> Hi Duane,
>
> Thanks for doing this. Some perhaps not so obvious fallout from the bad
> old way of doing things is that ddsnap (zumastor) hits an issue in
> replication. Since ddsnap allows journal replay on the downstream
> server and also needs to have an unaltered snapshot to apply deltas
> against, if we do not take special care, Ext3 will come along and
> modify the downstream snapshot even when told not to. Our solution:
> take two snapshots per replication cycle (pretty cheap) so that one can
> be clean and the other can be stepped on at will by the journal replay.
> Ugh.

Ah, good to know, thanks. It looks like you folks are doing some
interesting things there.

> With your hack, we can eventually drop the double snapshot, provided no
> other filesystem is similarly badly behaved.

Excellent, I'm really pleased to hear that.

> Re your page translation table: we already have a page translation
> table, and it is called the page cache. If you could figure out which
> file (or metadata) each journal block belongs to, you could just load
> the page cache pages back in and presto, done. No need to replay the
> journal at all; you are already back to journal+disk = consistent
> state.

Hmm, interesting. I'll have to have a think about that, thanks.

> mapping. As it stands your solution seems well built, after a quick
> readthrough. Nice looking code. I think you added about 250 lines
> overall, so tight too. Thanks again.

Thanks very much, I appreciate it!

Cheers,
Duane.

--
"I never could learn to drink that blood and call it wine" - Bob Dylan