2009-07-21 10:04:20

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/5] ext3/jbd patches in my patch queue


Hi,

I'm sending here several patches that I carry currently and plan to merge
them. As far as I remember all of them went to the list already but some of them
some time ago so I rather resend them... Any comments welcome.

Honza


2009-07-21 10:04:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/5] jbd: Fail to load a journal if it is too short

Due to on disk corruption, it can happen that journal is too short. Fail
to load it in such case so that we don't oops somewhere later.

Reported-by: Nageswara R Sastry <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd/journal.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 737f724..94a64a1 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -848,6 +848,12 @@ static int journal_reset(journal_t *journal)

first = be32_to_cpu(sb->s_first);
last = be32_to_cpu(sb->s_maxlen);
+ if (first + JFS_MIN_JOURNAL_BLOCKS > last + 1) {
+ printk(KERN_ERR "JBD: Journal too short (blocks %lu-%lu).\n",
+ first, last);
+ journal_fail_superblock(journal);
+ return -EINVAL;
+ }

journal->j_first = first;
journal->j_last = last;
--
1.6.0.2


2009-07-21 10:04:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/5] ext3: Fix truncation of symlinks after failed write

Contents of long symlinks is written via standard write methods. So when the
write fails, we add inode to orphan list. But symlinks don't have .truncate
method defined so nobody properly removes them from the orphan list (both on
disk and in memory).

Fix this by calling ext3_truncate() directly instead of calling vmtruncate()
(which is saner anyway since we don't need anything vmtruncate() does except
from calling .truncate in these paths). We also add inode to orphan list only
if ext3_can_truncate() is true (currently, it can be false for symlinks when
there are no blocks allocated) - otherwise orphan list processing will complain
and ext3_truncate() will not remove inode from on-disk orphan list.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/inode.c | 19 ++++++++++---------
1 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 5f51fed..4d7da6f 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1193,15 +1193,16 @@ write_begin_failed:
* i_size_read because we hold i_mutex.
*
* Add inode to orphan list in case we crash before truncate
- * finishes.
+ * finishes. Do this only if ext3_can_truncate() agrees so
+ * that orphan processing code is happy.
*/
- if (pos + len > inode->i_size)
+ if (pos + len > inode->i_size && ext3_can_truncate(inode))
ext3_orphan_add(handle, inode);
ext3_journal_stop(handle);
unlock_page(page);
page_cache_release(page);
if (pos + len > inode->i_size)
- vmtruncate(inode, inode->i_size);
+ ext3_truncate(inode);
}
if (ret == -ENOSPC && ext3_should_retry_alloc(inode->i_sb, &retries))
goto retry;
@@ -1287,7 +1288,7 @@ static int ext3_ordered_write_end(struct file *file,
* There may be allocated blocks outside of i_size because
* we failed to copy some data. Prepare for truncate.
*/
- if (pos + len > inode->i_size)
+ if (pos + len > inode->i_size && ext3_can_truncate(inode))
ext3_orphan_add(handle, inode);
ret2 = ext3_journal_stop(handle);
if (!ret)
@@ -1296,7 +1297,7 @@ static int ext3_ordered_write_end(struct file *file,
page_cache_release(page);

if (pos + len > inode->i_size)
- vmtruncate(inode, inode->i_size);
+ ext3_truncate(inode);
return ret ? ret : copied;
}

@@ -1315,14 +1316,14 @@ static int ext3_writeback_write_end(struct file *file,
* There may be allocated blocks outside of i_size because
* we failed to copy some data. Prepare for truncate.
*/
- if (pos + len > inode->i_size)
+ if (pos + len > inode->i_size && ext3_can_truncate(inode))
ext3_orphan_add(handle, inode);
ret = ext3_journal_stop(handle);
unlock_page(page);
page_cache_release(page);

if (pos + len > inode->i_size)
- vmtruncate(inode, inode->i_size);
+ ext3_truncate(inode);
return ret ? ret : copied;
}

@@ -1358,7 +1359,7 @@ static int ext3_journalled_write_end(struct file *file,
* There may be allocated blocks outside of i_size because
* we failed to copy some data. Prepare for truncate.
*/
- if (pos + len > inode->i_size)
+ if (pos + len > inode->i_size && ext3_can_truncate(inode))
ext3_orphan_add(handle, inode);
EXT3_I(inode)->i_state |= EXT3_STATE_JDATA;
if (inode->i_size > EXT3_I(inode)->i_disksize) {
@@ -1375,7 +1376,7 @@ static int ext3_journalled_write_end(struct file *file,
page_cache_release(page);

if (pos + len > inode->i_size)
- vmtruncate(inode, inode->i_size);
+ ext3_truncate(inode);
return ret ? ret : copied;
}

--
1.6.0.2


2009-07-21 10:04:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/5] jbd: Fix a race between checkpointing code and journal_get_write_access()

The following race can happen:

CPU1 CPU2
checkpointing code checks the buffer, adds
it to an array for writeback
do_get_write_access()
...
lock_buffer()
unlock_buffer()
flush_batch() submits the buffer for IO
__jbd_journal_file_buffer()

So a buffer under writeout is returned from do_get_write_access(). Since
the filesystem code relies on the fact that journaled buffers cannot be
written out, it does not take the buffer lock and so it can modify buffer
while it is under writeout. That can lead to a filesystem corruption
if we crash at the right moment. The similar problem can happen with
the journal_get_create_access() path.
We fix the problem by clearing the buffer dirty bit under buffer_lock
even if the buffer is on BJ_None list. Actually, we clear the dirty bit
regardless the list the buffer is in and warn about the fact if
the buffer is already journalled.

Thanks for spotting the problem goes to dingdinghua <[email protected]>.

Reported-by: dingdinghua <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd/transaction.c | 68 +++++++++++++++++++++++++------------------------
1 files changed, 35 insertions(+), 33 deletions(-)

diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 73242ba..c03ac11 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -489,34 +489,15 @@ void journal_unlock_updates (journal_t *journal)
wake_up(&journal->j_wait_transaction_locked);
}

-/*
- * Report any unexpected dirty buffers which turn up. Normally those
- * indicate an error, but they can occur if the user is running (say)
- * tune2fs to modify the live filesystem, so we need the option of
- * continuing as gracefully as possible. #
- *
- * The caller should already hold the journal lock and
- * j_list_lock spinlock: most callers will need those anyway
- * in order to probe the buffer's journaling state safely.
- */
-static void jbd_unexpected_dirty_buffer(struct journal_head *jh)
+static void warn_dirty_buffer(struct buffer_head *bh)
{
- int jlist;
-
- /* If this buffer is one which might reasonably be dirty
- * --- ie. data, or not part of this journal --- then
- * we're OK to leave it alone, but otherwise we need to
- * move the dirty bit to the journal's own internal
- * JBDDirty bit. */
- jlist = jh->b_jlist;
+ char b[BDEVNAME_SIZE];

- if (jlist == BJ_Metadata || jlist == BJ_Reserved ||
- jlist == BJ_Shadow || jlist == BJ_Forget) {
- struct buffer_head *bh = jh2bh(jh);
-
- if (test_clear_buffer_dirty(bh))
- set_buffer_jbddirty(bh);
- }
+ printk(KERN_WARNING
+ "JBD: Spotted dirty metadata buffer (dev = %s, blocknr = %llu). "
+ "There's a risk of filesystem corruption in case of system "
+ "crash.\n",
+ bdevname(bh->b_bdev, b), (unsigned long long)bh->b_blocknr);
}

/*
@@ -583,14 +564,16 @@ repeat:
if (jh->b_next_transaction)
J_ASSERT_JH(jh, jh->b_next_transaction ==
transaction);
+ warn_dirty_buffer(bh);
}
/*
* In any case we need to clean the dirty flag and we must
* do it under the buffer lock to be sure we don't race
* with running write-out.
*/
- JBUFFER_TRACE(jh, "Unexpected dirty buffer");
- jbd_unexpected_dirty_buffer(jh);
+ JBUFFER_TRACE(jh, "Journalling dirty buffer");
+ clear_buffer_dirty(bh);
+ set_buffer_jbddirty(bh);
}

unlock_buffer(bh);
@@ -826,6 +809,15 @@ int journal_get_create_access(handle_t *handle, struct buffer_head *bh)
J_ASSERT_JH(jh, buffer_locked(jh2bh(jh)));

if (jh->b_transaction == NULL) {
+ /*
+ * Previous journal_forget() could have left the buffer
+ * with jbddirty bit set because it was being committed. When
+ * the commit finished, we've filed the buffer for
+ * checkpointing and marked it dirty. Now we are reallocating
+ * the buffer so the transaction freeing it must have
+ * committed and so it's safe to clear the dirty bit.
+ */
+ clear_buffer_dirty(jh2bh(jh));
jh->b_transaction = transaction;

/* first access by this transaction */
@@ -1782,8 +1774,13 @@ static int __dispose_buffer(struct journal_head *jh, transaction_t *transaction)

if (jh->b_cp_transaction) {
JBUFFER_TRACE(jh, "on running+cp transaction");
+ /*
+ * We don't want to write the buffer anymore, clear the
+ * bit so that we don't confuse checks in
+ * __journal_file_buffer
+ */
+ clear_buffer_dirty(bh);
__journal_file_buffer(jh, transaction, BJ_Forget);
- clear_buffer_jbddirty(bh);
may_free = 0;
} else {
JBUFFER_TRACE(jh, "on running transaction");
@@ -2041,12 +2038,17 @@ void __journal_file_buffer(struct journal_head *jh,
if (jh->b_transaction && jh->b_jlist == jlist)
return;

- /* The following list of buffer states needs to be consistent
- * with __jbd_unexpected_dirty_buffer()'s handling of dirty
- * state. */

2009-07-21 10:04:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/5] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle()

Get rid of extenddisksize parameter of ext3_get_blocks_handle(). This seems to
be a relict from some old days and setting disksize in this function does not
make much sence. Currently it was set only by ext3_getblk(). Since the
parameter has some effect only if create == 1, it is easy to check that the
three callers which end up calling ext3_getblk() with create == 1 (ext3_append,
ext3_quota_write, ext3_mkdir) do the right thing and set disksize themselves.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext3/dir.c | 3 +--
fs/ext3/inode.c | 13 +++----------
include/linux/ext3_fs.h | 2 +-
3 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index 3d724a9..373fa90 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -130,8 +130,7 @@ static int ext3_readdir(struct file * filp,
struct buffer_head *bh = NULL;

map_bh.b_state = 0;
- err = ext3_get_blocks_handle(NULL, inode, blk, 1,
- &map_bh, 0, 0);
+ err = ext3_get_blocks_handle(NULL, inode, blk, 1, &map_bh, 0);
if (err > 0) {
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 4d7da6f..b49908a 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -788,7 +788,7 @@ err_out:
int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks,
struct buffer_head *bh_result,
- int create, int extend_disksize)
+ int create)
{
int err = -EIO;
int offsets[4];
@@ -911,13 +911,6 @@ int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
if (!err)
err = ext3_splice_branch(handle, inode, iblock,
partial, indirect_blks, count);
- /*
- * i_disksize growing is protected by truncate_mutex. Don't forget to
- * protect it if you're about to implement concurrent
- * ext3_get_block() -bzzz
- */
- if (!err && extend_disksize && inode->i_size > ei->i_disksize)
- ei->i_disksize = inode->i_size;
mutex_unlock(&ei->truncate_mutex);
if (err)
goto cleanup;
@@ -972,7 +965,7 @@ static int ext3_get_block(struct inode *inode, sector_t iblock,
}

ret = ext3_get_blocks_handle(handle, inode, iblock,
- max_blocks, bh_result, create, 0);
+ max_blocks, bh_result, create);
if (ret > 0) {
bh_result->b_size = (ret << inode->i_blkbits);
ret = 0;
@@ -1005,7 +998,7 @@ struct buffer_head *ext3_getblk(handle_t *handle, struct inode *inode,
dummy.b_blocknr = -1000;
buffer_trace_init(&dummy.b_history);
err = ext3_get_blocks_handle(handle, inode, block, 1,
- &dummy, create, 1);
+ &dummy, create);
/*
* ext3_get_blocks_handle() returns number of blocks
* mapped. 0 in case of a HOLE.
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 634a5e5..7499b36 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -874,7 +874,7 @@ struct buffer_head * ext3_getblk (handle_t *, struct inode *, long, int, int *);
struct buffer_head * ext3_bread (handle_t *, struct inode *, int, int, int *);
int ext3_get_blocks_handle(handle_t *handle, struct inode *inode,
sector_t iblock, unsigned long maxblocks, struct buffer_head *bh_result,
- int create, int extend_disksize);
+ int create);

extern struct inode *ext3_iget(struct super_block *, unsigned long);
extern int ext3_write_inode (struct inode *, int);
--
1.6.0.2


2009-07-21 10:04:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 5/5] jbd: fix race between write_metadata_buffer and get_write_access

From: dingdinghua <[email protected]>

The function journal_write_metadata_buffer() calls jbd_unlock_bh_state(bh_in)
too early; this could potentially allow another thread to call get_write_access
on the buffer head, modify the data, and dirty it, and allowing the wrong data
to be written into the journal. Fortunately, if we lose this race, the only
time this will actually cause filesystem corruption is if there is a system
crash or other unclean shutdown of the system before the next commit can take
place.

Signed-off-by: dingdinghua <[email protected]>
Acked-by: "Theodore Ts'o" <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/jbd/journal.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index 94a64a1..f96f850 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -287,6 +287,7 @@ int journal_write_metadata_buffer(transaction_t *transaction,
struct page *new_page;
unsigned int new_offset;
struct buffer_head *bh_in = jh2bh(jh_in);
+ journal_t *journal = transaction->t_journal;

/*
* The buffer really shouldn't be locked: only the current committing
@@ -300,6 +301,11 @@ int journal_write_metadata_buffer(transaction_t *transaction,
J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));

new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
+ /* keep subsequent assertions sane */
+ new_bh->b_state = 0;
+ init_buffer(new_bh, NULL, NULL);
+ atomic_set(&new_bh->b_count, 1);
+ new_jh = journal_add_journal_head(new_bh); /* This sleeps */

/*
* If a new transaction has already done a buffer copy-out, then
@@ -361,14 +367,6 @@ repeat:
kunmap_atomic(mapped_data, KM_USER0);
}

- /* keep subsequent assertions sane */
- new_bh->b_state = 0;
- init_buffer(new_bh, NULL, NULL);
- atomic_set(&new_bh->b_count, 1);
- jbd_unlock_bh_state(bh_in);
-
- new_jh = journal_add_journal_head(new_bh); /* This sleeps */

2009-07-21 16:20:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: Fail to load a journal if it is too short

On Tue, 21 Jul 2009 12:04:15 +0200 Jan Kara <[email protected]> wrote:

> Due to on disk corruption, it can happen that journal is too short. Fail
> to load it in such case so that we don't oops somewhere later.
>
> Reported-by: Nageswara R Sastry <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/jbd/journal.c | 6 ++++++
> 1 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index 737f724..94a64a1 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -848,6 +848,12 @@ static int journal_reset(journal_t *journal)
>
> first = be32_to_cpu(sb->s_first);
> last = be32_to_cpu(sb->s_maxlen);
> + if (first + JFS_MIN_JOURNAL_BLOCKS > last + 1) {
> + printk(KERN_ERR "JBD: Journal too short (blocks %lu-%lu).\n",
> + first, last);
> + journal_fail_superblock(journal);
> + return -EINVAL;
> + }
>
> journal->j_first = first;
> journal->j_last = last;

It's odd that sb->s_first/s_maxlen are 32-bit and
journal->j_first/j_last are unsigned long.

These things will only ever be 32-bit unless we change the journal
superblock.

2009-07-21 16:50:45

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: Fail to load a journal if it is too short

On Jul 21, 2009 09:19 -0700, Andrew Morton wrote:
> On Tue, 21 Jul 2009 12:04:15 +0200 Jan Kara <[email protected]> wrote:
> > Due to on disk corruption, it can happen that journal is too short. Fail
> > to load it in such case so that we don't oops somewhere later.
> >
> > Reported-by: Nageswara R Sastry <[email protected]>
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/jbd/journal.c | 6 ++++++
> > 1 files changed, 6 insertions(+), 0 deletions(-)
> >
> > diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> > index 737f724..94a64a1 100644
> > --- a/fs/jbd/journal.c
> > +++ b/fs/jbd/journal.c
> > @@ -848,6 +848,12 @@ static int journal_reset(journal_t *journal)
> >
> > first = be32_to_cpu(sb->s_first);
> > last = be32_to_cpu(sb->s_maxlen);
> > + if (first + JFS_MIN_JOURNAL_BLOCKS > last + 1) {
> > + printk(KERN_ERR "JBD: Journal too short (blocks %lu-%lu).\n",
> > + first, last);
> > + journal_fail_superblock(journal);
> > + return -EINVAL;
> > + }
> >
> > journal->j_first = first;
> > journal->j_last = last;
>
> It's odd that sb->s_first/s_maxlen are 32-bit and
> journal->j_first/j_last are unsigned long.
>
> These things will only ever be 32-bit unless we change the journal
> superblock.

The jbd on disk structure and APIs cannot handle 64-bit block numbers.
That is one of the first changes we made for jbd2 so that it is possible
to store either 32-bit or 64-bit block numbers in a transaction. I don't
think that needs to be fixed for the jbd code.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-07-22 01:08:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: Fail to load a journal if it is too short

On Tue, Jul 21, 2009 at 09:19:46AM -0700, Andrew Morton wrote:
>
> It's odd that sb->s_first/s_maxlen are 32-bit and
> journal->j_first/j_last are unsigned long.
>
> These things will only ever be 32-bit unless we change the journal
> superblock.

In general, if there is any use of "unsigned long" in fs/ext[34], it's
probably a bug. This is because ulong is 32-bits on x86, and 64-bits
on x86_64, so it just wastes memory space on 64-bit platforms. The
one exception to this is if the field in question is used by the
standard bitops functions, which only functions correctly on "unsigned
long".

- Ted

2009-07-22 09:52:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/5] jbd: Fail to load a journal if it is too short

On Tue 21-07-09 17:35:48, Theodore Tso wrote:
> On Tue, Jul 21, 2009 at 09:19:46AM -0700, Andrew Morton wrote:
> >
> > It's odd that sb->s_first/s_maxlen are 32-bit and
> > journal->j_first/j_last are unsigned long.
> >
> > These things will only ever be 32-bit unless we change the journal
> > superblock.
>
> In general, if there is any use of "unsigned long" in fs/ext[34], it's
> probably a bug. This is because ulong is 32-bits on x86, and 64-bits
> on x86_64, so it just wastes memory space on 64-bit platforms. The
> one exception to this is if the field in question is used by the
> standard bitops functions, which only functions correctly on "unsigned
> long".
That's a good point. I'll write a cleanup patch at least for the obvious
offenders.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR