Hi all,
Here are four patches against 3.17-rc4 to fix some minor problems
in jbd2 and ext4. None of these four depend on each other; they
fix separate small bugs.
The first patch fixes the journal_checksum feature flag handling at
mount time. This patch has been out for review on the list for a
while.
The second patch fixes external journal mounting so that the
superblock checksum (of the external journal) is verified if
metadata_csum is set. This is the same patch that has been out for
review for a few days.
The third patch fixes a journal_checksum_v3 replay bug -- if a block
is in a transaction, and then later revoked and written into another
transaction, and the block in the second transaction is corrupt, the
journal would fail even to write the block from the first transaction.
This would worsen the damage caused by a corrupt journal.
The fourth patch fixes an inline_data bug where we would release a page
but then keep using it, which resulted in complaints about freeing
locked pages at umount time or strange system crashes.
Patches are against 3.17-rc4, and have been xfstest'd and checked
against test journals created with debugfs. There's still a
hard-to-reproduce crash when ext4_destroy_inline_data_nolock tries to
remove the inline data xattr from a corrupt inode, so we'll see if I
can nail that one.
Comments and questions are, as always, welcome.
--D
Clear all three journal checksum feature flags before turning on
whichever journal checksum options we want. Rearrange the error
checking so that newer flags get complained about first.
Signed-off-by: Darrick J. Wong <[email protected]>
Reported-by: TR Reardon <[email protected]>
---
fs/ext4/super.c | 11 ++++++-----
fs/jbd2/journal.c | 16 ++++++++--------
2 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index bbf515c..7045f1d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3190,6 +3190,10 @@ static int set_journal_csum_feature_set(struct super_block *sb)
incompat = 0;
}
+ jbd2_journal_clear_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_CSUM_V3 |
+ JBD2_FEATURE_INCOMPAT_CSUM_V2);
if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
ret = jbd2_journal_set_features(sbi->s_journal,
compat, 0,
@@ -3202,11 +3206,8 @@ static int set_journal_csum_feature_set(struct super_block *sb)
jbd2_journal_clear_features(sbi->s_journal, 0, 0,
JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
} else {
- jbd2_journal_clear_features(sbi->s_journal,
- JBD2_FEATURE_COMPAT_CHECKSUM, 0,
- JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
- JBD2_FEATURE_INCOMPAT_CSUM_V3 |
- JBD2_FEATURE_INCOMPAT_CSUM_V2);
+ jbd2_journal_clear_features(sbi->s_journal, 0, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
}
return ret;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 19d74d8..7e70cd5 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1522,14 +1522,6 @@ static int journal_get_superblock(journal_t *journal)
goto out;
}
- if (jbd2_journal_has_csum_v2or3(journal) &&
- JBD2_HAS_COMPAT_FEATURE(journal, JBD2_FEATURE_COMPAT_CHECKSUM)) {
- /* Can't have checksum v1 and v2 on at the same time! */
- printk(KERN_ERR "JBD2: Can't enable checksumming v1 and v2 "
- "at the same time!\n");
- goto out;
- }
If the external journal device has metadata_csum enabled, verify
that the superblock checksum matches the block before we try to
mount.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/super.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 7045f1d..222ed5d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4372,6 +4372,15 @@ static journal_t *ext4_get_dev_journal(struct super_block *sb,
goto out_bdev;
}
+ if ((le32_to_cpu(es->s_feature_ro_compat) &
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
+ es->s_checksum != ext4_superblock_csum(sb, es)) {
+ ext4_msg(sb, KERN_ERR, "external journal has "
+ "corrupt superblock");
+ brelse(bh);
+ goto out_bdev;
+ }
+
if (memcmp(EXT4_SB(sb)->s_es->s_journal_uuid, es->s_uuid, 16)) {
ext4_msg(sb, KERN_ERR, "journal UUID does not match");
brelse(bh);
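The shape of the check added above (verify the stored checksum only
when the feature flag says one is present) can be sketched as follows.
Everything here is an assumption made for illustration: struct toy_sb
is not the ext4 superblock layout, TOY_RO_COMPAT_METADATA_CSUM is a
made-up flag value, endianness is ignored, and toy_crc32c() is a plain
bitwise CRC-32C that may differ in seed and final inversion from what
ext4_superblock_csum() computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_RO_COMPAT_METADATA_CSUM 0x1u /* hypothetical flag value */

/* Toy superblock: the checksum covers every byte before 'checksum'. */
struct toy_sb {
	uint32_t feature_ro_compat;
	uint8_t  payload[24];
	uint32_t checksum;
};

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t toy_crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;
	int bit;

	while (len--) {
		crc ^= *p++;
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

static uint32_t toy_sb_csum(const struct toy_sb *sb)
{
	return toy_crc32c(sb, offsetof(struct toy_sb, checksum));
}

/* Returns 1 if the superblock may be used, 0 if it looks corrupt. */
static int toy_sb_csum_ok(const struct toy_sb *sb)
{
	assert(sb != NULL);

	/* No checksum feature: nothing to verify. */
	if (!(sb->feature_ro_compat & TOY_RO_COMPAT_METADATA_CSUM))
		return 1;
	return sb->checksum == toy_sb_csum(sb);
}
```

A caller would bail out of the mount path when toy_sb_csum_ok() returns
0, mirroring the "goto out_bdev" in the hunk above.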
If, during a journal_checksum_v3 replay, we encounter a block that
doesn't match its tag in the descriptor block, we need to restart
the replay without the revoke table in the hopes of replaying the
newest non-corrupt version of the block that we possibly can.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/recovery.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 9b329b5..0094d8b 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -439,6 +439,7 @@ static int do_one_pass(journal_t *journal,
* block offsets): query the superblock.
*/
+restart_pass:
sb = journal->j_superblock;
next_commit_ID = be32_to_cpu(sb->s_sequence);
next_log_block = be32_to_cpu(sb->s_start);
@@ -585,7 +586,8 @@ static int do_one_pass(journal_t *journal,
/* If the block has been
* revoked, then we're all done
* here. */
- if (jbd2_journal_test_revoke
+ if (!block_error &&
+ jbd2_journal_test_revoke
(journal, blocknr,
next_commit_ID)) {
brelse(obh);
@@ -599,11 +601,24 @@ static int do_one_pass(journal_t *journal,
be32_to_cpu(tmp->h_sequence))) {
brelse(obh);
success = -EIO;
+ if (!block_error) {
+ /* If we see a corrupt
+ * block, kill the
+ * revoke list and
+ * restart the replay
+ * so that the blocks
+ * are as close to
+ * accurate as
+ * possible. */
+ jbd2_journal_clear_revoke(journal);
+ brelse(bh);
+ block_error = 1;
+ goto restart_pass;
+ }
printk(KERN_ERR "JBD2: Invalid "
"checksum recovering "
"block %llu in log\n",
blocknr);
- block_error = 1;
goto skip_write;
}
If inline->extent conversion fails (most probably due to ENOSPC) and
we release the temporary page that we allocated to transfer the file
contents, don't keep using the page pointer after releasing the page.
This occasionally leads to complaints about evicting locked pages or
hangs when blocksize > pagesize, because it's possible for the page to
get reallocated elsewhere in the meantime.
Signed-off-by: Darrick J. Wong <[email protected]>
Cc: Tao Ma <[email protected]>
---
fs/ext4/inline.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index bea662b..378aadf 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -594,6 +594,7 @@ retry:
if (ret) {
unlock_page(page);
page_cache_release(page);
+ page = NULL;
ext4_orphan_add(handle, inode);
up_write(&EXT4_I(inode)->xattr_sem);
sem_held = 0;
@@ -613,7 +614,8 @@ retry:
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
- block_commit_write(page, from, to);
+ if (page)
+ block_commit_write(page, from, to);
out:
if (page) {
unlock_page(page);
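The pattern behind this fix (drop the pointer as soon as the object is
released, then guard every later use) can be shown in a small userspace
model. The names struct toy_page, toy_page_get(), toy_page_release()
and toy_convert() are hypothetical; malloc/free merely stand in for the
page cache.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a locked page-cache page. */
struct toy_page {
	int locked;
	int committed;
};

static int toy_live_pages; /* pages handed out and not yet released */

static struct toy_page *toy_page_get(void)
{
	struct toy_page *p = calloc(1, sizeof(*p));

	if (p) {
		p->locked = 1;
		toy_live_pages++;
	}
	return p;
}

static void toy_page_release(struct toy_page *p)
{
	assert(p != NULL);
	free(p);
	toy_live_pages--;
}

/*
 * Model of the conversion path.  'convert_fails' simulates the failure
 * branch (e.g. ENOSPC).  Returns the number of commits performed, or -1
 * if no page could be allocated.
 */
static int toy_convert(int convert_fails)
{
	struct toy_page *page = toy_page_get();
	int commits = 0;

	if (!page)
		return -1;

	if (convert_fails) {
		page->locked = 0;
		toy_page_release(page);
		page = NULL; /* the fix: forget the stale pointer */
	}

	if (page) { /* guarded, like block_commit_write() in the patch */
		page->committed = 1;
		commits++;
	}

	if (page) { /* common exit path */
		page->locked = 0;
		toy_page_release(page);
	}
	return commits;
}
```

Without the "page = NULL" line, the failure branch would fall through
and touch freed memory twice, which is the userspace analogue of the
locked-page complaints described above.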
On Wed 10-09-14 17:28:38, Darrick J. Wong wrote:
> If, during a journal_checksum_v3 replay we encounter a block that
> doesn't match its tag in the descriptor block tag, we need to restart
> the replay without the revoke table in the hopes of replaying the
> newest non-corrupt version of the block that we possibly can.
Ho hum, I don't like this. If you just ignore the revoke list, you'll happily
overwrite freshly allocated data blocks with older metadata. Also, when
verifying the checksum, we already know the block hasn't been revoked,
so what's even the benefit of ignoring the revoke list?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 10-09-14 17:28:45, Darrick J. Wong wrote:
> If inline->extent conversion fails (most probably due to ENOSPC) and
> we release the temporary page that we allocated to transfer the file
> contents, don't keep using the page pointer after releasing the page.
> This occasionally leads to complaints about evicting locked pages or
> hangs when blocksize > pagesize, because it's possible for the page to
> get reallocated elsewhere in the meantime.
Good catch! You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 10-09-14 17:28:32, Darrick J. Wong wrote:
> If the external journal device has metadata_csum enabled, verify
> that the superblock checksum matches the block before we try to
> mount.
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
PS: On a general note, the way we are checking checksums in ext4 seems to be
a bit arbitrary. It would seem more robust to just have ext4_bread() take
the data type of the buffer and, if the buffer doesn't have buffer_verified
set, run the appropriate checksum check on the buffer. That way we are sure
that the buffer is checked whenever it's loaded from disk. ocfs2 and xfs
are doing it this way...
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Sep 11, 2014 at 03:17:53PM +0200, Jan Kara wrote:
> On Wed 10-09-14 17:28:45, Darrick J. Wong wrote:
> > If inline->extent conversion fails (most probably due to ENOSPC) and
> > we release the temporary page that we allocated to transfer the file
> > contents, don't keep using the page pointer after releasing the page.
> > This occasionally leads to complaints about evicting locked pages or
> > hangs when blocksize > pagesize, because it's possible for the page to
> > get reallocated elsewhere in the meantime.
> Good catch! You can add:
> Reviewed-by: Jan Kara <[email protected]>
Applied, thanks.
- Ted
On Wed, Sep 10, 2014 at 05:28:25PM -0700, Darrick J. Wong wrote:
> Clear all three journal checksum feature flags before turning on
> whichever journal checksum options we want. Rearrange the error
> checking so that newer flags get complained about first.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> Reported-by: TR Reardon <[email protected]>
Thanks, applied.
- Ted
On Thu, Sep 11, 2014 at 03:25:54PM +0200, Jan Kara wrote:
> On Wed 10-09-14 17:28:32, Darrick J. Wong wrote:
> > If the external journal device has metadata_csum enabled, verify
> > that the superblock checksum matches the block before we try to
> > mount.
> Looks good. You can add:
> Reviewed-by: Jan Kara <[email protected]>
Thanks, applied.
- Ted
On Thu, Sep 11, 2014 at 03:25:54PM +0200, Jan Kara wrote:
> On Wed 10-09-14 17:28:32, Darrick J. Wong wrote:
> > If the external journal device has metadata_csum enabled, verify
> > that the superblock checksum matches the block before we try to
> > mount.
> Looks good. You can add:
> Reviewed-by: Jan Kara <[email protected]>
>
> Honza
>
> PS: On a general note the way we are checking checksums in ext4 seems to be
> a bit arbitrary. It would seem more robust to just have ext4_bread() take
> data type of the buffer and if the buffer doesn't have buffer_verified set,
> it would run appropriate checksum check on the buffer. That way we are sure
> that the buffer is checked whenever it's loaded from disk. ocfs2 and xfs
> are doing it this way...
I agree that the current setup is rather ad hoc... but so is ext4.
Directories have to use ext4_bread; the extent tree uses sb_getblk and
bh_submit_read; the bitmaps seem to use submit_bh; and xattrs, mmp, and the
superblock use sb_bread.
That said, if I ever get around to the optimization patch that defers metadata
checksum calculation until journal transactions are being flushed to disk, this
seems like an easy and appropriate cleanup to go along with it.
--D
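As a rough illustration of the idea discussed above (a single read
helper that takes the buffer's data type and verifies it once), here is
a userspace sketch. It is not ext4 code: struct toy_buf, enum
toy_buf_type, the additive toy_sum() check, and toy_bread() are all
hypothetical stand-ins, and the 'verified' field only imitates the role
of buffer_verified.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer types; each seeds the toy checksum differently. */
enum toy_buf_type { TOY_BUF_DIR = 1, TOY_BUF_XATTR = 2 };

struct toy_buf {
	int     verified; /* plays the role of buffer_verified */
	uint8_t data[8];
	uint8_t stored_sum;
};

static int toy_verify_calls; /* how often the check actually ran */

/* Toy additive checksum seeded with the type; not a real ext4 csum. */
static uint8_t toy_sum(const struct toy_buf *b, enum toy_buf_type type)
{
	uint8_t sum = (uint8_t)type;
	size_t i;

	for (i = 0; i < sizeof(b->data); i++)
		sum = (uint8_t)(sum + b->data[i]);
	return sum;
}

/*
 * "Read" a buffer of the given type.  The check runs only while the
 * buffer is not yet marked verified; a failed check leaves it unmarked
 * so a later read will check again.  Returns 0 on success, -1 on a
 * checksum mismatch.
 */
static int toy_bread(struct toy_buf *b, enum toy_buf_type type)
{
	assert(b != NULL);

	if (b->verified)
		return 0;

	toy_verify_calls++;
	if (toy_sum(b, type) != b->stored_sum)
		return -1;

	b->verified = 1;
	return 0;
}
```

Funnelling every reader through one helper like this means a buffer
cannot reach a caller unchecked, whichever low-level read path loaded
it.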
On Thu, Sep 11, 2014 at 03:15:11PM +0200, Jan Kara wrote:
> On Wed 10-09-14 17:28:38, Darrick J. Wong wrote:
> > If, during a journal_checksum_v3 replay we encounter a block that
> > doesn't match its tag in the descriptor block tag, we need to restart
> > the replay without the revoke table in the hopes of replaying the
> > newest non-corrupt version of the block that we possibly can.
> Ho hum, I don't like this. If you just ignore revoke list, you'll happily
> overwrite freshly allocated data blocks with older metadata. Also when
> verifying the checksum, we already know the block hasn't been revoked
> so what's even the benefit of ignoring the revoke list?
Let's say block 1 contains contents B0 and the journal contains:
1. write block 1 with B1
2. revoke "write of block 1 (with B1)"
3. write block 1 with B2
Now say that B2 gets corrupt, which means that #3 won't get replayed. Because
the revoke in #2 prevented the write in #1 from being written, at the end of
replay, block 1 has contents B0, even though B1 could have been played back.
What I'm really confused about is the intent of revoke records -- do they exist
to say "don't replay older versions of this block; a new one will follow
later"? Or do they mean only "don't replay this block if it exists in an earlier
transaction" either because a newer block will follow OR because that block is
now something non-journalled (i.e. file data)? I started off thinking the
first, but perhaps it's really the second.
Rather than dumping the entire revoke list, I think I can just erase the
previous revoke records for just the corrupt block and then restart the replay.
--D
On Thu, Sep 11, 2014 at 10:30:09AM -0700, Darrick J. Wong wrote:
> On Thu, Sep 11, 2014 at 03:15:11PM +0200, Jan Kara wrote:
> > On Wed 10-09-14 17:28:38, Darrick J. Wong wrote:
> > > If, during a journal_checksum_v3 replay we encounter a block that
> > > doesn't match its tag in the descriptor block tag, we need to restart
> > > the replay without the revoke table in the hopes of replaying the
> > > newest non-corrupt version of the block that we possibly can.
> > Ho hum, I don't like this. If you just ignore revoke list, you'll happily
> > overwrite freshly allocated data blocks with older metadata. Also when
> > verifying the checksum, we already know the block hasn't been revoked
> > so what's even the benefit of ignoring the revoke list?
>
> Let's say block X contains contents B0 and the journal contains:
>
> 1. write block 1 with B1
> 2. revoke "write of block 1 (with B1)"
> 3. write block 1 with B2
>
> Now say that B2 gets corrupt, which means that #3 won't get replayed. Because
> the revoke in #2 prevented the write in #1 from being written, at the end of
> replay, block 1 has contents B0, even though B1 could have been played back.
>
> What I'm really confused about is the intent of revoke records -- do they exist
> to say "don't replay older versions of this block; a new one will follow
> later"? Or they mean only "don't replay this block if it exists in an earlier
> transaction" either because a newer block will follow OR because that block is
> now something non-journalled (i.e. file data)? I started off thinking the
> first, but perhaps it's really the second.
Ahh, I get it. Revoke records are used only to indicate that a particular
block that's in the journal has become an un-journalled block; a subsequent
re-add to the journal removes the revoke record. Therefore, we can drop the
whole patch because the scenario above is not valid.
Sorry for the churn.
--D
When loading extended attributes, check each entry's value offset to
make sure it doesn't collide with the entries.
Without this check it is easy to crash the kernel by mounting a
malicious FS containing a file with an EA wherein e_value_offs = 0 and
e_value_size > 0 and then deleting the EA, which corrupts the name
list.
(See the f_ea_value_crash test's FS image in e2fsprogs for an example.)
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/xattr.c | 34 +++++++++++++++++++++++++---------
1 file changed, 25 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e738733..c11738e 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -189,15 +189,29 @@ ext4_listxattr(struct dentry *dentry, char *buffer, size_t size)
return ext4_xattr_list(dentry, buffer, size);
}
-static int
-ext4_xattr_check_names(struct ext4_xattr_entry *entry, void *end)
+int
+ext4_xattr_check_names(struct ext4_xattr_entry *entry, void *end,
+ void *value_start)
{
- while (!IS_LAST_ENTRY(entry)) {
- struct ext4_xattr_entry *next = EXT4_XATTR_NEXT(entry);
+ struct ext4_xattr_entry *e = entry;
+
+ while (!IS_LAST_ENTRY(e)) {
+ struct ext4_xattr_entry *next = EXT4_XATTR_NEXT(e);
if ((void *)next >= end)
return -EIO;
- entry = next;
+ e = next;
}
+
+ while (!IS_LAST_ENTRY(entry)) {
+ if (entry->e_value_size != 0 &&
+ (value_start + le16_to_cpu(entry->e_value_offs) <
+ (void *)e + sizeof(__u32) ||
+ value_start + le16_to_cpu(entry->e_value_offs) +
+ le32_to_cpu(entry->e_value_size) > end))
+ return -EIO;
+ entry = EXT4_XATTR_NEXT(entry);
+ }
+
return 0;
}
@@ -214,7 +228,8 @@ ext4_xattr_check_block(struct inode *inode, struct buffer_head *bh)
return -EIO;
if (!ext4_xattr_block_csum_verify(inode, bh->b_blocknr, BHDR(bh)))
return -EIO;
- error = ext4_xattr_check_names(BFIRST(bh), bh->b_data + bh->b_size);
+ error = ext4_xattr_check_names(BFIRST(bh), bh->b_data + bh->b_size,
+ bh->b_data);
if (!error)
set_buffer_verified(bh);
return error;
@@ -331,7 +346,7 @@ ext4_xattr_ibody_get(struct inode *inode, int name_index, const char *name,
header = IHDR(inode, raw_inode);
entry = IFIRST(header);
end = (void *)raw_inode + EXT4_SB(inode->i_sb)->s_inode_size;
- error = ext4_xattr_check_names(entry, end);
+ error = ext4_xattr_check_names(entry, end, entry);
if (error)
goto cleanup;
error = ext4_xattr_find_entry(&entry, name_index, name,
@@ -463,7 +478,7 @@ ext4_xattr_ibody_list(struct dentry *dentry, char *buffer, size_t buffer_size)
raw_inode = ext4_raw_inode(&iloc);
header = IHDR(inode, raw_inode);
end = (void *)raw_inode + EXT4_SB(inode->i_sb)->s_inode_size;
- error = ext4_xattr_check_names(IFIRST(header), end);
+ error = ext4_xattr_check_names(IFIRST(header), end, IFIRST(header));
if (error)
goto cleanup;
error = ext4_xattr_list_entries(dentry, IFIRST(header),
@@ -986,7 +1001,8 @@ int ext4_xattr_ibody_find(struct inode *inode, struct ext4_xattr_info *i,
is->s.here = is->s.first;
is->s.end = (void *)raw_inode + EXT4_SB(inode->i_sb)->s_inode_size;
if (ext4_test_inode_state(inode, EXT4_STATE_XATTR)) {
- error = ext4_xattr_check_names(IFIRST(header), is->s.end);
+ error = ext4_xattr_check_names(IFIRST(header), is->s.end,
+ IFIRST(header));
if (error)
return error;
/* Find the named attribute. */
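The bounds check this patch adds can be modelled in a few lines of
userspace C. This is a simplified sketch, not the kernel function:
struct toy_xattr_entry, toy_check_values(), and the explicit
entries_end/region_size parameters are assumptions made for
illustration, and -1 stands in for -EIO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy entry: where the value lives inside the xattr region. */
struct toy_xattr_entry {
	uint16_t value_offs; /* offset from the start of the region */
	uint32_t value_size; /* 0 means "no value stored" */
};

/*
 * Check that every non-empty value lies entirely between the end of
 * the entry table (entries_end) and the end of the region
 * (region_size).  Returns 0 if all values are sane, -1 otherwise.
 */
static int toy_check_values(const struct toy_xattr_entry *entries,
			    size_t count, uint32_t entries_end,
			    uint32_t region_size)
{
	size_t i;

	assert(entries != NULL || count == 0);

	for (i = 0; i < count; i++) {
		/* 64-bit math so offs + size cannot wrap around. */
		uint64_t start = entries[i].value_offs;
		uint64_t end = start + entries[i].value_size;

		if (entries[i].value_size == 0)
			continue;
		if (start < entries_end || end > region_size)
			return -1;
	}
	return 0;
}
```

An entry with value_offs = 0 and value_size > 0, as in the crash case
described above, overlaps the entry table and is rejected by the first
comparison.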
On Thu 11-09-14 10:43:29, Darrick J. Wong wrote:
> On Thu, Sep 11, 2014 at 10:30:09AM -0700, Darrick J. Wong wrote:
> > On Thu, Sep 11, 2014 at 03:15:11PM +0200, Jan Kara wrote:
> > > On Wed 10-09-14 17:28:38, Darrick J. Wong wrote:
> > > > If, during a journal_checksum_v3 replay we encounter a block that
> > > > doesn't match its tag in the descriptor block tag, we need to restart
> > > > the replay without the revoke table in the hopes of replaying the
> > > > newest non-corrupt version of the block that we possibly can.
> > > Ho hum, I don't like this. If you just ignore revoke list, you'll happily
> > > overwrite freshly allocated data blocks with older metadata. Also when
> > > verifying the checksum, we already know the block hasn't been revoked
> > > so what's even the benefit of ignoring the revoke list?
> >
> > Let's say block X contains contents B0 and the journal contains:
> >
> > 1. write block 1 with B1
> > 2. revoke "write of block 1 (with B1)"
> > 3. write block 1 with B2
> >
> > Now say that B2 gets corrupt, which means that #3 won't get replayed. Because
> > the revoke in #2 prevented the write in #1 from being written, at the end of
> > replay, block 1 has contents B0, even though B1 could have been played back.
> >
> > What I'm really confused about is the intent of revoke records -- do they exist
> > to say "don't replay older versions of this block; a new one will follow
> > later"? Or they mean only "don't replay this block if it exists in an earlier
> > transaction" either because a newer block will follow OR because that block is
> > now something non-journalled (i.e. file data)? I started off thinking the
> > first, but perhaps it's really the second.
>
> Ahh, I get it. Revoke records are used only to indicate that a particular
> block that's in the journal has become an un-journalled block; a subsequent
Yup, exactly.
> re-add to the journal removes the revoke record.
Well, not quite. A block is revoked in some transaction (and that
information is stored in that transaction in the journal). Thus we don't
replay that block in older transactions. If in your example B2 gets
corrupt, replaying B1 makes no sense because the existence of the revoke
record means that the block has been reused for data. So the metadata in
B1 is hopelessly outdated anyway.
Honza
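The three-record scenario from this thread can be replayed in a toy
model to see why ignoring revokes is harmful. This is a sketch under
stated assumptions, not jbd2's recovery code: struct toy_rec and
toy_replay() are hypothetical, transaction IDs start at 1, and the rule
"skip a write if the same block is revoked in the same or a later
transaction" is this model's simplification of the revoke test.

```c
#include <assert.h>
#include <stddef.h>

enum toy_kind { TOY_WRITE, TOY_REVOKE };

/* One toy journal record. */
struct toy_rec {
	unsigned      tid;     /* transaction ID, starting at 1 */
	enum toy_kind kind;
	unsigned      blocknr;
	int           value;   /* payload for writes */
	int           corrupt; /* nonzero: checksum would fail */
};

#define TOY_NBLOCKS 4

/*
 * Replay 'log' onto 'disk'.  With honor_revokes set, a write is
 * skipped when the same block is revoked in the same or a later
 * transaction.  Corrupt writes are never replayed.
 */
static void toy_replay(const struct toy_rec *log, size_t n, int *disk,
		       int honor_revokes)
{
	unsigned revoke_tid[TOY_NBLOCKS] = { 0 }; /* 0 = never revoked */
	size_t i;

	/* Scan pass: remember the newest revoke per block. */
	for (i = 0; i < n; i++) {
		assert(log[i].blocknr < TOY_NBLOCKS);
		if (log[i].kind == TOY_REVOKE &&
		    log[i].tid > revoke_tid[log[i].blocknr])
			revoke_tid[log[i].blocknr] = log[i].tid;
	}

	/* Replay pass: apply the writes that survive the filters. */
	for (i = 0; i < n; i++) {
		const struct toy_rec *r = &log[i];

		if (r->kind != TOY_WRITE || r->corrupt)
			continue;
		if (honor_revokes && r->tid <= revoke_tid[r->blocknr])
			continue;
		disk[r->blocknr] = r->value;
	}
}
```

With block 1 holding file data on disk, a log of "write B1 (tid 1),
revoke (tid 2), corrupt write B2 (tid 3)" leaves the data untouched
when revokes are honored, while replaying without revokes overwrites
that data with the stale B1.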
--
Jan Kara <[email protected]>
SUSE Labs, CR
Trying to follow your description below, but still have some confusion.
In the most common mount case of metadata-only journalling (no data journalling), revokes are emitted when extent blocks or directory blocks are released and reused as data blocks? I.e., updating a metadata block in place will never yield a revoke record (inodes, bitmaps, etc.)?
--- Original Message ---
From: "Jan Kara" <[email protected]>
Sent: September 12, 2014 5:59 AM
To: "Darrick J. Wong" <[email protected]>
Cc: "Jan Kara" <[email protected]>, [email protected], [email protected]
Subject: Re: [PATCH 3/4] jbd2: restart replay without revokes if journal block csum fails
On Thu 11-09-14 10:43:29, Darrick J. Wong wrote:
> On Thu, Sep 11, 2014 at 10:30:09AM -0700, Darrick J. Wong wrote:
> > On Thu, Sep 11, 2014 at 03:15:11PM +0200, Jan Kara wrote:
> > > On Wed 10-09-14 17:28:38, Darrick J. Wong wrote:
> > > > If, during a journal_checksum_v3 replay, we encounter a block that
> > > > doesn't match its tag in the descriptor block, we need to restart
> > > > the replay without the revoke table in the hopes of replaying the
> > > > newest non-corrupt version of the block that we possibly can.
> > > Ho hum, I don't like this. If you just ignore the revoke list, you'll happily
> > > overwrite freshly allocated data blocks with older metadata. Also when
> > > verifying the checksum, we already know the block hasn't been revoked
> > > so what's even the benefit of ignoring the revoke list?
> >
> > Let's say block 1 contains contents B0 and the journal contains:
> >
> > 1. write block 1 with B1
> > 2. revoke "write of block 1 (with B1)"
> > 3. write block 1 with B2
> >
> > Now say that B2 gets corrupt, which means that #3 won't get replayed. Because
> > the revoke in #2 prevented the write in #1 from being written, at the end of
> > replay, block 1 has contents B0, even though B1 could have been played back.
> >
> > What I'm really confused about is the intent of revoke records -- do they exist
> > to say "don't replay older versions of this block; a new one will follow
> > later"? Or do they mean only "don't replay this block if it exists in an earlier
> > transaction" either because a newer block will follow OR because that block is
> > now something non-journalled (i.e. file data)? I started off thinking the
> > first, but perhaps it's really the second.
>
> Ahh, I get it. Revoke records are used only to indicate that a particular
> block that's in the journal has become an un-journalled block; a subsequent
Yup, exactly.
> re-add to the journal removes the revoke record.
Well, not quite. A block is revoked in some transaction (and that
information is stored in that transaction in the journal). Thus we don't
replay that block from older transactions. If in your example B2 gets
corrupted, replaying B1 makes no sense because the existence of the revoke
record means that the block has been reused for data. So the metadata in B1
is hopelessly outdated anyway.
Honza
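
To make the B0/B1/B2 scenario above concrete, here is a minimal sketch in C of replay over a toy log. Everything in it (struct rec, is_revoked(), replay_block(), example_log) is a hypothetical name made up for illustration; it is not the jbd2 on-disk format or API. The one rule it assumes is that a journalled write is skipped when a revoke record for the same block sits in the same or a later transaction.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy log record; NOT the jbd2 on-disk layout. */
enum rec_type { REC_WRITE, REC_REVOKE };

struct rec {
	enum rec_type type;
	unsigned int tid;	/* transaction sequence number */
	unsigned long blocknr;	/* filesystem block the record refers to */
	char contents;		/* payload of a REC_WRITE, e.g. '1' for B1 */
	bool corrupt;		/* models a checksum mismatch on the payload */
};

/*
 * Assumed rule: a write in transaction `tid` is revoked if any revoke
 * record for the same block lives in transaction `tid` or later.
 */
bool is_revoked(const struct rec *log, size_t n,
		unsigned long blocknr, unsigned int tid)
{
	for (size_t i = 0; i < n; i++)
		if (log[i].type == REC_REVOKE &&
		    log[i].blocknr == blocknr && log[i].tid >= tid)
			return true;
	return false;
}

/* Replay every write to `blocknr` in log order; return the final contents. */
char replay_block(const struct rec *log, size_t n, unsigned long blocknr,
		  char on_disk, bool honor_revokes)
{
	for (size_t i = 0; i < n; i++) {
		const struct rec *r = &log[i];

		if (r->type != REC_WRITE || r->blocknr != blocknr)
			continue;
		if (honor_revokes && is_revoked(log, n, blocknr, r->tid))
			continue;	/* block was revoked later on */
		if (r->corrupt)
			continue;	/* checksum failure: skip this write */
		on_disk = r->contents;
	}
	return on_disk;
}

/* The three-step journal from the example, with B2 corrupted. */
const struct rec example_log[] = {
	{ REC_WRITE,  1, 1, '1', false },	/* 1. write block 1 with B1 */
	{ REC_REVOKE, 2, 1, 0,   false },	/* 2. revoke block 1 */
	{ REC_WRITE,  3, 1, '2', true  },	/* 3. write block 1 with B2 */
};
```

With this log, replay_block(example_log, 3, 1, '0', true) leaves block 1 at B0, while passing false for honor_revokes writes B1. In this toy model that second outcome is the case objected to above: the revoke says the block was reused for data, so B1 would be stale metadata written over it.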
> > Rather than dumping the entire revoke list, I think I can just erase the
> > previous revoke records for just the corrupt block and then restart the replay.
> >
> > --D
> >
> > >
> > > Honza
> > >
> > > > Signed-off-by: Darrick J. Wong <[email protected]>
> > > > ---
> > > > fs/jbd2/recovery.c | 19 +++++++++++++++++--
> > > > 1 file changed, 17 insertions(+), 2 deletions(-)
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 12-09-14 09:14:31, TR Reardon wrote:
> Trying to follow your description below, but still have some confusion.
>
> In the most common mount case of metadata-only journalling (no data
> journalling), revokes are emitted when extent blocks or directory blocks
> are released and reused as data blocks? I.e., updating a metadata block
Yes.
> in place will never yield a revoke record (inodes, bitmaps, etc.)?
Yes.
Honza
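
The two answers above can be summed up in a small C sketch of when a revoke record gets emitted in metadata-only journalling. This is a toy model only: struct toy_txn, toy_update_metadata() and toy_free_metadata() are hypothetical names invented here and do not correspond to the real ext4 or jbd2 functions.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_REVOKES 16

/* Hypothetical toy transaction: counts logged blocks, collects revokes. */
struct toy_txn {
	size_t nr_logged;			/* metadata blocks journalled */
	size_t nr_revokes;			/* revoke records emitted */
	unsigned long revoked[TOY_MAX_REVOKES];	/* revoked block numbers */
};

/*
 * In-place update of a metadata block (inode table, bitmap, ...):
 * the new contents are journalled and no revoke record is emitted.
 */
void toy_update_metadata(struct toy_txn *t, unsigned long blocknr)
{
	(void)blocknr;		/* the toy model only counts logged blocks */
	t->nr_logged++;
}

/*
 * Freeing a journalled metadata block (extent or directory block) so
 * that it can be reused for unjournalled file data: emit a revoke so
 * that older journalled copies are not replayed over the new data.
 * Returns false if the toy revoke table is full.
 */
bool toy_free_metadata(struct toy_txn *t, unsigned long blocknr)
{
	if (t->nr_revokes == TOY_MAX_REVOKES)
		return false;
	t->revoked[t->nr_revokes++] = blocknr;
	return true;
}
```

In this model only toy_free_metadata() ever adds to the revoke list, which matches the two "Yes." answers: rewriting an inode or bitmap block in place goes through toy_update_metadata() and leaves nr_revokes untouched.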
--
Jan Kara <[email protected]>
SUSE Labs, CR