2011-10-08 07:55:59

by djwong

[permalink] [raw]
Subject: [PATCH v2 00/28] ext4: Add metadata checksumming

Hi all,

This patchset adds crc32c checksums to most of the ext4 metadata objects. A
full design document is on the ext4 wiki[1] but I will summarize that document here.

As much as we wish our storage hardware was totally reliable, it is still
quite possible for data to be corrupted on disk, corrupted during transfer over
a wire, or written to the wrong places. To protect against this sort of
non-hostile corruption, it is desirable to store checksums of metadata objects
on the filesystem to prevent broken metadata from shredding the filesystem.

The crc32c polynomial was chosen for its improved error detection capabilities
over crc32 and crc16, and because of its hardware acceleration on current and
upcoming Intel and Sparc chips.

Each type of metadata object has been retrofitted to store a checksum as follows:

- The superblock stores a crc32c of itself.
- Each inode stores crc32c(fs_uuid + inode_num + inode_gen + inode +
slack_space_after_inode)
- Block and inode bitmaps each get their own crc32c(fs_uuid + group_num +
bitmap), stored in the block group descriptor.
- Each extent tree block stores a crc32c(fs_uuid + inode_num + inode_gen +
extent_entries) in unused space at the end of the block.
- Each directory leaf block has an unused-looking directory entry big enough to
store a crc32c(fs_uuid + inode_num + inode_gen + block) at the end of the
block.
- Each directory htree block is shortened to contain a crc32c(fs_uuid +
inode_num + inode_gen + block) at the end of the block.
- Extended attribute blocks store crc32c(fs_uuid + id + ea_block) in the
header, where id is, depending on the refcount, either the inode_num and
inode_gen; or the block number.
- MMP blocks store crc32c(fs_uuid + mmpblock) at the end of the MMP block.
- Block groups can now use crc32c instead of crc16.
- The journal now has a v2 checksum feature flag.
- crc32c(j_uuid + block) checksums have been inserted into descriptor blocks,
commit blocks, revoke blocks, and the journal superblock.
- Each block tag in a descriptor block has a checksum of the related data block.

The first five patches in the kernel patchset fix existing bugs in ext4 that
cause incorrect checkums to be written. I think Ted already took them, but
they don't appear in 3.1.0-rc9. The subsequent patches add the necessary code
to support checksumming in ext4 and jbd2.

I also have a set patches that provide a faster crc32c implementation based on
Bob Pearson's earlier crc32 patchset. This has been sent under separate cover
to the crypto list and to lkml/linux-ext4.

The patchset for e2fsprogs will be sent under separate cover only to linux-ext4
as it is quite lengthy (~48 patches).

As far as performance impact goes, I see nearly no change with a standard mail
server ffsb simulation. On a test that involves only file creation and
deletion and extent tree modifications, I see a drop of about 50 percent with
the current kernel crc32c implementation; this improves to a drop of about 20
percent with the enclosed crc32c implementation. However, given that metadata
is usually a small fraction of total IO, it doesn't seem like the cost of
enabling this feature is unreasonable.

There are of course unresolved issues:

- I haven't fixed it up to checksum the exclude bitmap yet. I'll probably
submit that as an add-on to the snapshot patchset.

- Using the journal commit hooks to delay crc32c calculation until dirty
buffers are actually being written to disk.

- Interaction with online resize code. Yongqiang seems to be in the process of
rewriting this, so I haven't looked at it very closely yet.

Please have a look at the design document and patches, and please feel free to
suggest any changes.

v2: Checksum the MMP block, store the checksum type in the superblock, include
the inode generation in file checksums, and finally solve the problem of limited
space in block groups by splitting the checksum into two halves.

This patchset has been tested on 3.1.0-rc9 on x64, i386, ppc64, and ppc32. The
patches seems to work fine on all four platforms.

--D

[1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums



2011-10-08 07:56:29

by djwong

[permalink] [raw]
Subject: [PATCH 02/28] ext4: ext4_rename should dirty dir_bh with the correct directory

When ext4_rename performs a directory rename (move), dir_bh is a buffer that is
modified to update the '..' link in the directory being moved (old_inode).
However, ext4_handle_dirty_metadata is called with the old parent directory
inode (old_dir) and dir_bh, which is incorrect because dir_bh does not belong
to the parent inode. Fix this error.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 310b356..6d3fab4 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2530,7 +2530,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
cpu_to_le32(new_dir->i_ino);
BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
- retval = ext4_handle_dirty_metadata(handle, old_dir, dir_bh);
+ retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
if (retval) {
ext4_std_error(old_dir->i_sb, retval);
goto end_rename;


2011-10-08 07:53:50

by djwong

[permalink] [raw]
Subject: [PATCH 01/28] ext4: ext4_dx_add_entry should dirty directory metadata with the directory inode

ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads
that point to directory blocks assigned to the directory inode. However, the
function calls ext4_handle_dirty_metadata with the inode of the file that's
being added to the directory, not the directory inode itself. Therefore,
correct the code to dirty the directory buffers with the directory inode, not
the file inode.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 1c924fa..310b356 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1586,7 +1586,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
dxtrace(dx_show_index("node", frames[1].entries));
dxtrace(dx_show_index("node",
((struct dx_node *) bh2->b_data)->entries));
- err = ext4_handle_dirty_metadata(handle, inode, bh2);
+ err = ext4_handle_dirty_metadata(handle, dir, bh2);
if (err)
goto journal_error;
brelse (bh2);
@@ -1612,7 +1612,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
if (err)
goto journal_error;
}
- err = ext4_handle_dirty_metadata(handle, inode, frames[0].bh);
+ err = ext4_handle_dirty_metadata(handle, dir, frames[0].bh);
if (err) {
ext4_std_error(inode->i_sb, err);
goto cleanup;

2011-10-08 07:54:03

by djwong

[permalink] [raw]
Subject: [PATCH 03/28] ext4: ext4_mkdir should dirty dir_block with the parent inode

ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
Unfortunately, dir_block belongs to the newly created directory (which is
"inode"), not the parent directory (which is "dir"). Fix the incorrect
association.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 6d3fab4..50c7294 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1863,7 +1863,7 @@ retry:
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
inode->i_nlink = 2;
BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, dir, dir_block);
+ err = ext4_handle_dirty_metadata(handle, inode, dir_block);
if (err)
goto out_clear_inode;
err = ext4_mark_inode_dirty(handle, inode);

2011-10-08 07:56:58

by djwong

[permalink] [raw]
Subject: [PATCH 05/28] ext4: Fix endian problem in MMP initialization

As part of startup, the MMP initialization code does this:

mmp->mmp_seq = seq = cpu_to_le32(mmp_new_seq());

Next, mmp->mmp_seq is written out to disk, a delay happens, and then the MMP
block is read back in and the sequence value is tested:

if (seq != le32_to_cpu(mmp->mmp_seq)) {
/* fail the mount */

On a LE system such as x86, the *le32* functions do nothing and this works.
Unfortunately, on a BE system such as ppc64, this comparison becomes:

if (cpu_to_le32(new_seq) != le32_to_cpu(cpu_to_le32(new_seq)) {
/* fail the mount */

Except for a few palindromic sequence numbers, this test always causes the
mount to fail, which makes MMP filesystems generally unmountable on ppc64. The
attached patch fixes this situation.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/mmp.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index 9bdef3f..a7a4986 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -295,7 +295,8 @@ skip:
/*
* write a new random sequence number.
*/
- mmp->mmp_seq = seq = cpu_to_le32(mmp_new_seq());
+ seq = mmp_new_seq();
+ mmp->mmp_seq = cpu_to_le32(seq);

retval = write_mmp_block(bh);
if (retval)


2011-10-08 07:54:10

by djwong

[permalink] [raw]
Subject: [PATCH 04/28] ext4: Prevent stack overrun in ext4_file_open when recording last known mountpoint

In ext4_file_open, the filesystem records the mountpoint of the first file that
is opened after mounting the filesystem. It does this by allocating a 64-byte
stack buffer, calling d_path() to grab the mount point through which this file
was accessed, and then memcpy()ing 64 bytes into the superblock's
s_last_mounted field, starting from the return value of d_path(), which is
stored as "cp". However, if cp > buf (which it frequently is since path
components are prepended starting at the end of buf) then we can end up copying
stack data into the superblock.

Writing stack variables into the superblock doesn't sound like a great idea, so
use strlcpy instead. Andi Kleen suggested using strlcpy instead of strncpy.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/file.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index e4095e9..9781099 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -181,8 +181,8 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
path.dentry = mnt->mnt_root;
cp = d_path(&path, buf, sizeof(buf));
if (!IS_ERR(cp)) {
- memcpy(sbi->s_es->s_last_mounted, cp,
- sizeof(sbi->s_es->s_last_mounted));
+ strlcpy(sbi->s_es->s_last_mounted, cp,
+ sizeof(sbi->s_es->s_last_mounted));
ext4_mark_super_dirty(sb);
}
}

2011-10-08 07:54:29

by djwong

[permalink] [raw]
Subject: [PATCH 07/28] ext4: Create a rocompat flag for extended metadata checksumming

This patch introduces a rocompat feature flag to signal the presence of
checksums for metadata blocks. It also provides storage for a precomputed UUID
checksum.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 7 ++++++-
1 files changed, 6 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b7d7bd0..8ee8646 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1222,6 +1222,9 @@ struct ext4_sb_info {

/* record the last minlen when FITRIM is called. */
atomic_t s_last_trim_minblks;
+
+ /* Precomputed FS UUID checksum */
+ __u32 s_uuid_crc;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1360,6 +1363,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
#define EXT4_FEATURE_RO_COMPAT_QUOTA 0x0100
+#define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400

#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
@@ -1402,7 +1406,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
- EXT4_FEATURE_RO_COMPAT_HUGE_FILE)
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)

/*
* Default values for user and/or group using reserved blocks

2011-10-08 07:55:29

by djwong

[permalink] [raw]
Subject: [PATCH 16/28] ext4: Verify and calculate checksums for extent tree blocks

Calculate and verify the checksum for each extent tree block. The checksum is
located in the space immediately after the last possible ext4_extent in the
block. The space is is typically the last 4-8 bytes in the block.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4_extents.h | 25 ++++++++++++++++++++-
fs/ext4/extents.c | 58 +++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 5 deletions(-)


diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index 095c36f..24b106a 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -62,10 +62,22 @@
/*
* ext4_inode has i_block array (60 bytes total).
* The first 12 bytes store ext4_extent_header;
- * the remainder stores an array of ext4_extent.
+ * the remainder stores an array of ext4_extent,
+ * followed by ext4_extent_tail.
*/

/*
+ * This is the extent tail on-disk structure.
+ * All other extent structures are 12 bytes long. It turns out that
+ * block_size % 12 >= 4 for all valid block sizes (1k, 2k, 4k).
+ * Therefore, this tail structure can be crammed into the end of the block
+ * without having to rebalance the tree.
+ */
+struct ext4_extent_tail {
+ __le32 et_checksum; /* crc32c(uuid+inum+extent_block) */
+};
+
+/*
* This is the extent on-disk structure.
* It's used at the bottom of the tree.
*/
@@ -101,6 +113,17 @@ struct ext4_extent_header {

#define EXT4_EXT_MAGIC cpu_to_le16(0xf30a)

+#define EXT4_EXTENT_TAIL_OFFSET(hdr) \
+ (sizeof(struct ext4_extent_header) + \
+ (sizeof(struct ext4_extent) * le16_to_cpu((hdr)->eh_max)))
+
+static inline struct ext4_extent_tail *
+find_ext4_extent_tail(struct ext4_extent_header *eh)
+{
+ return (struct ext4_extent_tail *)(((void *)eh) +
+ EXT4_EXTENT_TAIL_OFFSET(eh));
+}
+
/*
* Array of ext4_ext_path contains path to some extent.
* Creation/lookup routines use it for traversal/splitting/etc.
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 4ac4303..efd644d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -46,6 +46,46 @@

#include <trace/events/ext4.h>

+static __le32 ext4_extent_block_csum(struct inode *inode,
+ struct ext4_extent_header *eh)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ __u32 crc;
+
+ crc = ext4_chksum(sbi, ei->i_uuid_inum_crc, (__u8 *)eh,
+ EXT4_EXTENT_TAIL_OFFSET(eh));
+ return cpu_to_le32(crc);
+}
+
+static int ext4_extent_block_csum_verify(struct inode *inode,
+ struct ext4_extent_header *eh)
+{
+ struct ext4_extent_tail *et;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ et = find_ext4_extent_tail(eh);
+ if (et->et_checksum != ext4_extent_block_csum(inode, eh))
+ return 0;
+ return 1;
+}
+
+static void ext4_extent_block_csum_set(struct inode *inode,
+ struct ext4_extent_header *eh)
+{
+ struct ext4_extent_tail *et;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ et = find_ext4_extent_tail(eh);
+ et->et_checksum = ext4_extent_block_csum(inode, eh);
+}
+
static int ext4_split_extent(handle_t *handle,
struct inode *inode,
struct ext4_ext_path *path,
@@ -101,6 +141,7 @@ static int ext4_ext_dirty(handle_t *handle, struct inode *inode,
{
int err;
if (path->p_bh) {
+ ext4_extent_block_csum_set(inode, ext_block_hdr(path->p_bh));
/* path points to block */
err = ext4_handle_dirty_metadata(handle, inode, path->p_bh);
} else {
@@ -382,6 +423,12 @@ static int __ext4_ext_check(const char *function, unsigned int line,
error_msg = "invalid extent entries";
goto corrupted;
}
+ /* Verify checksum on non-root extent tree nodes */
+ if (ext_depth(inode) != depth &&
+ !ext4_extent_block_csum_verify(inode, eh)) {
+ error_msg = "extent tree corrupted";
+ goto corrupted;
+ }
return 0;

corrupted:
@@ -922,6 +969,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
le16_add_cpu(&neh->eh_entries, m);
}

+ ext4_extent_block_csum_set(inode, neh);
set_buffer_uptodate(bh);
unlock_buffer(bh);

@@ -1000,6 +1048,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
sizeof(struct ext4_extent_idx) * m);
le16_add_cpu(&neh->eh_entries, m);
}
+ ext4_extent_block_csum_set(inode, neh);
set_buffer_uptodate(bh);
unlock_buffer(bh);

@@ -1098,6 +1147,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
else
neh->eh_max = cpu_to_le16(ext4_ext_space_block(inode, 0));
neh->eh_magic = EXT4_EXT_MAGIC;
+ ext4_extent_block_csum_set(inode, neh);
set_buffer_uptodate(bh);
unlock_buffer(bh);

@@ -2458,10 +2508,6 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);

- err = ext4_ext_dirty(handle, inode, path + depth);
- if (err)
- goto out;
-
/*
* If the extent was completely released,
* we need to remove it from the leaf
@@ -2483,6 +2529,10 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
le16_add_cpu(&eh->eh_entries, -1);
}

+ err = ext4_ext_dirty(handle, inode, path + depth);
+ if (err)
+ goto out;
+
ext_debug("new extent: %u:%u:%llu\n", block, num,
ext4_ext_pblock(ex));
ex--;

2011-10-08 07:55:09

by djwong

[permalink] [raw]
Subject: [PATCH 13/28] ext4: Create bitmap checksum helper functions

These helper functions will be used to calculate and verify the block and inode
bitmap checksums.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/bitmap.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)


diff --git a/fs/ext4/bitmap.c b/fs/ext4/bitmap.c
index fa3af81..43d9173 100644
--- a/fs/ext4/bitmap.c
+++ b/fs/ext4/bitmap.c
@@ -29,3 +29,15 @@ unsigned int ext4_count_free(struct buffer_head *map, unsigned int numchars)

#endif /* EXT4FS_DEBUG */

+static __u32 ext4_bitmap_csum(struct super_block *sb, ext4_group_t group,
+ struct buffer_head *bh, int sz)
+{
+ __u32 crc;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ group = cpu_to_le32(group);
+ crc = ext4_chksum(sbi, sbi->s_uuid_crc, (__u8 *)&group, sizeof(group));
+ crc = ext4_chksum(sbi, crc, (__u8 *)bh->b_data, sz);
+
+ return crc;
+}

2011-10-08 07:54:22

by djwong

[permalink] [raw]
Subject: [PATCH 06/28] ext4: Create a new BH_Verified flag to avoid unnecessary metadata validation

Create a new BH_Verified flag to indicate that we've verified all the data in a
buffer_head for correctness. This allows us to bypass expensive verification
steps when they are not necessary without missing them when they are.

v2: The later jbd2 metadata checksumming patches need a BH_Verified flag for
journal metadata, so put the flag in jbd2.h so that both drivers can share the
same bit.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/extents.c | 35 ++++++++++++++++++++++++++---------
include/linux/jbd2.h | 2 ++
2 files changed, 28 insertions(+), 9 deletions(-)


diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 57cf568..4ac4303 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -403,6 +403,26 @@ int ext4_ext_check_inode(struct inode *inode)
return ext4_ext_check(inode, ext_inode_hdr(inode), ext_depth(inode));
}

+static int __ext4_ext_check_block(const char *function, unsigned int line,
+ struct inode *inode,
+ struct ext4_extent_header *eh,
+ int depth,
+ struct buffer_head *bh)
+{
+ int ret;
+
+ if (buffer_verified(bh))
+ return 0;
+ ret = ext4_ext_check(inode, eh, depth);
+ if (ret)
+ return ret;
+ set_buffer_verified(bh);
+ return ret;
+}
+
+#define ext4_ext_check_block(inode, eh, depth, bh) \
+ __ext4_ext_check_block(__func__, __LINE__, inode, eh, depth, bh)
+
#ifdef EXT_DEBUG
static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
{
@@ -659,8 +679,6 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
i = depth;
/* walk through the tree */
while (i) {
- int need_to_validate = 0;
-
ext_debug("depth %d: num %d, max %d\n",
ppos, le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));

@@ -679,8 +697,6 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
put_bh(bh);
goto err;
}
- /* validate the extent entries */
- need_to_validate = 1;
}
eh = ext_block_hdr(bh);
ppos++;
@@ -694,7 +710,7 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
path[ppos].p_hdr = eh;
i--;

- if (need_to_validate && ext4_ext_check(inode, eh, i))
+ if (ext4_ext_check_block(inode, eh, i, bh))
goto err;
}

@@ -1350,7 +1366,8 @@ got_index:
return -EIO;
eh = ext_block_hdr(bh);
/* subtract from p_depth to get proper eh_depth */
- if (ext4_ext_check(inode, eh, path->p_depth - depth)) {
+ if (ext4_ext_check_block(inode, eh,
+ path->p_depth - depth, bh)) {
put_bh(bh);
return -EIO;
}
@@ -1363,7 +1380,7 @@ got_index:
if (bh == NULL)
return -EIO;
eh = ext_block_hdr(bh);
- if (ext4_ext_check(inode, eh, path->p_depth - depth)) {
+ if (ext4_ext_check_block(inode, eh, path->p_depth - depth, bh)) {
put_bh(bh);
return -EIO;
}
@@ -2591,8 +2608,8 @@ again:
err = -EIO;
break;
}
- if (ext4_ext_check(inode, ext_block_hdr(bh),
- depth - i - 1)) {
+ if (ext4_ext_check_block(inode, ext_block_hdr(bh),
+ depth - i - 1, bh)) {
err = -EIO;
break;
}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 38f307b..3284706 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -313,6 +313,7 @@ enum jbd_state_bits {
BH_State, /* Pins most journal_head state */
BH_JournalHead, /* Pins bh->b_private and jh->b_bh */
BH_Unshadow, /* Dummy bit, for BJ_Shadow wakeup filtering */
+ BH_Verified, /* metadata block has been verified ok */
BH_JBDPrivateStart, /* First bit available for private use by FS */
};

@@ -325,6 +326,7 @@ TAS_BUFFER_FNS(Revoked, revoked)
BUFFER_FNS(RevokeValid, revokevalid)
TAS_BUFFER_FNS(RevokeValid, revokevalid)
BUFFER_FNS(Freed, freed)
+BUFFER_FNS(Verified, verified)

static inline struct buffer_head *jh2bh(struct journal_head *jh)
{

2011-10-08 07:55:02

by djwong

[permalink] [raw]
Subject: [PATCH 12/28] ext4: Use i_generation in inode-related metadata checksums

Whenever we are calculating a checksum for a piece of metadata that is
associated with an inode, incorporate i_generation into that calculation so
that old metadata blocks cannot be re-associated after a delete/create cycle.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ialloc.c | 9 ++++++---
fs/ext4/inode.c | 9 ++++++---
fs/ext4/ioctl.c | 4 ++++
3 files changed, 16 insertions(+), 6 deletions(-)


diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 6e5876a..d4b59e9 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1031,11 +1031,14 @@ got:
/* Precompute second piece of crc */
if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
+ __u32 crc;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
__le32 inum = cpu_to_le32(inode->i_ino);
- ei->i_uuid_inum_crc = ext4_chksum(sbi, sbi->s_uuid_crc,
- (__u8 *)&inum,
- sizeof(inum));
+ __le32 gen = cpu_to_le32(inode->i_generation);
+ crc = ext4_chksum(sbi, sbi->s_uuid_crc, (__u8 *)&inum,
+ sizeof(inum));
+ ei->i_uuid_inum_crc = ext4_chksum(sbi, crc, (__u8 *)&gen,
+ sizeof(gen));
}

ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b00315d..593f3bf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3492,10 +3492,13 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ __u32 crc;
__le32 inum = cpu_to_le32(inode->i_ino);
- ei->i_uuid_inum_crc = ext4_chksum(sbi, sbi->s_uuid_crc,
- (__u8 *)&inum,
- sizeof(inum));
+ __le32 gen = raw_inode->i_generation;
+ crc = ext4_chksum(sbi, sbi->s_uuid_crc, (__u8 *)&inum,
+ sizeof(inum));
+ ei->i_uuid_inum_crc = ext4_chksum(sbi, crc, (__u8 *)&gen,
+ sizeof(gen));
}

if (!ext4_inode_csum_verify(inode, raw_inode, ei)) {
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index f18bfe3..fdf0b1e 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -149,6 +149,10 @@ flags_out:
if (!inode_owner_or_capable(inode))
return -EPERM;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return -ENOTTY;
+
err = mnt_want_write(filp->f_path.mnt);
if (err)
return err;

2011-10-08 07:55:22

by djwong

[permalink] [raw]
Subject: [PATCH 15/28] ext4: Calculate and verify block bitmap checksum

Compute and verify the checksum of the block bitmap; this checksum is stored in
the block group descriptor.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/balloc.c | 40 +++++++++++++++++++++------
fs/ext4/bitmap.c | 40 +++++++++++++++++++++++++++
fs/ext4/ext4.h | 10 +++++++
fs/ext4/ialloc.c | 4 +++
fs/ext4/mballoc.c | 78 +++++++++++++++++++++++++++++++++++++++++++++--------
5 files changed, 151 insertions(+), 21 deletions(-)


diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index f8224ad..dd96a7d 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -105,6 +105,9 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_free_inodes_set(sb, gdp, 0);
ext4_itable_unused_set(sb, gdp, 0);
memset(bh->b_data, 0xff, sb->s_blocksize);
+ ext4_block_bitmap_csum_set(sb, block_group, gdp, bh,
+ EXT4_BLOCKS_PER_GROUP(sb) /
+ 8);
return 0;
}
memset(bh->b_data, 0, sb->s_blocksize);
@@ -175,6 +178,10 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
*/
ext4_mark_bitmap_end(group_blocks, sb->s_blocksize * 8,
bh->b_data);
+ ext4_block_bitmap_csum_set(sb, block_group, gdp, bh,
+ EXT4_BLOCKS_PER_GROUP(sb) / 8);
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group,
+ gdp);
}
return free_blocks - ext4_group_used_meta_blocks(sb, block_group, gdp);
}
@@ -232,10 +239,10 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block *sb,
return desc;
}

-static int ext4_valid_block_bitmap(struct super_block *sb,
- struct ext4_group_desc *desc,
- unsigned int block_group,
- struct buffer_head *bh)
+int ext4_valid_block_bitmap(struct super_block *sb,
+ struct ext4_group_desc *desc,
+ unsigned int block_group,
+ struct buffer_head *bh)
{
ext4_grpblk_t offset;
ext4_grpblk_t next_zero_bit;
@@ -312,12 +319,12 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
}

if (bitmap_uptodate(bh))
- return bh;
+ goto verify;

lock_buffer(bh);
if (bitmap_uptodate(bh)) {
unlock_buffer(bh);
- return bh;
+ goto verify;
}
ext4_lock_group(sb, block_group);
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
@@ -336,7 +343,7 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
*/
set_bitmap_uptodate(bh);
unlock_buffer(bh);
- return bh;
+ goto verify;
}
/*
* submit the buffer_head for read. We can
@@ -353,11 +360,26 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
block_group, bitmap_blk);
return NULL;
}
- ext4_valid_block_bitmap(sb, desc, block_group, bh);
+
+verify:
+ if (buffer_verified(bh))
+ return bh;
/*
* file system mounted not to panic on error,
- * continue with corrupt bitmap
+ * -EIO with corrupt bitmap
*/
+ ext4_lock_group(sb, block_group);
+ if (!ext4_valid_block_bitmap(sb, desc, block_group, bh) ||
+ !ext4_block_bitmap_csum_verify(sb, block_group, desc, bh,
+ EXT4_BLOCKS_PER_GROUP(sb) / 8)) {
+ ext4_unlock_group(sb, block_group);
+ put_bh(bh);
+ ext4_error(sb, "Corrupt block bitmap - block_group = %u, "
+ "block_bitmap = %llu", block_group, bitmap_blk);
+ return NULL;
+ }
+ ext4_unlock_group(sb, block_group);
+ set_buffer_verified(bh);
return bh;
}

diff --git a/fs/ext4/bitmap.c b/fs/ext4/bitmap.c
index e9342f6..00ebc7a 100644
--- a/fs/ext4/bitmap.c
+++ b/fs/ext4/bitmap.c
@@ -81,3 +81,43 @@ void ext4_inode_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
if (sbi->s_desc_size >= EXT4_BG_INODE_BITMAP_CSUM_HI_LOCATION)
gdp->bg_inode_bitmap_csum_hi = cpu_to_le16(crc >> 16);
}
+
+int ext4_block_bitmap_csum_verify(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz)
+{
+ __u32 hi;
+ __u32 provided, calculated;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ provided = le16_to_cpu(gdp->bg_block_bitmap_csum_lo);
+ calculated = ext4_bitmap_csum(sb, group, bh, sz);
+ if (sbi->s_desc_size >= EXT4_BG_BLOCK_BITMAP_CSUM_HI_LOCATION) {
+ hi = le16_to_cpu(gdp->bg_block_bitmap_csum_hi);
+ provided |= (hi << 16);
+ } else
+ calculated &= 0xFFFF;
+
+ return provided == calculated;
+}
+
+void ext4_block_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz)
+{
+ __u32 crc;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ crc = ext4_bitmap_csum(sb, group, bh, sz);
+ gdp->bg_block_bitmap_csum_lo = cpu_to_le16(crc & 0xFFFF);
+ if (sbi->s_desc_size >= EXT4_BG_BLOCK_BITMAP_CSUM_HI_LOCATION)
+ gdp->bg_block_bitmap_csum_hi = cpu_to_le16(crc >> 16);
+}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c98bc9f..7cd3541 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1776,8 +1776,18 @@ void ext4_inode_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
int ext4_inode_bitmap_csum_verify(struct super_block *sb, ext4_group_t group,
struct ext4_group_desc *gdp,
struct buffer_head *bh, int sz);
+void ext4_block_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz);
+int ext4_block_bitmap_csum_verify(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz);

/* balloc.c */
+extern int ext4_valid_block_bitmap(struct super_block *sb,
+ struct ext4_group_desc *desc,
+ unsigned int block_group,
+ struct buffer_head *bh);
extern unsigned int ext4_block_group(struct super_block *sb,
ext4_fsblk_t blocknr);
extern ext4_grpblk_t ext4_block_group_offset(struct super_block *sb,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 6fb00e7..77ae3df 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -981,6 +981,10 @@ got:
free = ext4_free_blocks_after_init(sb, group, gdp);
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
ext4_free_blks_set(sb, gdp, free);
+ ext4_block_bitmap_csum_set(sb, group, gdp,
+ block_bitmap_bh,
+ EXT4_BLOCKS_PER_GROUP(sb) /
+ 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
gdp);
}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 17a5a57..3d6300a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -753,6 +753,44 @@ void ext4_mb_generate_buddy(struct super_block *sb,
spin_unlock(&EXT4_SB(sb)->s_bal_lock);
}

+struct ext4_csum_data {
+ struct super_block *cd_sb;
+ ext4_group_t cd_group;
+};
+
+static void ext4_end_buffer_read_sync(struct buffer_head *bh, int uptodate)
+{
+ struct ext4_group_desc *desc;
+ struct super_block *sb =
+ ((struct ext4_csum_data *)bh->b_private)->cd_sb;
+ ext4_group_t group = ((struct ext4_csum_data *)bh->b_private)->cd_group;
+
+ if (!uptodate)
+ goto out;
+
+ desc = ext4_get_group_desc(sb, group, NULL);
+ if (!desc)
+ goto out;
+
+ if (buffer_verified(bh))
+ goto out;
+
+ ext4_lock_group(sb, group);
+ if (!ext4_valid_block_bitmap(sb, desc, group, bh) ||
+ !ext4_block_bitmap_csum_verify(sb, group, desc, bh,
+ EXT4_BLOCKS_PER_GROUP(sb) / 8)) {
+ ext4_unlock_group(sb, group);
+ ext4_error(sb, "Corrupt block bitmap, group = %u", group);
+ goto out;
+ }
+ ext4_unlock_group(sb, group);
+ set_buffer_verified(bh);
+
+out:
+ bh->b_private = NULL;
+ end_buffer_read_sync(bh, uptodate);
+}
+
/* The buddy information is attached the buddy cache inode
* for convenience. The information regarding each group
* is loaded via ext4_mb_load_buddy. The information involve
@@ -785,11 +823,12 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
int first_block;
struct super_block *sb;
struct buffer_head *bhs;
- struct buffer_head **bh;
+ struct buffer_head **bh = NULL;
struct inode *inode;
char *data;
char *bitmap;
struct ext4_group_info *grinfo;
+ struct ext4_csum_data *csd;

mb_debug(1, "init page %lu\n", page->index);

@@ -803,6 +842,11 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
if (groups_per_page == 0)
groups_per_page = 1;

+ csd = kzalloc(sizeof(struct ext4_csum_data) * groups_per_page,
+ GFP_NOFS);
+ if (csd == NULL)
+ goto out;
+
/* allocate buffer_heads to read bitmaps */
if (groups_per_page > 1) {
err = -ENOMEM;
@@ -880,22 +924,25 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
* get set with buffer lock held.
*/
set_bitmap_uptodate(bh[i]);
- bh[i]->b_end_io = end_buffer_read_sync;
+ csd[i].cd_sb = sb;
+ csd[i].cd_group = first_group + i;
+ bh[i]->b_private = csd + i;
+ bh[i]->b_end_io = ext4_end_buffer_read_sync;
submit_bh(READ, bh[i]);
mb_debug(1, "read bitmap for group %u\n", first_group + i);
}

- /* wait for I/O completion */
- for (i = 0; i < groups_per_page; i++)
- if (bh[i])
- wait_on_buffer(bh[i]);
-
- err = -EIO;
- for (i = 0; i < groups_per_page; i++)
- if (bh[i] && !buffer_uptodate(bh[i]))
- goto out;
-
+ /* Wait for I/O completion and checksum verification */
err = 0;
+ for (i = 0; i < groups_per_page; i++) {
+ if (bh[i] == NULL)
+ continue;
+ wait_on_buffer(bh[i]);
+ if (!buffer_uptodate(bh[i]) ||
+ !buffer_verified(bh[i]))
+ err = -EIO;
+ }
+
first_block = page->index * blocks_per_page;
for (i = 0; i < blocks_per_page; i++) {
int group;
@@ -972,6 +1019,7 @@ out:
if (bh != &bhs)
kfree(bh);
}
+ kfree(csd);
return err;
}

@@ -2829,6 +2877,8 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
}
len = ext4_free_blks_count(sb, gdp) - ac->ac_b_ex.fe_len;
ext4_free_blks_set(sb, gdp, len);
+ ext4_block_bitmap_csum_set(sb, ac->ac_b_ex.fe_group, gdp, bitmap_bh,
+ EXT4_BLOCKS_PER_GROUP(sb) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);

ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
@@ -4638,6 +4688,8 @@ do_more:

ret = ext4_free_blks_count(sb, gdp) + count;
ext4_free_blks_set(sb, gdp, ret);
+ ext4_block_bitmap_csum_set(sb, block_group, gdp, bitmap_bh,
+ EXT4_BLOCKS_PER_GROUP(sb) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeblocks_counter, count);
@@ -4780,6 +4832,8 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
mb_free_blocks(NULL, &e4b, bit, count);
blk_free_count = blocks_freed + ext4_free_blks_count(sb, desc);
ext4_free_blks_set(sb, desc, blk_free_count);
+ ext4_block_bitmap_csum_set(sb, block_group, desc, bitmap_bh,
+ EXT4_BLOCKS_PER_GROUP(sb) / 8);
desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);

2011-10-08 07:54:56

by djwong

[permalink] [raw]
Subject: [PATCH 11/28] ext4: Calculate and verify inode checksums

This patch introduces to ext4 the ability to calculate and verify inode
checksums. This requires the use of a new ro compatibility flag and some
accompanying e2fsprogs patches to provide the relevant features in tune2fs and
e2fsck.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 10 ++++-
fs/ext4/ialloc.c | 10 +++++
fs/ext4/inode.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 117 insertions(+), 11 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c99e44c..227210a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -611,7 +611,8 @@ struct ext4_inode {
__le16 l_i_file_acl_high;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
- __u32 l_i_reserved2;
+ __le16 l_i_checksum_lo;/* crc32c(uuid+inum+inode) LE */
+ __le16 l_i_reserved;
} linux2;
struct {
__le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
@@ -627,7 +628,7 @@ struct ext4_inode {
} masix2;
} osd2; /* OS dependent 2 */
__le16 i_extra_isize;
- __le16 i_pad1;
+ __le16 i_checksum_hi; /* crc32c(uuid+inum+inode) BE */
__le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */
__le32 i_mtime_extra; /* extra Modification time(nsec << 2 | epoch) */
__le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */
@@ -729,7 +730,7 @@ do { \
#define i_gid_low i_gid
#define i_uid_high osd2.linux2.l_i_uid_high
#define i_gid_high osd2.linux2.l_i_gid_high
-#define i_reserved2 osd2.linux2.l_i_reserved2
+#define i_checksum_lo osd2.linux2.l_i_checksum_lo

#elif defined(__GNU__)

@@ -868,6 +869,9 @@ struct ext4_inode_info {
*/
tid_t i_sync_tid;
tid_t i_datasync_tid;
+
+ /* crc32c(uuid+inum) */
+ __u32 i_uuid_inum_crc;
};

/*
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 9c63f27..6e5876a 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1028,6 +1028,16 @@ got:
inode->i_generation = sbi->s_next_generation++;
spin_unlock(&sbi->s_next_gen_lock);

+ /* Precompute second piece of crc */
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ __le32 inum = cpu_to_le32(inode->i_ino);
+ ei->i_uuid_inum_crc = ext4_chksum(sbi, sbi->s_uuid_crc,
+ (__u8 *)&inum,
+ sizeof(inum));
+ }
+
ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
ext4_set_inode_state(inode, EXT4_STATE_NEW);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6e64e0b..b00315d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -49,6 +49,73 @@

#define MPAGE_DA_EXTENT_TAIL 0x01

+static __u32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ __u16 crc_lo;
+ __u16 crc_hi = 0;
+ __u32 crc;
+
+ crc_lo = raw->i_checksum_lo;
+ raw->i_checksum_lo = 0;
+ if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
+ EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi)) {
+ crc_hi = raw->i_checksum_hi;
+ raw->i_checksum_hi = 0;
+ }
+
+ crc = ext4_chksum(sbi, ei->i_uuid_inum_crc, (__u8 *)raw,
+ EXT4_INODE_SIZE(inode->i_sb));
+
+ raw->i_checksum_lo = crc_lo;
+ if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
+ EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
+ raw->i_checksum_hi = crc_hi;
+
+ return crc;
+}
+
+static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei)
+{
+ __u32 provided, calculated;
+
+ if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
+ cpu_to_le32(EXT4_OS_LINUX) ||
+ !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ provided = le16_to_cpu(raw->i_checksum_lo);
+ calculated = ext4_inode_csum(inode, raw, ei);
+ if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
+ EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
+ provided |= ((__u32)le16_to_cpu(raw->i_checksum_hi)) << 16;
+ else
+ calculated &= 0xFFFF;
+
+ return provided == calculated;
+}
+
+static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei)
+{
+ __u32 crc;
+
+ if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
+ cpu_to_le32(EXT4_OS_LINUX) ||
+ !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ crc = ext4_inode_csum(inode, raw, ei);
+ raw->i_checksum_lo = cpu_to_le16(crc & 0xFFFF);
+ if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
+ EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
+ raw->i_checksum_hi = cpu_to_le16(crc >> 16);
+}
+
static inline int ext4_begin_ordered_truncate(struct inode *inode,
loff_t new_size)
{
@@ -3407,6 +3474,36 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
if (ret < 0)
goto bad_inode;
raw_inode = ext4_raw_inode(&iloc);
+
+ if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
+ ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
+ if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
+ EXT4_INODE_SIZE(inode->i_sb)) {
+ EXT4_ERROR_INODE(inode, "bad extra_isize (%u != %u)",
+ EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize,
+ EXT4_INODE_SIZE(inode->i_sb));
+ ret = -EIO;
+ goto bad_inode;
+ }
+ } else
+ ei->i_extra_isize = 0;
+
+ /* Precompute second piece of crc */
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ __le32 inum = cpu_to_le32(inode->i_ino);
+ ei->i_uuid_inum_crc = ext4_chksum(sbi, sbi->s_uuid_crc,
+ (__u8 *)&inum,
+ sizeof(inum));
+ }
+
+ if (!ext4_inode_csum_verify(inode, raw_inode, ei)) {
+ EXT4_ERROR_INODE(inode, "checksum invalid");
+ ret = -EIO;
+ goto bad_inode;
+ }
+
inode->i_mode = le16_to_cpu(raw_inode->i_mode);
inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
@@ -3484,12 +3581,6 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
}

if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE) {
- ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
- if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
- EXT4_INODE_SIZE(inode->i_sb)) {
- ret = -EIO;
- goto bad_inode;
- }
if (ei->i_extra_isize == 0) {
/* The extra space is currently unused. Use it. */
ei->i_extra_isize = sizeof(struct ext4_inode) -
@@ -3501,8 +3592,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
if (*magic == cpu_to_le32(EXT4_XATTR_MAGIC))
ext4_set_inode_state(inode, EXT4_STATE_XATTR);
}
- } else
- ei->i_extra_isize = 0;
+ }

EXT4_INODE_GET_XTIME(i_ctime, inode, raw_inode);
EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode);
@@ -3727,6 +3817,8 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
}

+ ext4_inode_csum_set(inode, raw_inode, ei);
+
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
if (!err)


2011-10-08 07:55:15

by djwong

[permalink] [raw]
Subject: [PATCH 14/28] ext4: Calculate and verify checksums for inode bitmaps

Compute and verify the checksum of the inode bitmap; the checkum is stored in
the block group descriptor.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/bitmap.c | 40 ++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 22 ++++++++++++++++++++--
fs/ext4/ialloc.c | 30 +++++++++++++++++++++++++++---
3 files changed, 87 insertions(+), 5 deletions(-)


diff --git a/fs/ext4/bitmap.c b/fs/ext4/bitmap.c
index 43d9173..e9342f6 100644
--- a/fs/ext4/bitmap.c
+++ b/fs/ext4/bitmap.c
@@ -41,3 +41,43 @@ static __u32 ext4_bitmap_csum(struct super_block *sb, ext4_group_t group,

return crc;
}
+
+int ext4_inode_bitmap_csum_verify(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz)
+{
+ __u32 hi;
+ __u32 provided, calculated;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ provided = le16_to_cpu(gdp->bg_inode_bitmap_csum_lo);
+ calculated = ext4_bitmap_csum(sb, group, bh, sz);
+ if (sbi->s_desc_size >= EXT4_BG_INODE_BITMAP_CSUM_HI_LOCATION) {
+ hi = le16_to_cpu(gdp->bg_inode_bitmap_csum_hi);
+ provided |= (hi << 16);
+ } else
+ calculated &= 0xFFFF;
+
+ return provided == calculated;
+}
+
+void ext4_inode_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz)
+{
+ __u32 crc;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ crc = ext4_bitmap_csum(sb, group, bh, sz);
+ gdp->bg_inode_bitmap_csum_lo = cpu_to_le16(crc & 0xFFFF);
+ if (sbi->s_desc_size >= EXT4_BG_INODE_BITMAP_CSUM_HI_LOCATION)
+ gdp->bg_inode_bitmap_csum_hi = cpu_to_le16(crc >> 16);
+}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 227210a..c98bc9f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -271,7 +271,9 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_lo;/* Free inodes count */
__le16 bg_used_dirs_count_lo; /* Directories count */
__le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
- __u32 bg_reserved[2]; /* Likely block/inode bitmap checksum */
+ __le32 bg_exclude_bitmap_lo; /* Exclude bitmap for snapshots */
+ __le16 bg_block_bitmap_csum_lo;/* crc32c(s_uuid+grp_num+bbitmap) LE */
+ __le16 bg_inode_bitmap_csum_lo;/* crc32c(s_uuid+grp_num+ibitmap) LE */
__le16 bg_itable_unused_lo; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
@@ -281,9 +283,19 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
__le16 bg_used_dirs_count_hi; /* Directories count MSB */
__le16 bg_itable_unused_hi; /* Unused inodes count MSB */
- __u32 bg_reserved2[3];
+ __le32 bg_exclude_bitmap_hi; /* Exclude bitmap block MSB */
+ __le16 bg_block_bitmap_csum_hi;/* crc32c(s_uuid+grp_num+bbitmap) BE */
+ __le16 bg_inode_bitmap_csum_hi;/* crc32c(s_uuid+grp_num+ibitmap) BE */
+ __u32 bg_reserved;
};

+#define EXT4_BG_INODE_BITMAP_CSUM_HI_LOCATION \
+ (offsetof(struct ext4_group_desc, bg_inode_bitmap_csum_hi) + \
+ sizeof(__le16))
+#define EXT4_BG_BLOCK_BITMAP_CSUM_HI_LOCATION \
+ (offsetof(struct ext4_group_desc, bg_block_bitmap_csum_hi) + \
+ sizeof(__le16))
+
/*
* Structure of a flex block group info
*/
@@ -1758,6 +1770,12 @@ struct mmpd_data {

/* bitmap.c */
extern unsigned int ext4_count_free(struct buffer_head *, unsigned);
+void ext4_inode_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz);
+int ext4_inode_bitmap_csum_verify(struct super_block *sb, ext4_group_t group,
+ struct ext4_group_desc *gdp,
+ struct buffer_head *bh, int sz);

/* balloc.c */
extern unsigned int ext4_block_group(struct super_block *sb,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index d4b59e9..6fb00e7 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -82,12 +82,17 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
ext4_free_inodes_set(sb, gdp, 0);
ext4_itable_unused_set(sb, gdp, 0);
memset(bh->b_data, 0xff, sb->s_blocksize);
+ ext4_inode_bitmap_csum_set(sb, block_group, gdp, bh,
+ EXT4_INODES_PER_GROUP(sb) / 8);
return 0;
}

memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8,
bh->b_data);
+ ext4_inode_bitmap_csum_set(sb, block_group, gdp, bh,
+ EXT4_INODES_PER_GROUP(sb) / 8);
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);

return EXT4_INODES_PER_GROUP(sb);
}
@@ -118,12 +123,12 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
return NULL;
}
if (bitmap_uptodate(bh))
- return bh;
+ goto verify;

lock_buffer(bh);
if (bitmap_uptodate(bh)) {
unlock_buffer(bh);
- return bh;
+ goto verify;
}

ext4_lock_group(sb, block_group);
@@ -131,6 +136,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
ext4_init_inode_bitmap(sb, bh, block_group, desc);
set_bitmap_uptodate(bh);
set_buffer_uptodate(bh);
+ set_buffer_verified(bh);
ext4_unlock_group(sb, block_group);
unlock_buffer(bh);
return bh;
@@ -144,7 +150,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
*/
set_bitmap_uptodate(bh);
unlock_buffer(bh);
- return bh;
+ goto verify;
}
/*
* submit the buffer_head for read. We can
@@ -161,6 +167,20 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
block_group, bitmap_blk);
return NULL;
}
+
+verify:
+ ext4_lock_group(sb, block_group);
+ if (!buffer_verified(bh) &&
+ !ext4_inode_bitmap_csum_verify(sb, block_group, desc, bh,
+ EXT4_INODES_PER_GROUP(sb) / 8)) {
+ ext4_unlock_group(sb, block_group);
+ put_bh(bh);
+ ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
+ "inode_bitmap = %llu", block_group, bitmap_blk);
+ return NULL;
+ }
+ ext4_unlock_group(sb, block_group);
+ set_buffer_verified(bh);
return bh;
}

@@ -265,6 +285,8 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
ext4_used_dirs_set(sb, gdp, count);
percpu_counter_dec(&sbi->s_dirs_counter);
}
+ ext4_inode_bitmap_csum_set(sb, block_group, gdp, bitmap_bh,
+ EXT4_INODES_PER_GROUP(sb) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);

@@ -784,6 +806,8 @@ static int ext4_claim_inode(struct super_block *sb,
atomic_inc(&sbi->s_flex_groups[f].used_dirs);
}
}
+ ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh,
+ EXT4_INODES_PER_GROUP(sb) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
err_ret:
ext4_unlock_group(sb, group);


2011-10-08 07:55:36

by djwong

[permalink] [raw]
Subject: [PATCH 17/28] ext4: Calculate and verify checksums for htree nodes

Calculate and verify the checksum for directory index tree (htree) node blocks.
The checksum is stored in the last 4 bytes of the htree block and requires the
dx_entry array to stop 1 dx_entry short of the end of the block.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 164 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 160 insertions(+), 4 deletions(-)


diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index d0158f2..daa0c18 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -145,6 +145,15 @@ struct dx_map_entry
u16 size;
};

+/*
+ * This goes at the end of each htree block. If you want to use the
+ * reserved field, you'll have to update the checksum code to include it.
+ */
+struct dx_tail {
+ u32 reserved;
+ u32 checksum; /* crc32c(uuid+inum+dirblock) */
+};
+
static inline ext4_lblk_t dx_get_block(struct dx_entry *entry);
static void dx_set_block(struct dx_entry *entry, ext4_lblk_t value);
static inline unsigned dx_get_hash(struct dx_entry *entry);
@@ -180,6 +189,116 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
struct inode *inode);

+/* checksumming functions */
+static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
+ struct ext4_dir_entry *dirent,
+ int *offset)
+{
+ struct ext4_dir_entry *dp;
+ struct dx_root_info *root;
+ int count_offset;
+
+ if (le16_to_cpu(dirent->rec_len) == EXT4_BLOCK_SIZE(inode->i_sb))
+ count_offset = 8;
+ else if (le16_to_cpu(dirent->rec_len) == 12) {
+ dp = (struct ext4_dir_entry *)(((void *)dirent) + 12);
+ if (le16_to_cpu(dp->rec_len) !=
+ EXT4_BLOCK_SIZE(inode->i_sb) - 12)
+ return NULL;
+ root = (struct dx_root_info *)(((void *)dp + 12));
+ if (root->reserved_zero ||
+ root->info_length != sizeof(struct dx_root_info))
+ return NULL;
+ count_offset = 32;
+ } else
+ return NULL;
+
+ if (offset)
+ *offset = count_offset;
+ return (struct dx_countlimit *)(((void *)dirent) + count_offset);
+}
+
+static __le32 ext4_dx_csum(struct inode *inode, struct ext4_dir_entry *dirent,
+ int count_offset, int count)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ __u32 crc;
+ int size;
+
+ size = count_offset + (count * sizeof(struct dx_entry));
+
+ crc = ext4_chksum(sbi, ei->i_uuid_inum_crc, (__u8 *)dirent, size);
+ return cpu_to_le32(crc);
+}
+
+static int ext4_dx_csum_verify(struct inode *inode,
+ struct ext4_dir_entry *dirent)
+{
+ struct dx_countlimit *c;
+ struct dx_tail *t;
+ int count_offset, limit, count;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ c = get_dx_countlimit(inode, dirent, &count_offset);
+ if (!c) {
+ EXT4_ERROR_INODE(inode, "dir seems corrupt? Run e2fsck -D.");
+ return 1;
+ }
+ limit = le16_to_cpu(c->limit);
+ count = le16_to_cpu(c->count);
+ if (count_offset + (limit * sizeof(struct dx_entry)) >
+ EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail)) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space for "
+ "tree checksum found. Run e2fsck -D.");
+ return 1;
+ }
+ t = (struct dx_tail *)(((struct dx_entry *)c) + limit);
+
+ if (t->checksum != ext4_dx_csum(inode, dirent, count_offset, count))
+ return 0;
+ return 1;
+}
+
+static void ext4_dx_csum_set(struct inode *inode, struct ext4_dir_entry *dirent)
+{
+ struct dx_countlimit *c;
+ struct dx_tail *t;
+ int count_offset, limit, count;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ c = get_dx_countlimit(inode, dirent, &count_offset);
+ if (!c) {
+ EXT4_ERROR_INODE(inode, "dir seems corrupt? Run e2fsck -D.");
+ return;
+ }
+ limit = le16_to_cpu(c->limit);
+ count = le16_to_cpu(c->count);
+ if (count_offset + (limit * sizeof(struct dx_entry)) >
+ EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail)) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space for "
+ "tree checksum. Run e2fsck -D.");
+ return;
+ }
+ t = (struct dx_tail *)(((struct dx_entry *)c) + limit);
+
+ t->checksum = ext4_dx_csum(inode, dirent, count_offset, count);
+}
+
+static inline int ext4_handle_dirty_dx_node(handle_t *handle,
+ struct inode *inode,
+ struct buffer_head *bh)
+{
+ ext4_dx_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
+ return ext4_handle_dirty_metadata(handle, inode, bh);
+}
+
/*
* p is at least 6 bytes before the end of page
*/
@@ -239,12 +358,20 @@ static inline unsigned dx_root_limit(struct inode *dir, unsigned infosize)
{
unsigned entry_space = dir->i_sb->s_blocksize - EXT4_DIR_REC_LEN(1) -
EXT4_DIR_REC_LEN(2) - infosize;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ entry_space -= sizeof(struct dx_tail);
return entry_space / sizeof(struct dx_entry);
}

static inline unsigned dx_node_limit(struct inode *dir)
{
unsigned entry_space = dir->i_sb->s_blocksize - EXT4_DIR_REC_LEN(0);
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ entry_space -= sizeof(struct dx_tail);
return entry_space / sizeof(struct dx_entry);
}

@@ -390,6 +517,15 @@ dx_probe(const struct qstr *d_name, struct inode *dir,
goto fail;
}

+ if (!buffer_verified(bh) &&
+ !ext4_dx_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data)) {
+ ext4_warning(dir->i_sb, "Root failed checksum");
+ brelse(bh);
+ *err = ERR_BAD_DX_DIR;
+ goto fail;
+ }
+ set_buffer_verified(bh);
+
entries = (struct dx_entry *) (((char *)&root->info) +
root->info.info_length);

@@ -450,6 +586,17 @@ dx_probe(const struct qstr *d_name, struct inode *dir,
if (!(bh = ext4_bread (NULL,dir, dx_get_block(at), 0, err)))
goto fail2;
at = entries = ((struct dx_node *) bh->b_data)->entries;
+
+ if (!buffer_verified(bh) &&
+ !ext4_dx_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ ext4_warning(dir->i_sb, "Node failed checksum");
+ brelse(bh);
+ *err = ERR_BAD_DX_DIR;
+ goto fail;
+ }
+ set_buffer_verified(bh);
+
if (dx_get_limit(entries) != dx_node_limit (dir)) {
ext4_warning(dir->i_sb,
"dx entry: limit != node limit");
@@ -549,6 +696,15 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
if (!(bh = ext4_bread(NULL, dir, dx_get_block(p->at),
0, &err)))
return err; /* Failure */
+
+ if (!buffer_verified(bh) &&
+ !ext4_dx_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ ext4_warning(dir->i_sb, "Node failed checksum");
+ return -EIO;
+ }
+ set_buffer_verified(bh);
+
p++;
brelse(p->bh);
p->bh = bh;
@@ -1224,7 +1380,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
err = ext4_handle_dirty_metadata(handle, dir, bh2);
if (err)
goto journal_error;
- err = ext4_handle_dirty_metadata(handle, dir, frame->bh);
+ err = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
if (err)
goto journal_error;
brelse(bh2);
@@ -1411,7 +1567,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
frame->bh = bh;
bh = bh2;

- ext4_handle_dirty_metadata(handle, dir, frame->bh);
+ ext4_handle_dirty_dx_node(handle, dir, frame->bh);
ext4_handle_dirty_metadata(handle, dir, bh);

de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
@@ -1586,7 +1742,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
dxtrace(dx_show_index("node", frames[1].entries));
dxtrace(dx_show_index("node",
((struct dx_node *) bh2->b_data)->entries));
- err = ext4_handle_dirty_metadata(handle, dir, bh2);
+ err = ext4_handle_dirty_dx_node(handle, dir, bh2);
if (err)
goto journal_error;
brelse (bh2);
@@ -1612,7 +1768,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
if (err)
goto journal_error;
}
- err = ext4_handle_dirty_metadata(handle, dir, frames[0].bh);
+ err = ext4_handle_dirty_dx_node(handle, dir, frames[0].bh);
if (err) {
ext4_std_error(inode->i_sb, err);
goto cleanup;

2011-10-08 07:55:43

by djwong

[permalink] [raw]
Subject: [PATCH 18/28] ext4: Calculate and verify checksums of directory leaf blocks

Calculate and verify the checksums for directory leaf blocks (i.e. blocks that
only contain actual directory entries). The checksum lives in what looks to be
an unused directory entry with a 0 name_len at the end of the block. This
scheme is not used for internal htree nodes because the mechanism in place
there only costs one dx_entry, whereas the "empty" directory entry would cost
two dx_entries.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/dir.c | 12 +++
fs/ext4/ext4.h | 16 +++
fs/ext4/namei.c | 260 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 273 insertions(+), 15 deletions(-)


diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 164c560..bc40c9e 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -180,6 +180,18 @@ static int ext4_readdir(struct file *filp,
continue;
}

+ /* Check the checksum */
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(inode,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_FILE(filp, 0, "directory fails checksum "
+ "at offset %llu",
+ (unsigned long long)filp->f_pos);
+ filp->f_pos += sb->s_blocksize - offset;
+ continue;
+ }
+ set_buffer_verified(bh);
+
revalidate:
/* If the dir block has changed since the last call to
* readdir(2), then we might be pointing to an invalid
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 7cd3541..48e116f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1498,6 +1498,18 @@ struct ext4_dir_entry_2 {
};

/*
+ * This is a bogus directory entry at the end of each leaf block that
+ * records checksums.
+ */
+struct ext4_dir_entry_tail {
+ __le32 reserved_zero1; /* Pretend to be unused */
+ __le16 rec_len; /* 12 */
+ __u8 reserved_zero2; /* Zero name length */
+ __u8 reserved_ft; /* 0xDE, fake file type */
+ __le32 checksum; /* crc32c(uuid+inum+dirblock) */
+};
+
+/*
* Ext4 directory file types. Only the low 3 bits are used. The
* other bits are reserved for now.
*/
@@ -1512,6 +1524,8 @@ struct ext4_dir_entry_2 {

#define EXT4_FT_MAX 8

+#define EXT4_FT_DIR_CSUM 0xDE
+
/*
* EXT4_DIR_PAD defines the directory entries boundaries
*
@@ -1927,6 +1941,8 @@ extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
extern int ext4_ext_migrate(struct inode *);

/* namei.c */
+extern int ext4_dirent_csum_verify(struct inode *inode,
+ struct ext4_dir_entry *dirent);
extern int ext4_orphan_add(handle_t *, struct inode *);
extern int ext4_orphan_del(handle_t *, struct inode *);
extern int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index daa0c18..818779d 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -190,6 +190,115 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
struct inode *inode);

/* checksumming functions */
+#define EXT4_DIRENT_TAIL(block, blocksize) \
+ ((struct ext4_dir_entry_tail *)(((void *)(block)) + \
+ ((blocksize) - \
+ sizeof(struct ext4_dir_entry_tail))))
+
+static void initialize_dirent_tail(struct ext4_dir_entry_tail *t,
+ unsigned int blocksize)
+{
+ memset(t, 0, sizeof(struct ext4_dir_entry_tail));
+ t->rec_len = ext4_rec_len_to_disk(sizeof(struct ext4_dir_entry_tail),
+ blocksize);
+ t->reserved_ft = EXT4_FT_DIR_CSUM;
+}
+
+/* Walk through a dirent block to find a checksum "dirent" at the tail */
+static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
+ struct ext4_dir_entry *de)
+{
+ struct ext4_dir_entry_tail *t;
+
+#ifdef PARANOID
+ struct ext4_dir_entry *d, *top;
+
+ d = de;
+ top = (struct ext4_dir_entry *)(((void *)de) +
+ (EXT4_BLOCK_SIZE(inode->i_sb) -
+ sizeof(struct ext4_dir_entry_tail)));
+ while (d < top && d->rec_len)
+ d = (struct ext4_dir_entry *)(((void *)d) +
+ le16_to_cpu(d->rec_len));
+
+ if (d != top)
+ return NULL;
+
+ t = (struct ext4_dir_entry_tail *)d;
+#else
+ t = EXT4_DIRENT_TAIL(de, EXT4_BLOCK_SIZE(inode->i_sb));
+#endif
+
+ if (t->reserved_zero1 ||
+ le16_to_cpu(t->rec_len) != sizeof(struct ext4_dir_entry_tail) ||
+ t->reserved_zero2 ||
+ t->reserved_ft != EXT4_FT_DIR_CSUM)
+ return NULL;
+
+ return t;
+}
+
+static __le32 ext4_dirent_csum(struct inode *inode,
+ struct ext4_dir_entry *dirent, int size)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ __u32 crc;
+
+ crc = ext4_chksum(sbi, ei->i_uuid_inum_crc, (__u8 *)dirent, size);
+ return cpu_to_le32(crc);
+}
+
+int ext4_dirent_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent)
+{
+ struct ext4_dir_entry_tail *t;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ t = get_dirent_tail(inode, dirent);
+ if (!t) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
+ "leaf for checksum. Please run e2fsck -D.");
+ return 0;
+ }
+
+ if (t->checksum != ext4_dirent_csum(inode, dirent,
+ (void *)t - (void *)dirent))
+ return 0;
+
+ return 1;
+}
+
+static void ext4_dirent_csum_set(struct inode *inode,
+ struct ext4_dir_entry *dirent)
+{
+ struct ext4_dir_entry_tail *t;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ t = get_dirent_tail(inode, dirent);
+ if (!t) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
+ "leaf for checksum. Please run e2fsck -D.");
+ return;
+ }
+
+ t->checksum = ext4_dirent_csum(inode, dirent,
+ (void *)t - (void *)dirent);
+}
+
+static inline int ext4_handle_dirty_dirent_node(handle_t *handle,
+ struct inode *inode,
+ struct buffer_head *bh)
+{
+ ext4_dirent_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
+ return ext4_handle_dirty_metadata(handle, inode, bh);
+}
+
static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
struct ext4_dir_entry *dirent,
int *offset)
@@ -733,6 +842,11 @@ static int htree_dirblock_to_tree(struct file *dir_file,
if (!(bh = ext4_bread (NULL, dir, block, 0, &err)))
return err;

+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
+ return -EIO;
+ set_buffer_verified(bh);
+
de = (struct ext4_dir_entry_2 *) bh->b_data;
top = (struct ext4_dir_entry_2 *) ((char *) de +
dir->i_sb->s_blocksize -
@@ -1092,6 +1206,15 @@ restart:
brelse(bh);
goto next;
}
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(dir, "checksumming directory "
+ "block %lu", (unsigned long)block);
+ brelse(bh);
+ goto next;
+ }
+ set_buffer_verified(bh);
i = search_dirblock(bh, dir, d_name,
block << EXT4_BLOCK_SIZE_BITS(sb), res_dir);
if (i == 1) {
@@ -1143,6 +1266,16 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir, const struct q
if (!(bh = ext4_bread(NULL, dir, block, 0, err)))
goto errout;

+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(dir, "checksumming directory "
+ "block %lu", (unsigned long)block);
+ brelse(bh);
+ *err = -EIO;
+ goto errout;
+ }
+ set_buffer_verified(bh);
retval = search_dirblock(bh, dir, d_name,
block << EXT4_BLOCK_SIZE_BITS(sb),
res_dir);
@@ -1315,8 +1448,14 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
char *data1 = (*bh)->b_data, *data2;
unsigned split, move, size;
struct ext4_dir_entry_2 *de = NULL, *de2;
+ struct ext4_dir_entry_tail *t;
+ int csum_size = 0;
int err = 0, i;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
+
bh2 = ext4_append (handle, dir, &newblock, &err);
if (!(bh2)) {
brelse(*bh);
@@ -1363,10 +1502,20 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split, blocksize);
de = dx_pack_dirents(data1, blocksize);
- de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
+ de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
+ (char *) de,
blocksize);
- de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2,
+ de2->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) -
+ (char *) de2,
blocksize);
+ if (csum_size) {
+ t = EXT4_DIRENT_TAIL(data2, blocksize);
+ initialize_dirent_tail(t, blocksize);
+
+ t = EXT4_DIRENT_TAIL(data1, blocksize);
+ initialize_dirent_tail(t, blocksize);
+ }
+
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));

@@ -1377,7 +1526,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
de = de2;
}
dx_insert_block(frame, hash2 + continued, newblock);
- err = ext4_handle_dirty_metadata(handle, dir, bh2);
+ err = ext4_handle_dirty_dirent_node(handle, dir, bh2);
if (err)
goto journal_error;
err = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
@@ -1417,11 +1566,16 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
unsigned short reclen;
int nlen, rlen, err;
char *top;
+ int csum_size = 0;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);

reclen = EXT4_DIR_REC_LEN(namelen);
if (!de) {
de = (struct ext4_dir_entry_2 *)bh->b_data;
- top = bh->b_data + blocksize - reclen;
+ top = bh->b_data + (blocksize - csum_size) - reclen;
while ((char *) de <= top) {
if (ext4_check_dir_entry(dir, NULL, de, bh, offset))
return -EIO;
@@ -1477,7 +1631,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
dir->i_version++;
ext4_mark_inode_dirty(handle, dir);
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, dir, bh);
+ err = ext4_handle_dirty_dirent_node(handle, dir, bh);
if (err)
ext4_std_error(dir->i_sb, err);
return 0;
@@ -1498,6 +1652,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
struct dx_frame frames[2], *frame;
struct dx_entry *entries;
struct ext4_dir_entry_2 *de, *de2;
+ struct ext4_dir_entry_tail *t;
char *data1, *top;
unsigned len;
int retval;
@@ -1505,6 +1660,11 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
struct dx_hash_info hinfo;
ext4_lblk_t block;
struct fake_dirent *fde;
+ int csum_size = 0;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);

blocksize = dir->i_sb->s_blocksize;
dxtrace(printk(KERN_DEBUG "Creating index: inode %lu\n", dir->i_ino));
@@ -1525,7 +1685,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
brelse(bh);
return -EIO;
}
- len = ((char *) root) + blocksize - (char *) de;
+ len = ((char *) root) + (blocksize - csum_size) - (char *) de;

/* Allocate new block for the 0th block's dirents */
bh2 = ext4_append(handle, dir, &block, &retval);
@@ -1541,8 +1701,15 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
top = data1 + len;
while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top)
de = de2;
- de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
+ de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
+ (char *) de,
blocksize);
+
+ if (csum_size) {
+ t = EXT4_DIRENT_TAIL(data1, blocksize);
+ initialize_dirent_tail(t, blocksize);
+ }
+
/* Initialize the root; the dot dirents already exist */
de = (struct ext4_dir_entry_2 *) (&root->dotdot);
de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2),
@@ -1568,7 +1735,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
bh = bh2;

ext4_handle_dirty_dx_node(handle, dir, frame->bh);
- ext4_handle_dirty_metadata(handle, dir, bh);
+ ext4_handle_dirty_dirent_node(handle, dir, bh);

de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
if (!de) {
@@ -1604,11 +1771,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
struct inode *dir = dentry->d_parent->d_inode;
struct buffer_head *bh;
struct ext4_dir_entry_2 *de;
+ struct ext4_dir_entry_tail *t;
struct super_block *sb;
int retval;
int dx_fallback=0;
unsigned blocksize;
ext4_lblk_t block, blocks;
+ int csum_size = 0;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);

sb = dir->i_sb;
blocksize = sb->s_blocksize;
@@ -1627,6 +1800,11 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
bh = ext4_bread(handle, dir, block, 0, &retval);
if(!bh)
return retval;
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data))
+ return -EIO;
+ set_buffer_verified(bh);
retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
if (retval != -ENOSPC) {
brelse(bh);
@@ -1643,7 +1821,13 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
return retval;
de = (struct ext4_dir_entry_2 *) bh->b_data;
de->inode = 0;
- de->rec_len = ext4_rec_len_to_disk(blocksize, blocksize);
+ de->rec_len = ext4_rec_len_to_disk(blocksize - csum_size, blocksize);
+
+ if (csum_size) {
+ t = EXT4_DIRENT_TAIL(bh->b_data, blocksize);
+ initialize_dirent_tail(t, blocksize);
+ }
+
retval = add_dirent_to_buf(handle, dentry, inode, de, bh);
brelse(bh);
if (retval == 0)
@@ -1675,6 +1859,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
if (!(bh = ext4_bread(handle,dir, dx_get_block(frame->at), 0, &err)))
goto cleanup;

+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
+ goto journal_error;
+ set_buffer_verified(bh);
+
BUFFER_TRACE(bh, "get_write_access");
err = ext4_journal_get_write_access(handle, bh);
if (err)
@@ -1800,12 +1989,17 @@ static int ext4_delete_entry(handle_t *handle,
{
struct ext4_dir_entry_2 *de, *pde;
unsigned int blocksize = dir->i_sb->s_blocksize;
+ int csum_size = 0;
int i, err;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
+
i = 0;
pde = NULL;
de = (struct ext4_dir_entry_2 *) bh->b_data;
- while (i < bh->b_size) {
+ while (i < bh->b_size - csum_size) {
if (ext4_check_dir_entry(dir, NULL, de, bh, i))
return -EIO;
if (de == de_del) {
@@ -1826,7 +2020,7 @@ static int ext4_delete_entry(handle_t *handle,
de->inode = 0;
dir->i_version++;
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, dir, bh);
+ err = ext4_handle_dirty_dirent_node(handle, dir, bh);
if (unlikely(err)) {
ext4_std_error(dir->i_sb, err);
return err;
@@ -1969,9 +2163,15 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, int mode)
struct inode *inode;
struct buffer_head *dir_block = NULL;
struct ext4_dir_entry_2 *de;
+ struct ext4_dir_entry_tail *t;
unsigned int blocksize = dir->i_sb->s_blocksize;
+ int csum_size = 0;
int err, retries = 0;

+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
+
if (EXT4_DIR_LINK_MAX(dir))
return -EMLINK;

@@ -2012,16 +2212,24 @@ retry:
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
de = ext4_next_entry(de, blocksize);
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(1),
+ de->rec_len = ext4_rec_len_to_disk(blocksize -
+ (csum_size + EXT4_DIR_REC_LEN(1)),
blocksize);
de->name_len = 2;
strcpy(de->name, "..");
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
inode->i_nlink = 2;
+
+ if (csum_size) {
+ t = EXT4_DIRENT_TAIL(dir_block->b_data, blocksize);
+ initialize_dirent_tail(t, blocksize);
+ }
+
BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, inode, dir_block);
+ err = ext4_handle_dirty_dirent_node(handle, inode, dir_block);
if (err)
goto out_clear_inode;
+ set_buffer_verified(dir_block);
err = ext4_mark_inode_dirty(handle, inode);
if (!err)
err = ext4_add_entry(handle, dentry, inode);
@@ -2071,6 +2279,14 @@ static int empty_dir(struct inode *inode)
inode->i_ino);
return 1;
}
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(inode,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(inode, "checksum error reading directory "
+ "lblock 0");
+ return -EIO;
+ }
+ set_buffer_verified(bh);
de = (struct ext4_dir_entry_2 *) bh->b_data;
de1 = ext4_next_entry(de, sb->s_blocksize);
if (le32_to_cpu(de->inode) != inode->i_ino ||
@@ -2102,6 +2318,14 @@ static int empty_dir(struct inode *inode)
offset += sb->s_blocksize;
continue;
}
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(inode,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(inode, "checksum error "
+ "reading directory lblock 0");
+ return -EIO;
+ }
+ set_buffer_verified(bh);
de = (struct ext4_dir_entry_2 *) bh->b_data;
}
if (ext4_check_dir_entry(inode, NULL, de, bh, offset)) {
@@ -2602,6 +2826,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
dir_bh = ext4_bread(handle, old_inode, 0, 0, &retval);
if (!dir_bh)
goto end_rename;
+ if (!buffer_verified(dir_bh) &&
+ !ext4_dirent_csum_verify(old_inode,
+ (struct ext4_dir_entry *)dir_bh->b_data))
+ goto end_rename;
+ set_buffer_verified(dir_bh);
if (le32_to_cpu(PARENT_INO(dir_bh->b_data,
old_dir->i_sb->s_blocksize)) != old_dir->i_ino)
goto end_rename;
@@ -2632,7 +2861,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
ext4_current_time(new_dir);
ext4_mark_inode_dirty(handle, new_dir);
BUFFER_TRACE(new_bh, "call ext4_handle_dirty_metadata");
- retval = ext4_handle_dirty_metadata(handle, new_dir, new_bh);
+ retval = ext4_handle_dirty_dirent_node(handle, new_dir, new_bh);
if (unlikely(retval)) {
ext4_std_error(new_dir->i_sb, retval);
goto end_rename;
@@ -2686,7 +2915,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
cpu_to_le32(new_dir->i_ino);
BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
- retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
+ retval = ext4_handle_dirty_dirent_node(handle, old_inode,
+ dir_bh);
if (retval) {
ext4_std_error(old_dir->i_sb, retval);
goto end_rename;

2011-10-08 07:55:49

by djwong

[permalink] [raw]
Subject: [PATCH 19/28] ext4: Calculate and verify checksums of extended attribute blocks

Calculate and verify the checksums of extended attribute blocks. This only
applies to separate EA blocks that are pointed to by inode->i_file_acl (i.e.
external EA blocks); the checksum lives in the EA header.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/xattr.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++---------
fs/ext4/xattr.h | 4 ++-
2 files changed, 75 insertions(+), 15 deletions(-)


diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c757adc..1ab85dd 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -122,6 +122,57 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};

+static __le32 ext4_xattr_block_csum(struct inode *inode,
+ sector_t block_nr,
+ struct ext4_xattr_header *hdr)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ __u32 crc, old;
+
+ old = hdr->h_checksum;
+ hdr->h_checksum = 0;
+ if (le32_to_cpu(hdr->h_refcount) != 1) {
+ block_nr = cpu_to_le64(block_nr);
+ crc = ext4_chksum(sbi, sbi->s_uuid_crc, (__u8 *)&block_nr,
+ sizeof(block_nr));
+ } else
+ crc = ei->i_uuid_inum_crc;
+ crc = ext4_chksum(sbi, crc, (__u8 *)hdr, EXT4_BLOCK_SIZE(inode->i_sb));
+ hdr->h_checksum = old;
+ return cpu_to_le32(crc);
+}
+
+static int ext4_xattr_block_csum_verify(struct inode *inode,
+ sector_t block_nr,
+ struct ext4_xattr_header *hdr)
+{
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
+ (hdr->h_checksum != ext4_xattr_block_csum(inode, block_nr, hdr)))
+ return 0;
+ return 1;
+}
+
+static void ext4_xattr_block_csum_set(struct inode *inode,
+ sector_t block_nr,
+ struct ext4_xattr_header *hdr)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ hdr->h_checksum = ext4_xattr_block_csum(inode, block_nr, hdr);
+}
+
+static inline int ext4_handle_dirty_xattr_block(handle_t *handle,
+ struct inode *inode,
+ struct buffer_head *bh)
+{
+ ext4_xattr_block_csum_set(inode, bh->b_blocknr, BHDR(bh));
+ return ext4_handle_dirty_metadata(handle, inode, bh);
+}
+
static inline const struct xattr_handler *
ext4_xattr_handler(int name_index)
{
@@ -156,14 +207,21 @@ ext4_xattr_check_names(struct ext4_xattr_entry *entry, void *end)
}

static inline int
-ext4_xattr_check_block(struct buffer_head *bh)
+ext4_xattr_check_block(struct inode *inode, struct buffer_head *bh)
{
int error;

+ if (buffer_verified(bh))
+ return 0;
+
if (BHDR(bh)->h_magic != cpu_to_le32(EXT4_XATTR_MAGIC) ||
BHDR(bh)->h_blocks != cpu_to_le32(1))
return -EIO;
+ if (!ext4_xattr_block_csum_verify(inode, bh->b_blocknr, BHDR(bh)))
+ return -EIO;
error = ext4_xattr_check_names(BFIRST(bh), bh->b_data + bh->b_size);
+ if (!error)
+ set_buffer_verified(bh);
return error;
}

@@ -226,7 +284,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
goto cleanup;
ea_bdebug(bh, "b_count=%d, refcount=%d",
atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
- if (ext4_xattr_check_block(bh)) {
+ if (ext4_xattr_check_block(inode, bh)) {
bad_block:
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
@@ -370,7 +428,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
goto cleanup;
ea_bdebug(bh, "b_count=%d, refcount=%d",
atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
- if (ext4_xattr_check_block(bh)) {
+ if (ext4_xattr_check_block(inode, bh)) {
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
error = -EIO;
@@ -489,7 +547,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
EXT4_FREE_BLOCKS_FORGET);
} else {
le32_add_cpu(&BHDR(bh)->h_refcount, -1);
- error = ext4_handle_dirty_metadata(handle, inode, bh);
+ error = ext4_handle_dirty_xattr_block(handle, inode, bh);
if (IS_SYNC(inode))
ext4_handle_sync(handle);
dquot_free_block(inode, 1);
@@ -662,7 +720,7 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
ea_bdebug(bs->bh, "b_count=%d, refcount=%d",
atomic_read(&(bs->bh->b_count)),
le32_to_cpu(BHDR(bs->bh)->h_refcount));
- if (ext4_xattr_check_block(bs->bh)) {
+ if (ext4_xattr_check_block(inode, bs->bh)) {
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
error = -EIO;
@@ -725,9 +783,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (error == -EIO)
goto bad_block;
if (!error)
- error = ext4_handle_dirty_metadata(handle,
- inode,
- bs->bh);
+ error = ext4_handle_dirty_xattr_block(handle,
+ inode,
+ bs->bh);
if (error)
goto cleanup;
goto inserted;
@@ -796,9 +854,9 @@ inserted:
ea_bdebug(new_bh, "reusing; refcount now=%d",
le32_to_cpu(BHDR(new_bh)->h_refcount));
unlock_buffer(new_bh);
- error = ext4_handle_dirty_metadata(handle,
- inode,
- new_bh);
+ error = ext4_handle_dirty_xattr_block(handle,
+ inode,
+ new_bh);
if (error)
goto cleanup_dquot;
}
@@ -848,8 +906,8 @@ getblk_failed:
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
ext4_xattr_cache_insert(new_bh);
- error = ext4_handle_dirty_metadata(handle,
- inode, new_bh);
+ error = ext4_handle_dirty_xattr_block(handle,
+ inode, new_bh);
if (error)
goto cleanup;
}
@@ -1190,7 +1248,7 @@ retry:
error = -EIO;
if (!bh)
goto cleanup;
- if (ext4_xattr_check_block(bh)) {
+ if (ext4_xattr_check_block(inode, bh)) {
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
error = -EIO;
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 25b7387..91f31ca 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -27,7 +27,9 @@ struct ext4_xattr_header {
__le32 h_refcount; /* reference count */
__le32 h_blocks; /* number of disk blocks used */
__le32 h_hash; /* hash value of all attributes */
- __u32 h_reserved[4]; /* zero right now */
+ __le32 h_checksum; /* crc32c(uuid+id+xattrblock) */
+ /* id = inum if refcount=1, blknum otherwise */
+ __u32 h_reserved[3]; /* zero right now */
};

struct ext4_xattr_ibody_header {


2011-10-08 07:55:56

by djwong

[permalink] [raw]
Subject: [PATCH 20/28] ext4: Add new feature to make block group checksums use metadata_csum algorithm

Add a new feature flag to enable block group descriptors to use the (faster)
metadata_csum checksum algorithm.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/balloc.c | 3 +-
fs/ext4/ext4.h | 8 ++++---
fs/ext4/ialloc.c | 11 ++++-----
fs/ext4/mballoc.c | 6 ++---
fs/ext4/resize.c | 2 +-
fs/ext4/super.c | 64 ++++++++++++++++++++++++++++++++++++++---------------
6 files changed, 61 insertions(+), 33 deletions(-)


diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index dd96a7d..7976a8b 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -180,8 +180,7 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
bh->b_data);
ext4_block_bitmap_csum_set(sb, block_group, gdp, bh,
EXT4_BLOCKS_PER_GROUP(sb) / 8);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group,
- gdp);
+ ext4_group_desc_csum_set(sbi, block_group, gdp);
}
return free_blocks - ext4_group_used_meta_blocks(sb, block_group, gdp);
}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 48e116f..4325d1f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1401,6 +1401,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_INCOMPAT_FLEX_BG 0x0200
#define EXT4_FEATURE_INCOMPAT_EA_INODE 0x0400 /* EA in inode */
#define EXT4_FEATURE_INCOMPAT_DIRDATA 0x1000 /* data in dirent */
+#define EXT4_FEATURE_INCOMPAT_BG_USE_META_CSUM 0x2000

#define EXT2_FEATURE_COMPAT_SUPP EXT4_FEATURE_COMPAT_EXT_ATTR
#define EXT2_FEATURE_INCOMPAT_SUPP (EXT4_FEATURE_INCOMPAT_FILETYPE| \
@@ -1424,7 +1425,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_INCOMPAT_EXTENTS| \
EXT4_FEATURE_INCOMPAT_64BIT| \
EXT4_FEATURE_INCOMPAT_FLEX_BG| \
- EXT4_FEATURE_INCOMPAT_MMP)
+ EXT4_FEATURE_INCOMPAT_MMP| \
+ EXT4_FEATURE_INCOMPAT_BG_USE_META_CSUM)
#define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
@@ -2034,10 +2036,10 @@ extern void ext4_used_dirs_set(struct super_block *sb,
struct ext4_group_desc *bg, __u32 count);
extern void ext4_itable_unused_set(struct super_block *sb,
struct ext4_group_desc *bg, __u32 count);
-extern __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 group,
- struct ext4_group_desc *gdp);
extern int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 group,
struct ext4_group_desc *gdp);
+extern void ext4_group_desc_csum_set(struct ext4_sb_info *sbi, __u32 group,
+ struct ext4_group_desc *gdp);

static inline ext4_fsblk_t ext4_blocks_count(struct ext4_super_block *es)
{
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 77ae3df..7d2d27d 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -92,7 +92,7 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
bh->b_data);
ext4_inode_bitmap_csum_set(sb, block_group, gdp, bh,
EXT4_INODES_PER_GROUP(sb) / 8);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+ ext4_group_desc_csum_set(sbi, block_group, gdp);

return EXT4_INODES_PER_GROUP(sb);
}
@@ -287,7 +287,7 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
}
ext4_inode_bitmap_csum_set(sb, block_group, gdp, bitmap_bh,
EXT4_INODES_PER_GROUP(sb) / 8);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+ ext4_group_desc_csum_set(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);

percpu_counter_inc(&sbi->s_freeinodes_counter);
@@ -808,7 +808,7 @@ static int ext4_claim_inode(struct super_block *sb,
}
ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh,
EXT4_INODES_PER_GROUP(sb) / 8);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
+ ext4_group_desc_csum_set(sbi, group, gdp);
err_ret:
ext4_unlock_group(sb, group);
up_read(&grp->alloc_sem);
@@ -985,8 +985,7 @@ got:
block_bitmap_bh,
EXT4_BLOCKS_PER_GROUP(sb) /
8);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
- gdp);
+ ext4_group_desc_csum_set(sbi, group, gdp);
}
ext4_unlock_group(sb, group);

@@ -1359,7 +1358,7 @@ extern int ext4_init_inode_table(struct super_block *sb, ext4_group_t group,
skip_zeroout:
ext4_lock_group(sb, group);
gdp->bg_flags |= cpu_to_le16(EXT4_BG_INODE_ZEROED);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
+ ext4_group_desc_csum_set(sbi, group, gdp);
ext4_unlock_group(sb, group);

BUFFER_TRACE(group_desc_bh,
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 3d6300a..be62df1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2879,7 +2879,7 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
ext4_free_blks_set(sb, gdp, len);
ext4_block_bitmap_csum_set(sb, ac->ac_b_ex.fe_group, gdp, bitmap_bh,
EXT4_BLOCKS_PER_GROUP(sb) / 8);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);
+ ext4_group_desc_csum_set(sbi, ac->ac_b_ex.fe_group, gdp);

ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
percpu_counter_sub(&sbi->s_freeblocks_counter, ac->ac_b_ex.fe_len);
@@ -4690,7 +4690,7 @@ do_more:
ext4_free_blks_set(sb, gdp, ret);
ext4_block_bitmap_csum_set(sb, block_group, gdp, bitmap_bh,
EXT4_BLOCKS_PER_GROUP(sb) / 8);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+ ext4_group_desc_csum_set(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeblocks_counter, count);

@@ -4834,7 +4834,7 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
ext4_free_blks_set(sb, desc, blk_free_count);
ext4_block_bitmap_csum_set(sb, block_group, desc, bitmap_bh,
EXT4_BLOCKS_PER_GROUP(sb) / 8);
- desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
+ ext4_group_desc_csum_set(sbi, block_group, desc);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);

diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 2ad7008..f0a1f02 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -880,7 +880,7 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
ext4_free_blks_set(sb, gdp, input->free_blocks_count);
ext4_free_inodes_set(sb, gdp, EXT4_INODES_PER_GROUP(sb));
gdp->bg_flags = cpu_to_le16(EXT4_BG_INODE_ZEROED);
- gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);
+ ext4_group_desc_csum_set(sbi, input->group, gdp);

/*
* We can allocate memory for mb_alloc based on the new group
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 58157ec..5538fb7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2070,29 +2070,48 @@ failed:
return 0;
}

-__le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
- struct ext4_group_desc *gdp)
+static __le16 ext4_group_desc_csum(struct ext4_sb_info *sbi, __u32 block_group,
+ struct ext4_group_desc *gdp)
{
+ int offset;
__u16 crc = 0;
+ __le32 le_group = cpu_to_le32(block_group);

- if (sbi->s_es->s_feature_ro_compat &
- cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)) {
- int offset = offsetof(struct ext4_group_desc, bg_checksum);
- __le32 le_group = cpu_to_le32(block_group);
-
- crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
- crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
- crc = crc16(crc, (__u8 *)gdp, offset);
- offset += sizeof(gdp->bg_checksum); /* skip checksum */
- /* for checksum of struct ext4_group_desc do the rest...*/
- if ((sbi->s_es->s_feature_incompat &
- cpu_to_le32(EXT4_FEATURE_INCOMPAT_64BIT)) &&
- offset < le16_to_cpu(sbi->s_es->s_desc_size))
- crc = crc16(crc, (__u8 *)gdp + offset,
- le16_to_cpu(sbi->s_es->s_desc_size) -
- offset);
+ if ((sbi->s_es->s_feature_ro_compat &
+ cpu_to_le32(EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) &&
+ (sbi->s_es->s_feature_incompat &
+ cpu_to_le32(EXT4_FEATURE_INCOMPAT_BG_USE_META_CSUM))) {
+ /* Use new metadata_csum algorithm */
+ __u16 old_crc;
+ __u32 crc32;
+
+ old_crc = gdp->bg_checksum;
+ gdp->bg_checksum = 0;
+ crc32 = ext4_chksum(sbi, sbi->s_uuid_crc, (__u8 *)&le_group,
+ sizeof(le_group));
+ crc32 = ext4_chksum(sbi, crc32, (__u8 *)gdp, sbi->s_desc_size);
+ gdp->bg_checksum = old_crc;
+
+ crc = crc32 & 0xFFFF;
+ goto out;
}

+ /* old crc16 code */
+ offset = offsetof(struct ext4_group_desc, bg_checksum);
+
+ crc = crc16(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+ crc = crc16(crc, (__u8 *)&le_group, sizeof(le_group));
+ crc = crc16(crc, (__u8 *)gdp, offset);
+ offset += sizeof(gdp->bg_checksum); /* skip checksum */
+ /* for checksum of struct ext4_group_desc do the rest...*/
+ if ((sbi->s_es->s_feature_incompat &
+ cpu_to_le32(EXT4_FEATURE_INCOMPAT_64BIT)) &&
+ offset < le16_to_cpu(sbi->s_es->s_desc_size))
+ crc = crc16(crc, (__u8 *)gdp + offset,
+ le16_to_cpu(sbi->s_es->s_desc_size) -
+ offset);
+
+out:
return cpu_to_le16(crc);
}

@@ -2107,6 +2126,15 @@ int ext4_group_desc_csum_verify(struct ext4_sb_info *sbi, __u32 block_group,
return 1;
}

+void ext4_group_desc_csum_set(struct ext4_sb_info *sbi, __u32 block_group,
+ struct ext4_group_desc *gdp)
+{
+ if (!(sbi->s_es->s_feature_ro_compat &
+ cpu_to_le32(EXT4_FEATURE_RO_COMPAT_GDT_CSUM)))
+ return;
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
+}
+
/* Called at mount-time, super-block is locked */
static int ext4_check_descriptors(struct super_block *sb,
ext4_group_t *first_not_zeroed)

2011-10-08 07:56:02

by djwong

[permalink] [raw]
Subject: [PATCH 21/28] ext4: Add checksums to the MMP block

Compute and verify a checksum for the MMP block.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 6 +++++-
fs/ext4/mmp.c | 44 +++++++++++++++++++++++++++++++++++++++-----
2 files changed, 44 insertions(+), 6 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 4325d1f..0df2dd1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1743,7 +1743,8 @@ struct mmp_struct {
__le16 mmp_check_interval;

__le16 mmp_pad1;
- __le32 mmp_pad2[227];
+ __le32 mmp_pad2[226];
+ __le32 mmp_checksum; /* crc32c(uuid+mmp_block) */
};

/* arguments passed to the mmp thread */
@@ -2328,6 +2329,9 @@ extern int ext4_bio_write_page(struct ext4_io_submit *io,

/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
+extern void ext4_mmp_csum_set(struct super_block *sb, struct mmp_struct *mmp);
+extern int ext4_mmp_csum_verify(struct super_block *sb,
+ struct mmp_struct *mmp);

/* BH_Uninit flag: blocks are allocated but uninitialized on disk */
enum ext4_state_bits {
diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index a7a4986..77415b2 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -6,12 +6,45 @@

#include "ext4.h"

+/* Checksumming functions */
+static __u32 ext4_mmp_csum(struct super_block *sb, struct mmp_struct *mmp)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int offset = offsetof(struct mmp_struct, mmp_checksum);
+ __u32 crc;
+
+ crc = ext4_chksum(sbi, sbi->s_uuid_crc, (char *)mmp, offset);
+
+ return cpu_to_le32(crc);
+}
+
+int ext4_mmp_csum_verify(struct super_block *sb, struct mmp_struct *mmp)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ return mmp->mmp_checksum == ext4_mmp_csum(sb, mmp);
+}
+
+void ext4_mmp_csum_set(struct super_block *sb, struct mmp_struct *mmp)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ mmp->mmp_checksum = ext4_mmp_csum(sb, mmp);
+}
+
/*
* Write the MMP block using WRITE_SYNC to try to get the block on-disk
* faster.
*/
-static int write_mmp_block(struct buffer_head *bh)
+static int write_mmp_block(struct super_block *sb, struct buffer_head *bh)
{
+ struct mmp_struct *mmp = (struct mmp_struct *)(bh->b_data);
+
+ ext4_mmp_csum_set(sb, mmp);
mark_buffer_dirty(bh);
lock_buffer(bh);
bh->b_end_io = end_buffer_write_sync;
@@ -59,7 +92,8 @@ static int read_mmp_block(struct super_block *sb, struct buffer_head **bh,
}

mmp = (struct mmp_struct *)((*bh)->b_data);
- if (le32_to_cpu(mmp->mmp_magic) != EXT4_MMP_MAGIC)
+ if (le32_to_cpu(mmp->mmp_magic) != EXT4_MMP_MAGIC ||
+ !ext4_mmp_csum_verify(sb, mmp))
return -EINVAL;

return 0;
@@ -120,7 +154,7 @@ static int kmmpd(void *data)
mmp->mmp_time = cpu_to_le64(get_seconds());
last_update_time = jiffies;

- retval = write_mmp_block(bh);
+ retval = write_mmp_block(sb, bh);
/*
* Don't spew too many error messages. Print one every
* (s_mmp_update_interval * 60) seconds.
@@ -199,7 +233,7 @@ static int kmmpd(void *data)
mmp->mmp_seq = cpu_to_le32(EXT4_MMP_SEQ_CLEAN);
mmp->mmp_time = cpu_to_le64(get_seconds());

- retval = write_mmp_block(bh);
+ retval = write_mmp_block(sb, bh);

failed:
kfree(data);
@@ -298,7 +332,7 @@ skip:
seq = mmp_new_seq();
mmp->mmp_seq = cpu_to_le32(seq);

- retval = write_mmp_block(bh);
+ retval = write_mmp_block(sb, bh);
if (retval)
goto failed;


2011-10-08 07:56:29

by djwong

[permalink] [raw]
Subject: [PATCH 25/28] jbd2: Checksum revocation blocks

Compute and verify revoke blocks inside the journal.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/recovery.c | 35 +++++++++++++++++++++++++++++++++--
fs/jbd2/revoke.c | 27 ++++++++++++++++++++++++++-
include/linux/jbd2.h | 4 ++++
3 files changed, 63 insertions(+), 3 deletions(-)


diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 1cad869..4f4eda4 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -662,8 +662,17 @@ static int do_one_pass(journal_t *journal,
err = scan_revoke_records(journal, bh,
next_commit_ID, info);
brelse(bh);
- if (err)
- goto failed;
+ if (err) {
+ if (err != -EINVAL)
+ goto failed;
+ /*
+ * Ignoring corrupt revoke blocks is safe
+ * because at worst it results in unnecessary
+ * writes during recovery.
+ */
+ jbd_debug(3, "Skipping corrupt revoke "
+ "block.\n");
+ }
continue;

default:
@@ -703,6 +712,25 @@ static int do_one_pass(journal_t *journal,
return err;
}

+static int jbd2_revoke_block_csum_verify(journal_t *j,
+ void *buf)
+{
+ struct jbd2_journal_revoke_tail *tail;
+ __u32 provided, calculated;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return 1;
+
+ tail = (struct jbd2_journal_revoke_tail *)(buf + j->j_blocksize -
+ sizeof(struct jbd2_journal_revoke_tail));
+ provided = tail->r_checksum;
+ tail->r_checksum = 0;
+ calculated = jbd2_chksum(j, j->j_uuid_crc, buf, j->j_blocksize);
+ tail->r_checksum = provided;
+
+ provided = be32_to_cpu(provided);
+ return provided == calculated;
+}

/* Scan a revoke record, marking all blocks mentioned as revoked. */

@@ -717,6 +745,9 @@ static int scan_revoke_records(journal_t *journal, struct buffer_head *bh,
offset = sizeof(jbd2_journal_revoke_header_t);
max = be32_to_cpu(header->r_count);

+ if (!jbd2_revoke_block_csum_verify(journal, header))
+ return -EINVAL;
+
if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
record_len = 8;

diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 69fd935..e704d4a 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -548,6 +548,7 @@ static void write_one_revoke_record(journal_t *journal,
struct jbd2_revoke_record_s *record,
int write_op)
{
+ int csum_size = 0;
struct journal_head *descriptor;
int offset;
journal_header_t *header;
@@ -562,9 +563,13 @@ static void write_one_revoke_record(journal_t *journal,
descriptor = *descriptorp;
offset = *offsetp;

+ /* Do we need to leave space at the end for a checksum? */
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ csum_size = sizeof(struct jbd2_journal_revoke_tail);
+
/* Make sure we have a descriptor with space left for the record */
if (descriptor) {
- if (offset == journal->j_blocksize) {
+ if (offset >= journal->j_blocksize - csum_size) {
flush_descriptor(journal, descriptor, offset, write_op);
descriptor = NULL;
}
@@ -601,6 +606,24 @@ static void write_one_revoke_record(journal_t *journal,
*offsetp = offset;
}

+static void jbd2_revoke_csum_set(journal_t *j,
+ struct journal_head *descriptor)
+{
+ struct jbd2_journal_revoke_tail *tail;
+ __u32 crc;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return;
+
+ tail = (struct jbd2_journal_revoke_tail *)
+ (jh2bh(descriptor)->b_data + j->j_blocksize -
+ sizeof(struct jbd2_journal_revoke_tail));
+ tail->r_checksum = 0;
+ crc = jbd2_chksum(j, j->j_uuid_crc, jh2bh(descriptor)->b_data,
+ j->j_blocksize);
+ tail->r_checksum = cpu_to_be32(crc);
+}
+
/*
* Flush a revoke descriptor out to the journal. If we are aborting,
* this is a noop; otherwise we are generating a buffer which needs to
@@ -622,6 +645,8 @@ static void flush_descriptor(journal_t *journal,

header = (jbd2_journal_revoke_header_t *) jh2bh(descriptor)->b_data;
header->r_count = cpu_to_be32(offset);
+ jbd2_revoke_csum_set(journal, descriptor);
+
set_buffer_jwrite(bh);
BUFFER_TRACE(bh, "write");
set_buffer_dirty(bh);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index be44094..125d5be 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -194,6 +194,10 @@ typedef struct jbd2_journal_revoke_header_s
__be32 r_count; /* Count of bytes used in the block */
} jbd2_journal_revoke_header_t;

+/* Tail of revoke block, for checksumming */
+struct jbd2_journal_revoke_tail {
+ __be32 r_checksum; /* crc32c(uuid+revoke_block) */
+};

/* Definitions for the journal tag flags word: */
#define JBD2_FLAG_ESCAPE 1 /* on-disk block is escaped */

2011-10-08 07:56:15

by djwong

[permalink] [raw]
Subject: [PATCH 23/28] jbd2: Grab a reference to the crc32c driver only when necessary

Obtain a reference to the crc32c driver only when the journal needs it for
checksum v2.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/Kconfig | 2 ++
fs/jbd2/journal.c | 22 ++++++++++++++++++++++
include/linux/jbd2.h | 23 +++++++++++++++++++++++
3 files changed, 47 insertions(+), 0 deletions(-)


diff --git a/fs/jbd2/Kconfig b/fs/jbd2/Kconfig
index f32f346..69a48c2 100644
--- a/fs/jbd2/Kconfig
+++ b/fs/jbd2/Kconfig
@@ -1,6 +1,8 @@
config JBD2
tristate
select CRC32
+ select CRYPTO
+ select CRYPTO_CRC32C
help
This is a generic journaling layer for block devices that support
both 32-bit and 64-bit block numbers. It is currently used by
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 8d8985e..5c76a8b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1276,6 +1276,16 @@ static int journal_get_superblock(journal_t *journal)
goto out;
}

+ /* Load the checksum driver */
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
+ journal->j_chksum_driver = crypto_alloc_shash("crc32c", 0, 0);
+ if (IS_ERR(journal->j_chksum_driver)) {
+ err = PTR_ERR(journal->j_chksum_driver);
+ journal->j_chksum_driver = NULL;
+ goto out;
+ }
+ }
+
set_buffer_verified(bh);

return 0;
@@ -1432,6 +1442,8 @@ int jbd2_journal_destroy(journal_t *journal)
iput(journal->j_inode);
if (journal->j_revoke)
jbd2_journal_destroy_revoke(journal);
+ if (journal->j_chksum_driver)
+ crypto_free_shash(journal->j_chksum_driver);
kfree(journal->j_wbuf);
kfree(journal);

@@ -1548,6 +1560,16 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
sb->s_checksum_type = JBD2_CRC32C_CHKSUM;
sb->s_feature_compat &=
~cpu_to_be32(JBD2_FEATURE_COMPAT_CHECKSUM);
+
+ /* Load the checksum driver */
+ if (journal->j_chksum_driver == NULL) {
+ journal->j_chksum_driver = crypto_alloc_shash("crc32c",
+ 0, 0);
+ if (IS_ERR(journal->j_chksum_driver)) {
+ journal->j_chksum_driver = NULL;
+ return 0;
+ }
+ }
}

/* If enabling v1 checksums, downgrade superblock */
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 69ca172..53df865 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -31,6 +31,7 @@
#include <linux/mutex.h>
#include <linux/timer.h>
#include <linux/slab.h>
+#include <crypto/hash.h>
#endif

#define journal_oom_retry 1
@@ -1010,6 +1011,9 @@ struct journal_s

/* Precomputed UUID checksum */
__u32 j_uuid_crc;
+
+ /* Reference to checksum algorithm driver via cryptoapi */
+ struct crypto_shash *j_chksum_driver;
};

/*
@@ -1328,6 +1332,25 @@ static inline int jbd_space_needed(journal_t *journal)

extern int jbd_blocks_per_page(struct inode *inode);

+static inline u32 jbd2_chksum(journal_t *journal, u32 crc,
+ const void *address, unsigned int length)
+{
+ struct {
+ struct shash_desc shash;
+ char ctx[crypto_shash_descsize(journal->j_chksum_driver)];
+ } desc;
+ int err;
+
+ desc.shash.tfm = journal->j_chksum_driver;
+ desc.shash.flags = 0;
+ *(u32 *)desc.ctx = crc;
+
+ err = crypto_shash_update(&desc.shash, address, length);
+ BUG_ON(err);
+
+ return *(u32 *)desc.ctx;
+}
+
#ifdef __KERNEL__

#define buffer_trace_init(bh) do {} while (0)

2011-10-08 07:56:48

by djwong

[permalink] [raw]
Subject: [PATCH 28/28] jbd2: Checksum data blocks that are stored in the journal

Calculate and verify checksums of each data block being stored in the journal.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/commit.c | 22 ++++++++++++++++++++++
fs/jbd2/journal.c | 10 ++++++++--
fs/jbd2/recovery.c | 30 ++++++++++++++++++++++++++++++
include/linux/jbd2.h | 1 +
4 files changed, 61 insertions(+), 2 deletions(-)


diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 2bce8dc..0459e19 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -339,6 +339,26 @@ static void jbd2_descr_block_csum_set(journal_t *j,
tail->t_checksum = cpu_to_be32(crc);
}

+static void jbd2_block_tag_csum_set(journal_t *j, journal_block_tag_t *tag,
+ struct buffer_head *bh, __u32 sequence)
+{
+ struct page *page = bh->b_page;
+ __u8 *addr;
+ __u32 crc;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return;
+
+ sequence = cpu_to_be32(sequence);
+ addr = kmap_atomic(page, KM_USER0);
+ crc = jbd2_chksum(j, j->j_uuid_crc, (__u8 *)&sequence,
+ sizeof(sequence));
+ crc = jbd2_chksum(j, crc, addr + offset_in_page(bh->b_data),
+ bh->b_size);
+ kunmap_atomic(addr, KM_USER0);
+
+ tag->t_checksum = cpu_to_be32(crc);
+}
/*
* jbd2_journal_commit_transaction
*
@@ -649,6 +669,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
tag = (journal_block_tag_t *) tagp;
write_tag_block(tag_bytes, tag, jh2bh(jh)->b_blocknr);
tag->t_flags = cpu_to_be32(tag_flag);
+ jbd2_block_tag_csum_set(journal, tag, jh2bh(new_jh),
+ commit_transaction->t_tid);
tagp += tag_bytes;
space_left -= tag_bytes;

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index d494c14..24a0241 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1981,10 +1981,16 @@ int jbd2_journal_blocks_per_page(struct inode *inode)
*/
size_t journal_tag_bytes(journal_t *journal)
{
+ journal_block_tag_t tag;
+ size_t x = 0;
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ x += sizeof(tag.t_checksum);
+
if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
- return JBD2_TAG_SIZE64;
+ return x + JBD2_TAG_SIZE64;
else
- return JBD2_TAG_SIZE32;
+ return x + JBD2_TAG_SIZE32;
}

/*
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 3db9625..ec315ba 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -390,6 +390,23 @@ static int jbd2_commit_block_csum_verify(journal_t *j, void *buf)
return provided == calculated;
}

+static int jbd2_block_tag_csum_verify(journal_t *j, journal_block_tag_t *tag,
+ void *buf, __u32 sequence)
+{
+ __u32 provided, calculated;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return 1;
+
+ sequence = cpu_to_be32(sequence);
+ calculated = jbd2_chksum(j, j->j_uuid_crc, (__u8 *)&sequence,
+ sizeof(sequence));
+ calculated = jbd2_chksum(j, calculated, buf, j->j_blocksize);
+ provided = be32_to_cpu(tag->t_checksum);
+
+ return provided == cpu_to_be32(calculated);
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -566,6 +583,19 @@ static int do_one_pass(journal_t *journal,
goto skip_write;
}

+ /* Look for block corruption */
+ if (!jbd2_block_tag_csum_verify(
+ journal, tag, obh->b_data,
+ be32_to_cpu(tmp->h_sequence))) {
+ brelse(obh);
+ success = -EIO;
+ printk(KERN_ERR "JBD: Invalid "
+ "checksum recovering "
+ "block %llu in log\n",
+ blocknr);
+ continue;
+ }
+
/* Find a buffer for the new
* data being restored */
nbh = __getblk(journal->j_fs_dev,
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 633fea2..098b30b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -190,6 +190,7 @@ typedef struct journal_block_tag_s
__be32 t_blocknr; /* The on-disk block number */
__be32 t_flags; /* See below */
__be32 t_blocknr_high; /* most-significant high 32bits. */
+ __be32 t_checksum; /* crc32c(uuid+seq+block) */
} journal_block_tag_t;

#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))


2011-10-08 07:59:14

by djwong

[permalink] [raw]
Subject: [PATCH 27/28] jbd2: Checksum commit blocks

Calculate and verify the checksum of commit blocks. In checksum v2, deprecate
most of the checksum v1 commit block checksum fields, since each block has its
own checksum.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/commit.c | 21 ++++++++++++++++++++-
fs/jbd2/recovery.c | 31 +++++++++++++++++++++++++++++++
include/linux/jbd2.h | 11 +++++++++++
3 files changed, 62 insertions(+), 1 deletions(-)


diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index e2c355c..2bce8dc 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -86,6 +86,24 @@ nope:
__brelse(bh);
}

+static void jbd2_commit_block_csum_set(journal_t *j,
+ struct journal_head *descriptor)
+{
+ struct commit_header *h;
+ __u32 crc;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return;
+
+ h = (struct commit_header *)(jh2bh(descriptor)->b_data);
+ h->h_chksum_type = 0;
+ h->h_chksum_size = 0;
+ h->h_chksum[0] = 0;
+ crc = jbd2_chksum(j, j->j_uuid_crc, jh2bh(descriptor)->b_data,
+ j->j_blocksize);
+ h->h_chksum[0] = cpu_to_be32(crc);
+}
+
/*
* Done it all: now submit the commit record. We should have
* cleaned up our previous buffers by now, so if we are in abort
@@ -129,6 +147,7 @@ static int journal_submit_commit_record(journal_t *journal,
tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
}
+ jbd2_commit_block_csum_set(journal, descriptor);

JBUFFER_TRACE(descriptor, "submit commit block");
lock_buffer(bh);
@@ -351,7 +370,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
struct blk_plug plug;
int csum_size = 0;

- if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
csum_size = sizeof(struct jbd2_journal_block_tail);

/*
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index ecb1eda..3db9625 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -372,6 +372,24 @@ static int calc_chksums(journal_t *journal, struct buffer_head *bh,
return 0;
}

+static int jbd2_commit_block_csum_verify(journal_t *j, void *buf)
+{
+ struct commit_header *h;
+ __u32 provided, calculated;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return 1;
+
+ h = buf;
+ provided = h->h_chksum[0];
+ h->h_chksum[0] = 0;
+ calculated = jbd2_chksum(j, j->j_uuid_crc, buf, j->j_blocksize);
+ h->h_chksum[0] = provided;
+
+ provided = be32_to_cpu(provided);
+ return provided == calculated;
+}
+
static int do_one_pass(journal_t *journal,
struct recovery_info *info, enum passtype pass)
{
@@ -682,6 +700,19 @@ static int do_one_pass(journal_t *journal,
}
crc32_sum = ~0;
}
+ if (pass == PASS_SCAN &&
+ !jbd2_commit_block_csum_verify(journal,
+ bh->b_data)) {
+ info->end_transaction = next_commit_ID;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)) {
+ journal->j_failed_commit =
+ next_commit_ID;
+ brelse(bh);
+ break;
+ }
+ }
brelse(bh);
next_commit_ID++;
continue;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 69e305d..633fea2 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -155,6 +155,17 @@ typedef struct journal_header_s
#define JBD2_CHECKSUM_BYTES (32 / sizeof(u32))
/*
* Commit block header for storing transactional checksums:
+ *
+ * NOTE: If FEATURE_COMPAT_CHECKSUM (checksum v1) is set, the h_chksum*
+ * fields are used to store a checksum of the descriptor and data blocks.
+ *
+ * If FEATURE_INCOMPAT_CSUM_V2 (checksum v2) is set, then the h_chksum
+ * field is used to store crc32c(uuid+commit_block). Each journal metadata
+ * block gets its own checksum, and data block checksums are stored in
+ * journal_block_tag (in the descriptor). The other h_chksum* fields are
+ * not used.
+ *
+ * Checksum v1 and v2 are mutually exclusive features.
*/
struct commit_header {
__be32 h_magic;


2011-10-08 07:56:08

by djwong

[permalink] [raw]
Subject: [PATCH 22/28] jbd2: Update structure definitions and flags to support extended checksumming

Add structure definitions, flags, and constants to extend checksum protection
to all journal blocks.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/super.c | 55 ++++++++++++++++++++++++++++++++++++++------------
fs/jbd2/journal.c | 50 +++++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 8 ++++++-
3 files changed, 99 insertions(+), 14 deletions(-)


diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 5538fb7..a715b3e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3141,6 +3141,44 @@ static void ext4_destroy_lazyinit_thread(void)
kthread_stop(ext4_lazyinit_task);
}

+static int set_journal_csum_feature_set(struct super_block *sb)
+{
+ int ret = 1;
+ int compat, incompat;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
+ /* journal checksum v2 */
+ compat = 0;
+ incompat = JBD2_FEATURE_INCOMPAT_CSUM_V2;
+ } else {
+ /* journal checksum v1 */
+ compat = JBD2_FEATURE_COMPAT_CHECKSUM;
+ incompat = 0;
+ }
+
+ if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
+ ret = jbd2_journal_set_features(sbi->s_journal,
+ compat, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
+ incompat);
+ } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
+ ret = jbd2_journal_set_features(sbi->s_journal,
+ compat, 0,
+ incompat);
+ jbd2_journal_clear_features(sbi->s_journal, 0, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ } else {
+ jbd2_journal_clear_features(sbi->s_journal,
+ JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT |
+ JBD2_FEATURE_INCOMPAT_CSUM_V2);
+ }
+
+ return ret;
+}
+
static int ext4_fill_super(struct super_block *sb, void *data, int silent)
__releases(kernel_lock)
__acquires(kernel_lock)
@@ -3667,19 +3705,10 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto failed_mount_wq;
}

- if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
- jbd2_journal_set_features(sbi->s_journal,
- JBD2_FEATURE_COMPAT_CHECKSUM, 0,
- JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
- } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
- jbd2_journal_set_features(sbi->s_journal,
- JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
- jbd2_journal_clear_features(sbi->s_journal, 0, 0,
- JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
- } else {
- jbd2_journal_clear_features(sbi->s_journal,
- JBD2_FEATURE_COMPAT_CHECKSUM, 0,
- JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+ if (!set_journal_csum_feature_set(sb)) {
+ ext4_msg(sb, KERN_ERR, "Failed to set journal checksum "
+ "feature set");
+ goto failed_mount_wq;
}

/* We have now updated the journal if required, so we can
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index f24df13..8d8985e 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -100,6 +100,15 @@ static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
static void __journal_abort_soft (journal_t *journal, int errno);
static int jbd2_journal_create_slab(size_t slab_size);

+/* Checksumming functions */
+int jbd2_verify_csum_type(journal_t *j, journal_superblock_t *sb)
+{
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return 1;
+
+ return sb->s_checksum_type == JBD2_CRC32C_CHKSUM;
+}
+
/*
* Helper function used to manage commit timeouts
*/
@@ -1222,6 +1231,9 @@ static int journal_get_superblock(journal_t *journal)
}
}

+ if (buffer_verified(bh))
+ return 0;
+
sb = journal->j_superblock;

err = -EINVAL;
@@ -1251,6 +1263,21 @@ static int journal_get_superblock(journal_t *journal)
goto out;
}

+ if (JBD2_HAS_COMPAT_FEATURE(journal, JBD2_FEATURE_COMPAT_CHECKSUM) &&
+ JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
+ /* Can't have checksum v1 and v2 on at the same time! */
+ printk(KERN_ERR "JBD: Can't enable checksumming v1 and v2 "
+ "at the same time!\n");
+ goto out;
+ }
+
+ if (!jbd2_verify_csum_type(journal, sb)) {
+ printk(KERN_ERR "JBD: Unknown checksum type\n");
+ goto out;
+ }
+
+ set_buffer_verified(bh);
+
return 0;

out:
@@ -1494,6 +1521,10 @@ int jbd2_journal_check_available_features (journal_t *journal, unsigned long com
int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
unsigned long ro, unsigned long incompat)
{
+#define INCOMPAT_FEATURE_ON(f) \
+ ((incompat & (f)) && !(sb->s_feature_incompat & cpu_to_be32(f)))
+#define COMPAT_FEATURE_ON(f) \
+ ((compat & (f)) && !(sb->s_feature_compat & cpu_to_be32(f)))
journal_superblock_t *sb;

if (jbd2_journal_check_used_features(journal, compat, ro, incompat))
@@ -1502,16 +1533,35 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
if (!jbd2_journal_check_available_features(journal, compat, ro, incompat))
return 0;

+ /* Asking for checksumming v2 and v1? Only give them v2. */
+ if (incompat & JBD2_FEATURE_INCOMPAT_CSUM_V2 &&
+ compat & JBD2_FEATURE_COMPAT_CHECKSUM)
+ compat &= ~JBD2_FEATURE_COMPAT_CHECKSUM;
+
jbd_debug(1, "Setting new features 0x%lx/0x%lx/0x%lx\n",
compat, ro, incompat);

sb = journal->j_superblock;

+ /* If enabling v2 checksums, update superblock */
+ if (INCOMPAT_FEATURE_ON(JBD2_FEATURE_INCOMPAT_CSUM_V2)) {
+ sb->s_checksum_type = JBD2_CRC32C_CHKSUM;
+ sb->s_feature_compat &=
+ ~cpu_to_be32(JBD2_FEATURE_COMPAT_CHECKSUM);
+ }
+
+ /* If enabling v1 checksums, downgrade superblock */
+ if (COMPAT_FEATURE_ON(JBD2_FEATURE_COMPAT_CHECKSUM))
+ sb->s_feature_incompat &=
+ ~cpu_to_be32(JBD2_FEATURE_INCOMPAT_CSUM_V2);
+
sb->s_feature_compat |= cpu_to_be32(compat);
sb->s_feature_ro_compat |= cpu_to_be32(ro);
sb->s_feature_incompat |= cpu_to_be32(incompat);

return 1;
+#undef COMPAT_FEATURE_ON
+#undef INCOMPAT_FEATURE_ON
}

/*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 3284706..69ca172 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -147,6 +147,7 @@ typedef struct journal_header_s
#define JBD2_CRC32_CHKSUM 1
#define JBD2_MD5_CHKSUM 2
#define JBD2_SHA1_CHKSUM 3
+#define JBD2_CRC32C_CHKSUM 4

#define JBD2_CRC32_CHKSUM_SIZE 4

@@ -263,13 +264,15 @@ typedef struct journal_superblock_s
#define JBD2_FEATURE_INCOMPAT_REVOKE 0x00000001
#define JBD2_FEATURE_INCOMPAT_64BIT 0x00000002
#define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT 0x00000004
+#define JBD2_FEATURE_INCOMPAT_CSUM_V2 0x00000008

/* Features known to this kernel version: */
#define JBD2_KNOWN_COMPAT_FEATURES JBD2_FEATURE_COMPAT_CHECKSUM
#define JBD2_KNOWN_ROCOMPAT_FEATURES 0
#define JBD2_KNOWN_INCOMPAT_FEATURES (JBD2_FEATURE_INCOMPAT_REVOKE | \
JBD2_FEATURE_INCOMPAT_64BIT | \
- JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT)
+ JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
+ JBD2_FEATURE_INCOMPAT_CSUM_V2)

#ifdef __KERNEL__

@@ -1004,6 +1007,9 @@ struct journal_s
* superblock pointer here
*/
void *j_private;
+
+ /* Precomputed UUID checksum */
+ __u32 j_uuid_crc;
};

/*

2011-10-08 07:56:35

by djwong

[permalink] [raw]
Subject: [PATCH 26/28] jbd2: Checksum descriptor blocks

Calculate and verify a checksum of each descriptor block.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/commit.c | 26 ++++++++++++++++++++++++--
fs/jbd2/recovery.c | 37 ++++++++++++++++++++++++++++++++++++-
include/linux/jbd2.h | 5 +++++
3 files changed, 65 insertions(+), 3 deletions(-)


diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index eef6979..e2c355c 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -302,6 +302,24 @@ static void write_tag_block(int tag_bytes, journal_block_tag_t *tag,
tag->t_blocknr_high = cpu_to_be32((block >> 31) >> 1);
}

+static void jbd2_descr_block_csum_set(journal_t *j,
+ struct journal_head *descriptor)
+{
+ struct jbd2_journal_block_tail *tail;
+ __u32 crc;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return;
+
+ tail = (struct jbd2_journal_block_tail *)
+ (jh2bh(descriptor)->b_data + j->j_blocksize -
+ sizeof(struct jbd2_journal_block_tail));
+ tail->t_checksum = 0;
+ crc = jbd2_chksum(j, j->j_uuid_crc, jh2bh(descriptor)->b_data,
+ j->j_blocksize);
+ tail->t_checksum = cpu_to_be32(crc);
+}
+
/*
* jbd2_journal_commit_transaction
*
@@ -331,6 +349,10 @@ void jbd2_journal_commit_transaction(journal_t *journal)
struct buffer_head *cbh = NULL; /* For transactional checksums */
__u32 crc32_sum = ~0;
struct blk_plug plug;
+ int csum_size = 0;
+
+ if (JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ csum_size = sizeof(struct jbd2_journal_block_tail);

/*
* First job: lock down the current transaction and wait for
@@ -623,7 +645,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)

if (bufs == journal->j_wbufsize ||
commit_transaction->t_buffers == NULL ||
- space_left < tag_bytes + 16) {
+ space_left < tag_bytes + 16 + csum_size) {

jbd_debug(4, "JBD: Submit %d IOs\n", bufs);

@@ -632,7 +654,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
the last tag we set up. */

tag->t_flags |= cpu_to_be32(JBD2_FLAG_LAST_TAG);
-
+ jbd2_descr_block_csum_set(journal, descriptor);
start_journal_io:
for (i = 0; i < bufs; i++) {
struct buffer_head *bh = wbuf[i];
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 4f4eda4..ecb1eda 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -173,6 +173,25 @@ static int jread(struct buffer_head **bhp, journal_t *journal,
return 0;
}

+static int jbd2_descr_block_csum_verify(journal_t *j,
+ void *buf)
+{
+ struct jbd2_journal_block_tail *tail;
+ __u32 provided, calculated;
+
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return 1;
+
+ tail = (struct jbd2_journal_block_tail *)(buf + j->j_blocksize -
+ sizeof(struct jbd2_journal_block_tail));
+ provided = tail->t_checksum;
+ tail->t_checksum = 0;
+ calculated = jbd2_chksum(j, j->j_uuid_crc, buf, j->j_blocksize);
+ tail->t_checksum = provided;
+
+ provided = be32_to_cpu(provided);
+ return provided == calculated;
+}

/*
* Count the number of in-use tags in a journal descriptor block.
@@ -185,6 +204,9 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
int nr = 0, size = journal->j_blocksize;
int tag_bytes = journal_tag_bytes(journal);

+ if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ size -= sizeof(struct jbd2_journal_block_tail);
+
tagp = &bh->b_data[sizeof(journal_header_t)];

while ((tagp - bh->b_data + tag_bytes) <= size) {
@@ -363,6 +385,7 @@ static int do_one_pass(journal_t *journal,
int blocktype;
int tag_bytes = journal_tag_bytes(journal);
__u32 crc32_sum = ~0; /* Transactional Checksums */
+ int descr_csum_size = 0;

/*
* First thing is to establish what we expect to find in the log
@@ -448,6 +471,18 @@ static int do_one_pass(journal_t *journal,

switch(blocktype) {
case JBD2_DESCRIPTOR_BLOCK:
+ /* Verify checksum first */
+ if (JBD2_HAS_INCOMPAT_FEATURE(journal,
+ JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ descr_csum_size =
+ sizeof(struct jbd2_journal_block_tail);
+ if (descr_csum_size > 0 &&
+ !jbd2_descr_block_csum_verify(journal,
+ bh->b_data)) {
+ err = -EIO;
+ goto failed;
+ }
+
/* If it is a valid descriptor block, replay it
* in pass REPLAY; if journal_checksums enabled, then
* calculate checksums in PASS_SCAN, otherwise,
@@ -478,7 +513,7 @@ static int do_one_pass(journal_t *journal,

tagp = &bh->b_data[sizeof(journal_header_t)];
while ((tagp - bh->b_data + tag_bytes)
- <= journal->j_blocksize) {
+ <= journal->j_blocksize - descr_csum_size) {
unsigned long io_block;

tag = (journal_block_tag_t *) tagp;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 125d5be..69e305d 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -184,6 +184,11 @@ typedef struct journal_block_tag_s
#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))

+/* Tail of descriptor block, for checksumming */
+struct jbd2_journal_block_tail {
+ __be32 t_checksum; /* crc32c(uuid+descr_block) */
+};
+
/*
* The revoke descriptor: used on disk to describe a series of blocks to
* be revoked from the log

2011-10-08 07:59:17

by djwong

[permalink] [raw]
Subject: [PATCH 24/28] jbd2: Update structure definitions and flags to support extended checksumming

Add structure definitions, flags, and constants to extend checksum protection
to all journal blocks.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/journal.c | 37 +++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 5 ++++-
2 files changed, 41 insertions(+), 1 deletions(-)


diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 5c76a8b..d494c14 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -109,6 +109,34 @@ int jbd2_verify_csum_type(journal_t *j, journal_superblock_t *sb)
return sb->s_checksum_type == JBD2_CRC32C_CHKSUM;
}

+static __u32 jbd2_superblock_csum(journal_t *j, journal_superblock_t *sb)
+{
+ __u32 crc, old_crc;
+
+ old_crc = sb->s_checksum;
+ sb->s_checksum = 0;
+ crc = jbd2_chksum(j, ~0, (char *)sb, sizeof(journal_superblock_t));
+ sb->s_checksum = old_crc;
+
+ return cpu_to_be32(crc);
+}
+
+int jbd2_superblock_csum_verify(journal_t *j, journal_superblock_t *sb)
+{
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return 1;
+
+ return sb->s_checksum == jbd2_superblock_csum(j, sb);
+}
+
+void jbd2_superblock_csum_set(journal_t *j, journal_superblock_t *sb)
+{
+ if (!JBD2_HAS_INCOMPAT_FEATURE(j, JBD2_FEATURE_INCOMPAT_CSUM_V2))
+ return;
+
+ sb->s_checksum = jbd2_superblock_csum(j, sb);
+}
+
/*
* Helper function used to manage commit timeouts
*/
@@ -1178,6 +1206,7 @@ void jbd2_journal_update_superblock(journal_t *journal, int wait)
sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
sb->s_start = cpu_to_be32(journal->j_tail);
sb->s_errno = cpu_to_be32(journal->j_errno);
+ jbd2_superblock_csum_set(journal, sb);
read_unlock(&journal->j_state_lock);

BUFFER_TRACE(bh, "marking dirty");
@@ -1286,6 +1315,11 @@ static int journal_get_superblock(journal_t *journal)
}
}

+ if (!jbd2_superblock_csum_verify(journal, sb)) {
+ printk(KERN_ERR "JBD: journal checksum error\n");
+ goto out;
+ }
+
set_buffer_verified(bh);

return 0;
@@ -1580,6 +1614,7 @@ int jbd2_journal_set_features (journal_t *journal, unsigned long compat,
sb->s_feature_compat |= cpu_to_be32(compat);
sb->s_feature_ro_compat |= cpu_to_be32(ro);
sb->s_feature_incompat |= cpu_to_be32(incompat);
+ jbd2_journal_update_superblock(journal, 0);

return 1;
#undef COMPAT_FEATURE_ON
@@ -1610,6 +1645,7 @@ void jbd2_journal_clear_features(journal_t *journal, unsigned long compat,
sb->s_feature_compat &= ~cpu_to_be32(compat);
sb->s_feature_ro_compat &= ~cpu_to_be32(ro);
sb->s_feature_incompat &= ~cpu_to_be32(incompat);
+ jbd2_journal_update_superblock(journal, 0);
}
EXPORT_SYMBOL(jbd2_journal_clear_features);

@@ -1658,6 +1694,7 @@ static int journal_convert_superblock_v1(journal_t *journal,

sb->s_nr_users = cpu_to_be32(1);
sb->s_header.h_blocktype = cpu_to_be32(JBD2_SUPERBLOCK_V2);
+ jbd2_superblock_csum_set(journal, sb);
journal->j_format_version = 2;

bh = journal->j_sb_buffer;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 53df865..be44094 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -243,7 +243,10 @@ typedef struct journal_superblock_s
__be32 s_max_trans_data; /* Limit of data blocks per trans. */

/* 0x0050 */
- __u32 s_padding[44];
+ __u8 s_checksum_type; /* checksum type */
+ __u8 s_padding2[3];
+ __u32 s_padding[42];
+ __be32 s_checksum; /* crc32c(superblock) */

/* 0x0100 */
__u8 s_users[16*48]; /* ids of all fs'es sharing the log */


2011-10-08 07:54:49

by djwong

[permalink] [raw]
Subject: [PATCH 10/28] ext4: Calculate and verify superblock checksum

Calculate and verify the superblock checksum. Since the UUID and block group
number are embedded in each copy of the superblock, we need only checksum the
entire block. Refactor some of the code to eliminate open-coding of the
checksum update call.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 11 ++++++++++-
fs/ext4/ext4_jbd2.c | 9 ++++++++-
fs/ext4/ext4_jbd2.h | 7 +++++--
fs/ext4/inode.c | 3 +--
fs/ext4/namei.c | 4 ++--
fs/ext4/resize.c | 6 +++++-
fs/ext4/super.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 78 insertions(+), 9 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9012911..c99e44c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1070,7 +1070,9 @@ struct ext4_super_block {
__u8 s_last_error_func[32]; /* function where the error happened */
#define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
__u8 s_mount_opts[64];
- __le32 s_reserved[112]; /* Padding to the end of the block */
+ __u32 s_reserved1[3]; /* Padding */
+ __le32 s_reserved2[108]; /* Padding to the end of the block */
+ __le32 s_checksum; /* crc32c(superblock) */
};

#define EXT4_S_ERR_LEN (EXT4_S_ERR_END - EXT4_S_ERR_START)
@@ -1906,6 +1908,10 @@ extern int ext4_group_extend(struct super_block *sb,
ext4_fsblk_t n_blocks_count);

/* super.c */
+extern int ext4_superblock_csum_verify(struct super_block *sb,
+ struct ext4_super_block *es);
+extern void ext4_superblock_csum_set(struct super_block *sb,
+ struct ext4_super_block *es);
extern void *ext4_kvmalloc(size_t size, gfp_t flags);
extern void *ext4_kvzalloc(size_t size, gfp_t flags);
extern void ext4_kvfree(void *ptr);
@@ -2180,6 +2186,9 @@ static inline void ext4_unlock_group(struct super_block *sb,

static inline void ext4_mark_super_dirty(struct super_block *sb)
{
+ struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+
+ ext4_superblock_csum_set(sb, es);
if (EXT4_SB(sb)->s_journal == NULL)
sb->s_dirt =1;
}
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index f5240aa..04ddc97 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -136,16 +136,23 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
}

int __ext4_handle_dirty_super(const char *where, unsigned int line,
- handle_t *handle, struct super_block *sb)
+ handle_t *handle, struct super_block *sb,
+ int now)
{
struct buffer_head *bh = EXT4_SB(sb)->s_sbh;
int err = 0;

if (ext4_handle_valid(handle)) {
+ ext4_superblock_csum_set(sb,
+ (struct ext4_super_block *)bh->b_data);
err = jbd2_journal_dirty_metadata(handle, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ } else if (now) {
+ ext4_superblock_csum_set(sb,
+ (struct ext4_super_block *)bh->b_data);
+ mark_buffer_dirty(bh);
} else
sb->s_dirt = 1;
return err;
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5802fa1..ed9b78d 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -141,7 +141,8 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
struct buffer_head *bh);

int __ext4_handle_dirty_super(const char *where, unsigned int line,
- handle_t *handle, struct super_block *sb);
+ handle_t *handle, struct super_block *sb,
+ int now);

#define ext4_journal_get_write_access(handle, bh) \
__ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
@@ -153,8 +154,10 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define ext4_handle_dirty_metadata(handle, inode, bh) \
__ext4_handle_dirty_metadata(__func__, __LINE__, (handle), (inode), \
(bh))
+#define ext4_handle_dirty_super_now(handle, sb) \
+ __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb), 1)
#define ext4_handle_dirty_super(handle, sb) \
- __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))
+ __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb), 0)

handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks);
int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 986e238..6e64e0b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3700,8 +3700,7 @@ static int ext4_do_update_inode(handle_t *handle,
EXT4_FEATURE_RO_COMPAT_LARGE_FILE);
sb->s_dirt = 1;
ext4_handle_sync(handle);
- err = ext4_handle_dirty_metadata(handle, NULL,
- EXT4_SB(sb)->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, sb);
}
}
raw_inode->i_generation = cpu_to_le32(inode->i_generation);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 50c7294..d0158f2 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2014,7 +2014,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
/* Insert this inode at the head of the on-disk orphan list... */
NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
- err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, sb);
rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
if (!err)
err = rc;
@@ -2087,7 +2087,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (err)
goto out_brelse;
sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
- err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, inode->i_sb);
} else {
struct ext4_iloc iloc2;
struct inode *i_prev =
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 707d3f1..2ad7008 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -511,7 +511,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
ext4_kvfree(o_group_desc);

le16_add_cpu(&es->s_reserved_gdt_blocks, -1);
- err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, sb);
if (err)
ext4_std_error(sb, err);

@@ -682,6 +682,8 @@ static void update_backups(struct super_block *sb,
goto exit_err;
}

+ ext4_superblock_csum_set(sb, (struct ext4_super_block *)data);
+
while ((group = ext4_list_backups(sb, &three, &five, &seven)) < last) {
struct buffer_head *bh;

@@ -925,6 +927,8 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
/* Update the global fs size fields */
sbi->s_groups_count++;

+ ext4_superblock_csum_set(sb,
+ (struct ext4_super_block *)primary->b_data);
err = ext4_handle_dirty_metadata(handle, NULL, primary);
if (unlikely(err)) {
ext4_std_error(sb, err);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f02de28..58157ec 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -120,6 +120,38 @@ static int ext4_verify_csum_type(struct super_block *sb,
return es->s_checksum_type == EXT4_CRC32C_CHKSUM;
}

+static __le32 ext4_superblock_csum(struct super_block *sb,
+ struct ext4_super_block *es)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+ int offset = offsetof(struct ext4_super_block, s_checksum);
+ __u32 crc;
+
+ crc = ext4_chksum(sbi, ~0, (char *)es, offset);
+
+ return cpu_to_le32(crc);
+}
+
+int ext4_superblock_csum_verify(struct super_block *sb,
+ struct ext4_super_block *es)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ return es->s_checksum == ext4_superblock_csum(sb, es);
+}
+
+void ext4_superblock_csum_set(struct super_block *sb,
+ struct ext4_super_block *es)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ es->s_checksum = ext4_superblock_csum(sb, es);
+}
+
void *ext4_kvmalloc(size_t size, gfp_t flags)
{
void *ret;
@@ -3184,6 +3216,20 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
}
}

+ /* Check superblock checksum */
+ if (!ext4_superblock_csum_verify(sb, es)) {
+ ext4_msg(sb, KERN_ERR, "VFS: Found ext4 filesystem with "
+ "invalid superblock checksum. Run e2fsck?");
+ silent = 1;
+ goto cantfind_ext4;
+ }
+
+ /* Precompute first piece of crc */
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ sbi->s_uuid_crc = ext4_chksum(sbi, ~0, es->s_uuid,
+ sizeof(es->s_uuid));
+
/* Set defaults before we parse the mount options */
def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
set_opt(sb, INIT_INODE_TABLE);
@@ -4140,6 +4186,7 @@ static int ext4_commit_super(struct super_block *sb, int sync)
&EXT4_SB(sb)->s_freeinodes_counter));
sb->s_dirt = 0;
BUFFER_TRACE(sbh, "marking dirty");
+ ext4_superblock_csum_set(sb, es);
mark_buffer_dirty(sbh);
if (sync) {
error = sync_dirty_buffer(sbh);


2011-10-08 07:54:43

by djwong

[permalink] [raw]
Subject: [PATCH 09/28] ext4: Only call out to crc32c if necessary

Only obtain a reference to the cryptoapi and crc32c if we happen to mount a
filesystem with metadata checksumming enabled.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/Kconfig | 2 ++
fs/ext4/ext4.h | 23 +++++++++++++++++++++++
fs/ext4/super.c | 15 +++++++++++++++
3 files changed, 40 insertions(+), 0 deletions(-)


diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index 9ed1bb1..c22f170 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -2,6 +2,8 @@ config EXT4_FS
tristate "The Extended 4 (ext4) filesystem"
select JBD2
select CRC16
+ select CRYPTO
+ select CRYPTO_CRC32C
help
This is the next generation of the ext3 filesystem.

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a74e125..9012911 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -29,6 +29,7 @@
#include <linux/wait.h>
#include <linux/blockgroup_lock.h>
#include <linux/percpu_counter.h>
+#include <crypto/hash.h>
#ifdef __KERNEL__
#include <linux/compat.h>
#endif
@@ -1228,6 +1229,9 @@ struct ext4_sb_info {

/* Precomputed FS UUID checksum */
__u32 s_uuid_crc;
+
+ /* Reference to checksum algorithm driver via cryptoapi */
+ struct crypto_shash *s_chksum_driver;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1558,6 +1562,25 @@ static inline __le16 ext4_rec_len_to_disk(unsigned len, unsigned blocksize)
#define DX_HASH_HALF_MD4_UNSIGNED 4
#define DX_HASH_TEA_UNSIGNED 5

+static inline u32 ext4_chksum(struct ext4_sb_info *sbi, u32 crc,
+ const void *address, unsigned int length)
+{
+ struct {
+ struct shash_desc shash;
+ char ctx[crypto_shash_descsize(sbi->s_chksum_driver)];
+ } desc;
+ int err;
+
+ desc.shash.tfm = sbi->s_chksum_driver;
+ desc.shash.flags = 0;
+ *(u32 *)desc.ctx = crc;
+
+ err = crypto_shash_update(&desc.shash, address, length);
+ BUG_ON(err);
+
+ return *(u32 *)desc.ctx;
+}
+
#ifdef __KERNEL__

/* hash info structure used by the directory hash */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ac33e45..f02de28 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -870,6 +870,8 @@ static void ext4_put_super(struct super_block *sb)
unlock_super(sb);
kobject_put(&sbi->s_kobj);
wait_for_completion(&sbi->s_kobj_unregister);
+ if (sbi->s_chksum_driver)
+ crypto_free_shash(sbi->s_chksum_driver);
kfree(sbi->s_blockgroup_lock);
kfree(sbi);
}
@@ -3171,6 +3173,17 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto cantfind_ext4;
}

+ /* Load the checksum driver */
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
+ sbi->s_chksum_driver = crypto_alloc_shash("crc32c", 0, 0);
+ if (IS_ERR(sbi->s_chksum_driver)) {
+ ret = PTR_ERR(sbi->s_chksum_driver);
+ sbi->s_chksum_driver = NULL;
+ goto failed_mount;
+ }
+ }
+
/* Set defaults before we parse the mount options */
def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
set_opt(sb, INIT_INODE_TABLE);
@@ -3803,6 +3816,8 @@ failed_mount2:
brelse(sbi->s_group_desc[i]);
ext4_kvfree(sbi->s_group_desc);
failed_mount:
+ if (sbi->s_chksum_driver)
+ crypto_free_shash(sbi->s_chksum_driver);
if (sbi->s_proc) {
remove_proc_entry(sb->s_id, ext4_proc_root);
}

2011-10-08 07:54:36

by djwong

[permalink] [raw]
Subject: [PATCH 08/28] ext4: Record the checksum algorithm in use in the superblock

Record the type of checksum algorithm we're using for metadata in the
superblock, in case we ever want/need to change the algorithm.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 5 ++++-
fs/ext4/super.c | 18 ++++++++++++++++++
2 files changed, 22 insertions(+), 1 deletions(-)


diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8ee8646..a74e125 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -957,6 +957,9 @@ extern void ext4_set_bits(void *bm, int cur, int len);
#define EXT4_ERRORS_PANIC 3 /* Panic */
#define EXT4_ERRORS_DEFAULT EXT4_ERRORS_CONTINUE

+/* Metadata checksum algorithm codes */
+#define EXT4_CRC32C_CHKSUM 1
+
/*
* Structure of the super block
*/
@@ -1043,7 +1046,7 @@ struct ext4_super_block {
__le64 s_mmp_block; /* Block for multi-mount protection */
__le32 s_raid_stripe_width; /* blocks on all data disks (N*stride)*/
__u8 s_log_groups_per_flex; /* FLEX_BG group size */
- __u8 s_reserved_char_pad;
+ __u8 s_checksum_type; /* metadata checksum algorithm used */
__le16 s_reserved_pad;
__le64 s_kbytes_written; /* nr of lifetime kilobytes written */
__le32 s_snapshot_inum; /* Inode number of active snapshot */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 44d0c8d..ac33e45 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -110,6 +110,16 @@ static struct file_system_type ext3_fs_type = {
#define IS_EXT3_SB(sb) (0)
#endif

+static int ext4_verify_csum_type(struct super_block *sb,
+ struct ext4_super_block *es)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ return es->s_checksum_type == EXT4_CRC32C_CHKSUM;
+}
+
void *ext4_kvmalloc(size_t size, gfp_t flags)
{
void *ret;
@@ -3153,6 +3163,14 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
goto cantfind_ext4;
sbi->s_kbytes_written = le64_to_cpu(es->s_kbytes_written);

+ /* Check for a known checksum algorithm */
+ if (!ext4_verify_csum_type(sb, es)) {
+ ext4_msg(sb, KERN_ERR, "VFS: Found ext4 filesystem with "
+ "unknown checksum algorithm.");
+ silent = 1;
+ goto cantfind_ext4;
+ }
+
/* Set defaults before we parse the mount options */
def_mount_opts = le32_to_cpu(es->s_default_mount_opts);
set_opt(sb, INIT_INODE_TABLE);

2011-10-12 19:45:01

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 11/28] ext4: Calculate and verify inode checksums

On 2011-10-08, at 12:54 AM, Darrick J. Wong wrote:
> This patch introduces to ext4 the ability to calculate and verify inode
> checksums. This requires the use of a new ro compatibility flag and some
> accompanying e2fsprogs patches to provide the relevant features in tune2fs and
> e2fsck.
>
> +static __u32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw,
> + struct ext4_inode_info *ei)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + __u16 crc_lo;
> + __u16 crc_hi = 0;
> + __u32 crc;
> +
> + crc_lo = raw->i_checksum_lo;
> + raw->i_checksum_lo = 0;
> + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi)) {
> + crc_hi = raw->i_checksum_hi;
> + raw->i_checksum_hi = 0;
> + }
> +
> + crc = ext4_chksum(sbi, ei->i_uuid_inum_crc, (__u8 *)raw,
> + EXT4_INODE_SIZE(inode->i_sb));
> +
> + raw->i_checksum_lo = crc_lo;
> + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
> + raw->i_checksum_hi = crc_hi;
> +
> + return crc;
> +}

This computes both the _lo and _hi parts of the checksum and overwrites what is in the inode...

> +static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
> + struct ext4_inode_info *ei)
> +{
> + __u32 provided, calculated;
> +
> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> + cpu_to_le32(EXT4_OS_LINUX) ||
> + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> + return 1;
> +
> + provided = le16_to_cpu(raw->i_checksum_lo);
> + calculated = ext4_inode_csum(inode, raw, ei);

This only saves the _lo part of the checksum before computing the new
checksum (which overwrites both _lo and _hi fields), so the _hi part
of the checksum is never properly validated below.

> + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
> + provided |= ((__u32)le16_to_cpu(raw->i_checksum_hi)) << 16;

This should be moved up to save _hi before calling ext4_inode_csum().

> + else
> + calculated &= 0xFFFF;
> +
> + return provided == calculated;
> +}

> +static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
> + struct ext4_inode_info *ei)
> +{
> + __u32 crc;
> +
> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> + cpu_to_le32(EXT4_OS_LINUX) ||
> + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> + return;
> +
> + crc = ext4_inode_csum(inode, raw, ei);
> + raw->i_checksum_lo = cpu_to_le16(crc & 0xFFFF);
> + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
> + raw->i_checksum_hi = cpu_to_le16(crc >> 16);

What is the point of storing the returned crc into raw->i_checksum_lo
and raw->i_checksum_hi, if this is done internal to ext4_inode_csum()
already?

Also, would it be better to call the temporary variable "csum" instead
of "crc", since we may use something other than crc32c as the hash
function in the future.

Cheers, Andreas

2011-10-12 19:51:31

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 12/28] ext4: Use i_generation in inode-related metadata checksums

On 2011-10-08, at 12:55 AM, Darrick J. Wong wrote:
> Whenever we are calculating a checksum for a piece of metadata that is
> associated with an inode, incorporate i_generation into that calculation
> so that old metadata blocks cannot be re-associated after a delete/create cycle.

It would be better to fold this into the previous patch, since it will
otherwise cause the inode checksum algorithm to change in the middle of the patch series.

On a related note, in ext4_new_inode() it would be useful to change the
setting of i_generation so that it skips i_generation == 0, which doesn't
contribute to the checksum:

spin_lock(&sbi->s_next_gen_lock);
inode->i_generation = sbi->s_next_generation++;
+ if (unlikely(inode->i_generation == 0))
+ inode->i_generation = sbi->s_next_generation++;
spin_unlock(&sbi->s_next_gen_lock);

> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 6e5876a..d4b59e9 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1031,11 +1031,14 @@ got:
> /* Precompute second piece of crc */
> if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
> EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
> + __u32 crc;
> struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> __le32 inum = cpu_to_le32(inode->i_ino);
> - ei->i_uuid_inum_crc = ext4_chksum(sbi, sbi->s_uuid_crc,
> - (__u8 *)&inum,
> - sizeof(inum));
> + __le32 gen = cpu_to_le32(inode->i_generation);
> + crc = ext4_chksum(sbi, sbi->s_uuid_crc, (__u8 *)&inum,
> + sizeof(inum));
> + ei->i_uuid_inum_crc = ext4_chksum(sbi, crc, (__u8 *)&gen,
> + sizeof(gen));
> }
>
> ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
>
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index f18bfe3..fdf0b1e 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -149,6 +149,10 @@ flags_out:
> if (!inode_owner_or_capable(inode))
> return -EPERM;
>
> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> + return -ENOTTY;

This should get an ext4_warning() in the non-checksum case to warn
users that this ioctl is deprecated and will be removed in the
future unless there is a good reason to keep it.

Cheers, Andreas






2011-10-12 21:03:42

by djwong

[permalink] [raw]
Subject: Re: [PATCH 11/28] ext4: Calculate and verify inode checksums

On Wed, Oct 12, 2011 at 12:45:01PM -0700, Andreas Dilger wrote:
> On 2011-10-08, at 12:54 AM, Darrick J. Wong wrote:
> > This patch introduces to ext4 the ability to calculate and verify inode
> > checksums. This requires the use of a new ro compatibility flag and some
> > accompanying e2fsprogs patches to provide the relevant features in tune2fs and
> > e2fsck.
> >
> > +static __u32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw,
> > + struct ext4_inode_info *ei)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > + __u16 crc_lo;
> > + __u16 crc_hi = 0;
> > + __u32 crc;
> > +
> > + crc_lo = raw->i_checksum_lo;
> > + raw->i_checksum_lo = 0;
> > + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> > + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi)) {
> > + crc_hi = raw->i_checksum_hi;
> > + raw->i_checksum_hi = 0;
> > + }
> > +
> > + crc = ext4_chksum(sbi, ei->i_uuid_inum_crc, (__u8 *)raw,
> > + EXT4_INODE_SIZE(inode->i_sb));
> > +
> > + raw->i_checksum_lo = crc_lo;
> > + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> > + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
> > + raw->i_checksum_hi = crc_hi;
> > +
> > + return crc;
> > +}
>
> This computes both the _lo and _hi parts of the checksum and overwrites what
> is in the inode...

I don't follow your logic ... for the _lo component, first I save the old
i_checksum_lo contents in crc_lo. Then I stuff zero into i_checksum_lo. Next
I perform the checksum computation (with the checksum field effectively "zero")
and put the results into crc. Then I copy whatever I saved in crc_lo back into
i_checksum_lo.

crc_lo, crc_hi, and crc are three separate variables, and neither crc_lo nor
crc_hi are ever assigned any part of crc. Therefore crc_lo and crc_hi should
always contain the old checksum contents.

Did I miss something? Afaict the contents of raw should be the same before and
after the call to ext4_inode_csum(), but maybe I've been looking at this too
long. :)

> > +static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
> > + struct ext4_inode_info *ei)
> > +{
> > + __u32 provided, calculated;
> > +
> > + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> > + cpu_to_le32(EXT4_OS_LINUX) ||
> > + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return 1;
> > +
> > + provided = le16_to_cpu(raw->i_checksum_lo);
> > + calculated = ext4_inode_csum(inode, raw, ei);
>
> This only saves the _lo part of the checksum before computing the new
> checksum (which overwrites both _lo and _hi fields), so the _hi part
> of the checksum is never properly validated below.
>
> > + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> > + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
> > + provided |= ((__u32)le16_to_cpu(raw->i_checksum_hi)) << 16;
>
> This should be moved up to save _hi before calling ext4_inode_csum().
>
> > + else
> > + calculated &= 0xFFFF;
> > +
> > + return provided == calculated;
> > +}
>
> > +static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
> > + struct ext4_inode_info *ei)
> > +{
> > + __u32 crc;
> > +
> > + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> > + cpu_to_le32(EXT4_OS_LINUX) ||
> > + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return;
> > +
> > + crc = ext4_inode_csum(inode, raw, ei);
> > + raw->i_checksum_lo = cpu_to_le16(crc & 0xFFFF);
> > + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
> > + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
> > + raw->i_checksum_hi = cpu_to_le16(crc >> 16);
>
> What is the point of storing the returned crc into raw->i_checksum_lo
> and raw->i_checksum_hi, if this is done internal to ext4_inode_csum()
> already?

It shouldn't be doing that (see above).

> Also, would it be better to call the temporary variable "csum" instead
> of "crc", since we may use something other than crc32c as the hash
> function in the future.

I suppose.

--D


2011-10-12 21:35:14

by djwong

[permalink] [raw]
Subject: Re: [PATCH 12/28] ext4: Use i_generation in inode-related metadata checksums

On Wed, Oct 12, 2011 at 12:52:30PM -0700, Andreas Dilger wrote:
> On 2011-10-08, at 12:55 AM, Darrick J. Wong wrote:
> > Whenever we are calculating a checksum for a piece of metadata that is
> > associated with an inode, incorporate i_generation into that calculation
> > so that old metadata blocks cannot be re-associated after a delete/create cycle.
>
> It would be better to fold this into the previous patch, since it will
> otherwise cause the inode checksum algorithm to change in the middle of the patch series.

Sure. I guess I can go do that on the e2fsprogs side too.

> On a related note, in ext4_new_inode() it would be useful to change the
> setting of i_generation so that it skips i_generation == 0, which doesn't
> contribute to the checksum:
>
> spin_lock(&sbi->s_next_gen_lock);
> inode->i_generation = sbi->s_next_generation++;
> + if (unlikely(inode->i_generation == 0))
> + inode->i_generation = sbi->s_next_generation++;
> spin_unlock(&sbi->s_next_gen_lock);

For that to happen the UUID+inode would have to checksum to zero, which seems
fairly unlikely since I set the "initial" value in ext4_chksum to ~0 to avoid
the case where UUID = 0, and there's no such thing as inode 0, which means thee
likelihood of the checksum being 0 at this point ought to be 1:2^32. Even
then, the checksum will be different if the value of i_generation changes.

That said, if we /do/ implement this, then perhaps e2fsprogs should be taught
to set i_generation, as it does not do that currently.

> > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> > index 6e5876a..d4b59e9 100644
> > --- a/fs/ext4/ialloc.c
> > +++ b/fs/ext4/ialloc.c
> > @@ -1031,11 +1031,14 @@ got:
> > /* Precompute second piece of crc */
> > if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
> > EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)) {
> > + __u32 crc;
> > struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > __le32 inum = cpu_to_le32(inode->i_ino);
> > - ei->i_uuid_inum_crc = ext4_chksum(sbi, sbi->s_uuid_crc,
> > - (__u8 *)&inum,
> > - sizeof(inum));
> > + __le32 gen = cpu_to_le32(inode->i_generation);
> > + crc = ext4_chksum(sbi, sbi->s_uuid_crc, (__u8 *)&inum,
> > + sizeof(inum));
> > + ei->i_uuid_inum_crc = ext4_chksum(sbi, crc, (__u8 *)&gen,
> > + sizeof(gen));
> > }
> >
> > ext4_clear_state_flags(ei); /* Only relevant on 32-bit archs */
> >
> > diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> > index f18bfe3..fdf0b1e 100644
> > --- a/fs/ext4/ioctl.c
> > +++ b/fs/ext4/ioctl.c
> > @@ -149,6 +149,10 @@ flags_out:
> > if (!inode_owner_or_capable(inode))
> > return -EPERM;
> >
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return -ENOTTY;
>
> This should get an ext4_warning() in the non-checksum case to warn
> users that this ioctl is deprecated and will be removed in the
> future unless there is a good reason to keep it.

Ok. Maybe put it in feature_removal_schedule.txt too? Maybe not; the ioctl is
not being totally removed, it's merely unsupported for the metadata_csum case.

--D


2011-10-13 00:02:21

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 11/28] ext4: Calculate and verify inode checksums

On 2011-10-12, at 3:03 PM, Darrick J. Wong wrote:
> On Wed, Oct 12, 2011 at 12:45:01PM -0700, Andreas Dilger wrote:
>>> +static __u32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw,
>>> + struct ext4_inode_info *ei)
>>> +{
>>> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>>> + __u16 crc_lo;
>>> + __u16 crc_hi = 0;
>>> + __u32 crc;
>>> +
>>> + crc_lo = raw->i_checksum_lo;
>>> + raw->i_checksum_lo = 0;
>>> + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
>>> + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi)) {
>>> + crc_hi = raw->i_checksum_hi;
>>> + raw->i_checksum_hi = 0;
>>> + }
>>> +
>>> + crc = ext4_chksum(sbi, ei->i_uuid_inum_crc, (__u8 *)raw,
>>> + EXT4_INODE_SIZE(inode->i_sb));
>>> +
>>> + raw->i_checksum_lo = crc_lo;
>>> + if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE &&
>>> + EXT4_FITS_IN_INODE(raw, ei, i_checksum_hi))
>>> + raw->i_checksum_hi = crc_hi;
>>> +
>>> + return crc;
>>> +}
>>
>> This computes both the _lo and _hi parts of the checksum and overwrites
>> what is in the inode...
>
> I don't follow your logic ... for the _lo component, first I save the old
> i_checksum_lo contents in crc_lo. Then I stuff zero into i_checksum_lo.
> Next I perform the checksum computation (with the checksum field
> effectively "zero") and put the results into crc. Then I copy whatever
> I saved in crc_lo back into i_checksum_lo.
>
> crc_lo, crc_hi, and crc are three separate variables, and neither crc_lo
> nor crc_hi are ever assigned any part of crc. Therefore crc_lo and
> crc_hi should always contain the old checksum contents.
>
> Did I miss something? Afaict the contents of raw should be the same
> before and after the call to ext4_inode_csum(), but maybe I've been
> looking at this too long. :)

No, you are right. I misread the code and thought that crc_lo and crc_hi
were derived from the computed value.

Cheers, Andreas

2011-10-13 00:05:50

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 12/28] ext4: Use i_generation in inode-related metadata checksums

On 2011-10-12, at 3:28 PM, Darrick J. Wong wrote:
> On Wed, Oct 12, 2011 at 12:52:30PM -0700, Andreas Dilger wrote:
>> On 2011-10-08, at 12:55 AM, Darrick J. Wong wrote:
>>> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
>>> index f18bfe3..fdf0b1e 100644
>>> --- a/fs/ext4/ioctl.c
>>> +++ b/fs/ext4/ioctl.c
>>> @@ -149,6 +149,10 @@ flags_out:
>>> if (!inode_owner_or_capable(inode))
>>> return -EPERM;
>>>
>>> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + return -ENOTTY;
>>
>> This should get an ext4_warning() in the non-checksum case to warn
>> users that this ioctl is deprecated and will be removed in the
>> future unless there is a good reason to keep it.
>
> Ok. Maybe put it in feature_removal_schedule.txt too? Maybe not; the
> ioctl is not being totally removed, it's merely unsupported for the
> metadata_csum case.

I understand it is not being removed entirely, but since it doesn't
work at all with the checksum-enabled case, we may as well mark it
deprecated for the non-checksum case to allow us to catch any users
(however unlikely I think it is) earlier rather than later.

Cheers, Andreas






2011-10-13 07:16:31

by djwong

[permalink] [raw]
Subject: Re: [PATCH 15/28] ext4: Calculate and verify block bitmap checksum

On Wed, Oct 12, 2011 at 06:00:40PM -0600, Andreas Dilger wrote:
> On 2011-10-08, at 1:55 AM, Darrick J. Wong wrote:
> > Compute and verify the checksum of the block bitmap; this checksum is
> > stored in the block group descriptor.
> >
> > @@ -353,11 +360,26 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
> > /*
> > * file system mounted not to panic on error,
> > + * -EIO with corrupt bitmap
> > */
> > + ext4_lock_group(sb, block_group);
> > + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh) ||
> > + !ext4_block_bitmap_csum_verify(sb, block_group, desc, bh,
> > + EXT4_BLOCKS_PER_GROUP(sb) / 8)) {
> > + ext4_unlock_group(sb, block_group);
> > + put_bh(bh);
> > + ext4_error(sb, "Corrupt block bitmap - block_group = %u, "
> > + "block_bitmap = %llu", block_group, bitmap_blk);
> > + return NULL;
> > + }
> > + ext4_unlock_group(sb, block_group);
> > + set_buffer_verified(bh);
>
> I've been thinking a while that we should add per-group error flags
> for the block and inode bitmaps. That way, if we detect errors with
> either one, we can set the flag in the group descriptor and avoid
> using it for any allocations in the future. Otherwise, we try to
> read the bitmap in repeatedly.

I think there's some code in ext4 somewhere that does that. I also wonder if
the possibility that we're seeing a transient corruption error is worth
rechecking the block until it fails? (I suspect not, but I decided to throw
that out there anyway.)

> > @@ -803,6 +842,11 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > if (groups_per_page == 0)
> > groups_per_page = 1;
> >
> > + csd = kzalloc(sizeof(struct ext4_csum_data) * groups_per_page,
> > + GFP_NOFS);
> > + if (csd == NULL)
> > + goto out;
> > +
> > /* allocate buffer_heads to read bitmaps */
> > if (groups_per_page > 1) {
> > err = -ENOMEM;
> > @@ -880,22 +924,25 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > * get set with buffer lock held.
> > */
> > set_bitmap_uptodate(bh[i]);
> > - bh[i]->b_end_io = end_buffer_read_sync;
> > + csd[i].cd_sb = sb;
> > + csd[i].cd_group = first_group + i;
> > + bh[i]->b_private = csd + i;
> > + bh[i]->b_end_io = ext4_end_buffer_read_sync;
>
> It seems to be allocating this extra csd[] and calling the more complex
> ext4_end_buffer_read_sync() callback regardless of whether the checksum
> code is enabled or not. Would it be better to only set the custom
> callback if we need to verify the checksum?

Yep, we could go straight to end_buffer_read_sync in the no-csum case.

--D

2011-10-13 07:21:48

by djwong

[permalink] [raw]
Subject: Re: [PATCH 17/28] ext4: Calculate and verify checksums for htree nodes

On Wed, Oct 12, 2011 at 05:27:29PM -0600, Andreas Dilger wrote:
> On 2011-10-08, at 1:55 AM, Darrick J. Wong wrote:
> > Calculate and verify the checksum for directory index tree (htree) node blocks. The checksum is stored in the last 4 bytes of the htree block and requires the dx_entry array to stop 1 dx_entry short of the end of the block.
> >
> > +/*
> > + * This goes at the end of each htree block. If you want to use the
> > + * reserved field, you'll have to update the checksum code to include it.
> > + */
> > +struct dx_tail {
> > + u32 reserved;
> > + u32 checksum; /* crc32c(uuid+inum+dirblock) */
> > +};
>
> Why exclude the reserved field from the checksum? That would mean that
> the checksum value will depend on whether the other feature is in use
> or not, which will make everything more complicated in the future.
>
> Better to always set it to zero for now, and if it is used in the future
> then it can be set to whatever value is needed and the checksum code
> will remain the same in both the kernel and e2fsprogs.

As a minor speed optimization, the htree checksum only covers the fake dirent
structures, the htree block header, and header.count dx_entry structs, which
means that in the common case it won't come anywhere close to checksumming all
4096 bytes. But you do make a compelling case to cover that reserved field
even if it's zero now.

> > +/* checksumming functions */
> > +static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
> > + struct ext4_dir_entry *dirent,
> > + int *offset)
> > +{
> > + if (le16_to_cpu(dirent->rec_len) == EXT4_BLOCK_SIZE(inode->i_sb))
> > + count_offset = 8;
> > + else if (le16_to_cpu(dirent->rec_len) == 12) {
> > + dp = (struct ext4_dir_entry *)(((void *)dirent) + 12);
> > + if (le16_to_cpu(dp->rec_len) !=
> > + EXT4_BLOCK_SIZE(inode->i_sb) - 12)
> > + return NULL;
> > + root = (struct dx_root_info *)(((void *)dp + 12));
> > + if (root->reserved_zero ||
> > + root->info_length != sizeof(struct dx_root_info))
> > + return NULL;
> > + count_offset = 32;
> > + } else
> > + return NULL;
>
> (style) if one branch of an if-else has braces, they all should

Ok.

> > +static void ext4_dx_csum_set(struct inode *inode, struct ext4_dir_entry *dirent)
> > +{
> > + struct dx_countlimit *c;
> > + struct dx_tail *t;
> > + int count_offset, limit, count;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>
> It would be nice to add some macros to clean up the feature flag checks
> (in a separate patch, but this long line reminded me of it):
>
> #define EXT4_ROCOMPAT(sb, feature) \
> EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_ ## feature)
>
> Then the code can be changed to use:
>
> if (!EXT4_ROCOMPAT(inode->i_sb, METADATA_CSUM))
>
> which is not only shorter and has a chance of fitting on one line, but
> also avoids the occasional hard-to-find bug that uses a mismatched mask
> and flag word, like:
>
> if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> EXT4_FEATURE_INCOMPAT_FLEX_BG)
>
> With the helper macros, this would fail at compile time, because
> EXT4_FEATURE_ROCOMPAT_FLEX_BG does not exist.

Yes, that would make a nice (separate) cleanup patch.

--D

2011-10-18 04:11:54

by Tao Ma

[permalink] [raw]
Subject: Re: [PATCH 03/28] ext4: ext4_mkdir should dirty dir_block with the parent inode

Hi Darrick,
On 10/08/2011 03:54 PM, Darrick J. Wong wrote:
> ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
> Unfortunately, dir_block belongs to the newly created directory (which is
> "inode"), not the parent directory (which is "dir"). Fix the incorrect
> association.
could you please re-send this patch and other fixes that isn't related
to metadata checksum? I need this when implementing inlined dir, and I
guess this can be merged in this merge window.

Thanks
Tao
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/ext4/namei.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
>
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 6d3fab4..50c7294 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -1863,7 +1863,7 @@ retry:
> ext4_set_de_type(dir->i_sb, de, S_IFDIR);
> inode->i_nlink = 2;
> BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
> - err = ext4_handle_dirty_metadata(handle, dir, dir_block);
> + err = ext4_handle_dirty_metadata(handle, inode, dir_block);
> if (err)
> goto out_clear_inode;
> err = ext4_mark_inode_dirty(handle, inode);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-10-21 21:15:28

by djwong

[permalink] [raw]
Subject: Re: [PATCH 03/28] ext4: ext4_mkdir should dirty dir_block with the parent inode

On Tue, Oct 18, 2011 at 12:11:47PM +0800, Tao Ma wrote:
> Hi Darrick,
> On 10/08/2011 03:54 PM, Darrick J. Wong wrote:
> > ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
> > Unfortunately, dir_block belongs to the newly created directory (which is
> > "inode"), not the parent directory (which is "dir"). Fix the incorrect
> > association.
> could you please re-send this patch and other fixes that isn't related
> to metadata checksum? I need this when implementing inlined dir, and I
> guess this can be merged in this merge window.

Just FYI it's patches 1-5 of this patchset, but I'm happy to resend.

--D
>
> Thanks
> Tao
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/namei.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> >
> > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > index 6d3fab4..50c7294 100644
> > --- a/fs/ext4/namei.c
> > +++ b/fs/ext4/namei.c
> > @@ -1863,7 +1863,7 @@ retry:
> > ext4_set_de_type(dir->i_sb, de, S_IFDIR);
> > inode->i_nlink = 2;
> > BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
> > - err = ext4_handle_dirty_metadata(handle, dir, dir_block);
> > + err = ext4_handle_dirty_metadata(handle, inode, dir_block);
> > if (err)
> > goto out_clear_inode;
> > err = ext4_mark_inode_dirty(handle, inode);
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2011-11-07 20:00:08

by djwong

[permalink] [raw]
Subject: Re: [PATCH 15/28] ext4: Calculate and verify block bitmap checksum

On Thu, Oct 13, 2011 at 12:16:31AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 12, 2011 at 06:00:40PM -0600, Andreas Dilger wrote:
> > On 2011-10-08, at 1:55 AM, Darrick J. Wong wrote:
> > > Compute and verify the checksum of the block bitmap; this checksum is
> > > stored in the block group descriptor.
> > >
> > > @@ -353,11 +360,26 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
> > > /*
> > > * file system mounted not to panic on error,
> > > + * -EIO with corrupt bitmap
> > > */
> > > + ext4_lock_group(sb, block_group);
> > > + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh) ||
> > > + !ext4_block_bitmap_csum_verify(sb, block_group, desc, bh,
> > > + EXT4_BLOCKS_PER_GROUP(sb) / 8)) {
> > > + ext4_unlock_group(sb, block_group);
> > > + put_bh(bh);
> > > + ext4_error(sb, "Corrupt block bitmap - block_group = %u, "
> > > + "block_bitmap = %llu", block_group, bitmap_blk);
> > > + return NULL;
> > > + }
> > > + ext4_unlock_group(sb, block_group);
> > > + set_buffer_verified(bh);
> >
> > I've been thinking a while that we should add per-group error flags
> > for the block and inode bitmaps. That way, if we detect errors with
> > either one, we can set the flag in the group descriptor and avoid
> > using it for any allocations in the future. Otherwise, we try to
> > read the bitmap in repeatedly.
>
> I think there's some code in ext4 somewhere that does that. I also wonder if
> the possibility that we're seeing a transient corruption error is worth
> rechecking the block until it fails? (I suspect not, but I decided to throw
> that out there anyway.)

There's a bit of code in ext4_init_block_bitmap that makes a block group
unwritable if the bg checksum fails to verify:

/* If checksum is bad mark all blocks used to prevent allocation
* essentially implementing a per-group read-only flag. */
if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
ext4_error(sb, "Checksum bad for group %u",
block_group);
ext4_free_blks_set(sb, gdp, 0);
ext4_free_inodes_set(sb, gdp, 0);
ext4_itable_unused_set(sb, gdp, 0);
memset(bh->b_data, 0xff, sb->s_blocksize);
ext4_block_bitmap_csum_set(sb, block_group, gdp, bh,
EXT4_BLOCKS_PER_GROUP(sb) /
8);
return 0;
}

Do people think that doing this in the event of a block/inode bitmap checksum
failure is a good idea?

--D
>
> > > @@ -803,6 +842,11 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > > if (groups_per_page == 0)
> > > groups_per_page = 1;
> > >
> > > + csd = kzalloc(sizeof(struct ext4_csum_data) * groups_per_page,
> > > + GFP_NOFS);
> > > + if (csd == NULL)
> > > + goto out;
> > > +
> > > /* allocate buffer_heads to read bitmaps */
> > > if (groups_per_page > 1) {
> > > err = -ENOMEM;
> > > @@ -880,22 +924,25 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > > * get set with buffer lock held.
> > > */
> > > set_bitmap_uptodate(bh[i]);
> > > - bh[i]->b_end_io = end_buffer_read_sync;
> > > + csd[i].cd_sb = sb;
> > > + csd[i].cd_group = first_group + i;
> > > + bh[i]->b_private = csd + i;
> > > + bh[i]->b_end_io = ext4_end_buffer_read_sync;
> >
> > It seems to be allocating this extra csd[] and calling the more complex
> > ext4_end_buffer_read_sync() callback regardless of whether the checksum
> > code is enabled or not. Would it be better to only set the custom
> > callback if we need to verify the checksum?
>
> Yep, we could go straight to end_buffer_read_sync in the no-csum case.
>
> --D
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2011-11-07 21:44:02

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 15/28] ext4: Calculate and verify block bitmap checksum

On 2011-11-07, at 1:00 PM, Darrick J. Wong wrote:
> On Thu, Oct 13, 2011 at 12:16:31AM -0700, Darrick J. Wong wrote:
>> On Wed, Oct 12, 2011 at 06:00:40PM -0600, Andreas Dilger wrote:
>>> I've been thinking a while that we should add per-group error flags
>>> for the block and inode bitmaps. That way, if we detect errors with
>>> either one, we can set the flag in the group descriptor and avoid
>>> using it for any allocations in the future. Otherwise, we try to
>>> read the bitmap in repeatedly.
>>
>> I think there's some code in ext4 somewhere that does that. I also wonder if
>> the possibility that we're seeing a transient corruption error is worth
>> rechecking the block until it fails? (I suspect not, but I decided to throw
>> that out there anyway.)
>
> There's a bit of code in ext4_init_block_bitmap that makes a block group
> unwritable if the bg checksum fails to verify:
>
> /* If checksum is bad mark all blocks used to prevent allocation
> * essentially implementing a per-group read-only flag. */
> if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
> ext4_error(sb, "Checksum bad for group %u",
> block_group);
> ext4_free_blks_set(sb, gdp, 0);
> ext4_free_inodes_set(sb, gdp, 0);
> ext4_itable_unused_set(sb, gdp, 0);
> memset(bh->b_data, 0xff, sb->s_blocksize);
> ext4_block_bitmap_csum_set(sb, block_group, gdp, bh,
> EXT4_BLOCKS_PER_GROUP(sb) /
> 8);
> return 0;
> }
>
> Do people think that doing this in the event of a block/inode bitmap checksum
> failure is a good idea?

For me, yes. The sanity checks we do on the block bitmaps are only very basic
(e.g. bits for bitmaps themselves are set, for inode table). Blocking any
allocation from a single group with a bad checksum is not harmful in the long
term, and can avoid an explosion of corruption if blocks would otherwise be allocated multiple times.

Cheers, Andreas

2011-11-10 00:57:03

by djwong

[permalink] [raw]
Subject: Re: [PATCH 15/28] ext4: Calculate and verify block bitmap checksum

On Mon, Nov 07, 2011 at 02:44:02PM -0700, Andreas Dilger wrote:
> On 2011-11-07, at 1:00 PM, Darrick J. Wong wrote:
> > On Thu, Oct 13, 2011 at 12:16:31AM -0700, Darrick J. Wong wrote:
> >> On Wed, Oct 12, 2011 at 06:00:40PM -0600, Andreas Dilger wrote:
> >>> I've been thinking a while that we should add per-group error flags
> >>> for the block and inode bitmaps. That way, if we detect errors with
> >>> either one, we can set the flag in the group descriptor and avoid
> >>> using it for any allocations in the future. Otherwise, we try to
> >>> read the bitmap in repeatedly.
> >>
> >> I think there's some code in ext4 somewhere that does that. I also wonder if
> >> the possibility that we're seeing a transient corruption error is worth
> >> rechecking the block until it fails? (I suspect not, but I decided to throw
> >> that out there anyway.)
> >
> > There's a bit of code in ext4_init_block_bitmap that makes a block group
> > unwritable if the bg checksum fails to verify:
> >
> > /* If checksum is bad mark all blocks used to prevent allocation
> > * essentially implementing a per-group read-only flag. */
> > if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
> > ext4_error(sb, "Checksum bad for group %u",
> > block_group);
> > ext4_free_blks_set(sb, gdp, 0);
> > ext4_free_inodes_set(sb, gdp, 0);
> > ext4_itable_unused_set(sb, gdp, 0);
> > memset(bh->b_data, 0xff, sb->s_blocksize);
> > ext4_block_bitmap_csum_set(sb, block_group, gdp, bh,
> > EXT4_BLOCKS_PER_GROUP(sb) /
> > 8);
> > return 0;
> > }
> >
> > Do people think that doing this in the event of a block/inode bitmap checksum
> > failure is a good idea?
>
> For me, yes. The sanity checks we do on the block bitmaps are only very
> basic (e.g. bits for bitmaps themselves are set, for inode table). Blocking
> any allocation from a single group with a bad checksum is not harmful in the
> long term, and can avoid an explosion of corruption if blocks would otherwise
> be allocated multiple times.

On second thought, let's see what happens with groups that fail checksums ...
it seems that ext4_check_descriptors() fails the mount in the event of a group
descriptor failing checksum. Both ext4_read_inode_bitmap() and
ext4_read_block_bitmap() return NULL if the respective bitmap fails checksum.
It looks like most of ext4's block {de,}allocate functions will fail on NULL
bitmap pointer, so it shouldn't be possible to allocate (or deallocate) from a
broken group.

There is one small deficiency: we need an explicit flag that marks the group
dead. That way if either bitmap fails checksum, the other _bitmap_read()
function will know to just return NULL, and the group becomes totally frozen to
allocation activity.

I think ext4_new_inode need to be taught to continue scanning groups if it
encounters a "dead" group? I'm not sure yet if the same thing needs to be
applied to the block allocation code.

I'm actually wondering how it is that a group could pass checksum verification
at mount time but fail it later, which is what that code snippet seems designed
to catch. Maybe something related to online resize? Either way, I'm now
wondering if we really need something like that for the simple case of bitmaps
failing checksum. I think we're already covered, but maybe I missed something?

--D
>
> Cheers, Andreas
>
>
>
>
>


2011-11-10 02:34:52

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH 15/28] ext4: Calculate and verify block bitmap checksum

On 2011-11-09, at 5:57 PM, Darrick J. Wong wrote:
> On Mon, Nov 07, 2011 at 02:44:02PM -0700, Andreas Dilger wrote:
>> On 2011-11-07, at 1:00 PM, Darrick J. Wong wrote:
>>> On Thu, Oct 13, 2011 at 12:16:31AM -0700, Darrick J. Wong wrote:
>>>> On Wed, Oct 12, 2011 at 06:00:40PM -0600, Andreas Dilger wrote:
>>>>> I've been thinking a while that we should add per-group error flags
>>>>> for the block and inode bitmaps. That way, if we detect errors with
>>>>> either one, we can set the flag in the group descriptor and avoid
>>>>> using it for any allocations in the future. Otherwise, we try to
>>>>> read the bitmap in repeatedly.
>>>>
>>>> I think there's some code in ext4 somewhere that does that. I also
>>>> wonder if the possibility that we're seeing a transient corruption
>>>> error is worth rechecking the block until it fails? (I suspect not,
>>>> but I decided to throw that out there anyway.)
>>>
>>> There's a bit of code in ext4_init_block_bitmap that makes a block group
>>> unwritable if the bg checksum fails to verify:
>>>
>>> /* If checksum is bad mark all blocks used to prevent allocation
>>> * essentially implementing a per-group read-only flag. */
>>> if (!ext4_group_desc_csum_verify(sbi, block_group, gdp)) {
>>> ext4_error(sb, "Checksum bad for group %u",
>>> block_group);
>>> ext4_free_blks_set(sb, gdp, 0);
>>> ext4_free_inodes_set(sb, gdp, 0);
>>> ext4_itable_unused_set(sb, gdp, 0);
>>> memset(bh->b_data, 0xff, sb->s_blocksize);
>>> ext4_block_bitmap_csum_set(sb, block_group, gdp, bh,
>>> EXT4_BLOCKS_PER_GROUP(sb) /
>>> 8);
>>> return 0;
>>> }
>>>
>>> Do people think that doing this in the event of a block/inode bitmap
>>> checksum failure is a good idea?
>>
>> For me, yes. The sanity checks we do on the block bitmaps are only very
>> basic (e.g. bits for bitmaps themselves are set, for inode table). Blocking
>> any allocation from a single group with a bad checksum is not harmful in the
>> long term, and can avoid an explosion of corruption if blocks would otherwise
>> be allocated multiple times.
>
> On second thought, let's see what happens with groups that fail checksums ...
> it seems that ext4_check_descriptors() fails the mount in the event of a group
> descriptor failing checksum. Both ext4_read_inode_bitmap() and
> ext4_read_block_bitmap() return NULL if the respective bitmap fails checksum.
> It looks like most of ext4's block {de,}allocate functions will fail on NULL
> bitmap pointer, so it shouldn't be possible to allocate (or deallocate) from a
> broken group.
>

> There is one small deficiency: we need an explicit flag that marks the group
> dead. That way if either bitmap fails checksum, the other _bitmap_read()
> function will know to just return NULL, and the group becomes totally frozen to
> allocation activity.

These could be stored in the group flags, bg_flags, like the UNINIT bits.
Since the group descriptors are pinned in memory there is no need to
re-read the group (and fail) again. If it is written to disk, then e2fsck
can clear it.

> I think ext4_new_inode() need to be taught to continue scanning groups if it
> encounters a "dead" group? I'm not sure yet if the same thing needs to be
> applied to the block allocation code.

It would be nice, for the case of bitmap checksum errors at least that the
allocator just chose the next block group and continued on running, though
it should also mark the superblock error flag.

If things are so bad that the device is completely unreadable, all of the block
groups would quickly be marked in error and the filesystem is now aborted like
it would have been previously.

> I'm actually wondering how it is that a group could pass checksum verification
> at mount time but fail it later, which is what that code snippet seems designed
> to catch. Maybe something related to online resize? Either way, I'm now
> wondering if we really need something like that for the simple case of bitmaps
> failing checksum. I think we're already covered, but maybe I missed something?

Hmm, I guess the only way that this might be hit is via memory corruption?
I never really thought about it. I guess it isn't harmful to check, and in
case of an in-memory corruption, it does call ext4_error() and typically
marks the filesystem read-only.

For inode and block bitmaps, however, I've seen corruption in a lot of cases.
Sometimes the simple checks catch this, but having a proper checksum would be
a lot more reassuring.

Cheers, Andreas