Hi all,
This patchset adds crc32c checksums to most of the ext4 metadata objects. A
full design document is on the ext4 wiki[1] but I will summarize that document here.
As much as we wish our storage hardware was totally reliable, it is still
quite possible for data to be corrupted on disk, corrupted during transfer over
a wire, or written to the wrong places. To protect against this sort of
non-hostile corruption, it is desirable to store checksums of metadata objects
on the filesystem to prevent broken metadata from shredding the filesystem.
The crc32c polynomial was chosen for its improved error detection capabilities
over crc32 and crc16, and because of its hardware acceleration on current and
upcoming Intel and Sparc chips.
Each type of metadata object has been retrofitted to store a checksum as follows:
- The superblock stores a crc32c of itself.
- Each inode stores crc32c(fs_uuid + inode_num + inode + slack_space_after_inode)
- Block and inode bitmaps each get their own crc32c(fs_uuid + group_num +
bitmap), stored in the block group descriptor.
- Each extent tree block stores a crc32c(fs_uuid + inode_num + extent_entries)
in unused space at the end of the block.
- Each directory leaf block has an unused-looking directory entry big enough to
store a crc32c(fs_uuid + inode_num + block) at the end of the block.
- Each directory htree block is shortened to contain a crc32c(fs_uuid +
inode_num + block) at the end of the block.
- Extended attribute blocks store crc32c(fs_uuid + block_no + ea_block) in the
header.
- Journal commit blocks can be converted to use crc32c to checksum all blocks
in the transaction, if journal_checksum is given.
The first four patches in the kernel patchset fix existing bugs in ext4 that
cause incorrect checkums to be written. I think Ted already took them, but
with recent instability I'm resending them to be cautious. The subsequent 12
patches add the necessary code to support checksumming in ext4 and jbd2.
I also have a set of three patches that provide a faster crc32c implementation
based on Bob Pearson's earlier crc32 patchset. This will be sent under
separate cover to the crypto list and to lkml/linux-ext4.
The patchset for e2fsprogs will be sent under separate cover only to linux-ext4
as it is quite lengthy (~36 patches).
As far as performance impact goes, I see nearly no change with a standard mail
server ffsb simulation. On a test that involves only file creation and
deletion and extent tree modifications, I see a drop of about 50 percent with
the current kernel crc32c implementation; this improves to a drop of about 20
percent with the enclosed crc32c implementation. However, given that metadata
is usually a small fraction of total IO, it doesn't seem like the cost of
enabling this feature is unreasonable.
There are of course unresolved issues:
- What to do when the block group descriptor isn't big enough to hold 2 crc32s
(which is the case with 32-bit ext4 filesystems, sadly). I'm not quite
convinced that truncating a 32-bit checksum to 16-bits is a safe idea. Right
now, one has to enable the 64bit feature to enable any bitmap checksums.
I'm not sure how effective crc16 is at checksumming 32768-bit bitmaps.
- Using the journal commit hooks to delay crc32c calculation until dirty
buffers are actually being written to disk.
- Can we get away with using a (hw accelerated) LE crc32c for jbd2, which
stores its data in BE order?
- Interaction with online resize code. Yongqiang seems to be in the process of
rewriting this, so I haven't looked at it very closely yet.
- If block group descriptors can now exceed 32 bytes (when 64bit filesystem
support is enabled), should we use crc32c instead of crc16? From what I've
read of the literature, crc16 is not very effective on datasets exceeding 256
bytes.
Please have a look at the design document and patches, and please feel free to
suggest any changes. I will be at LPC next week if anyone wishes to discuss,
debate, or protest.
--D
[1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
When ext4_rename performs a directory rename (move), dir_bh is a buffer that is
modified to update the '..' link in the directory being moved (old_inode).
However, ext4_handle_dirty_metadata is called with the old parent directory
inode (old_dir) and dir_bh, which is incorrect because dir_bh does not belong
to the parent inode. Fix this error.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 815c31a..f778a54 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2529,7 +2529,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
cpu_to_le32(new_dir->i_ino);
BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
- retval = ext4_handle_dirty_metadata(handle, old_dir, dir_bh);
+ retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
if (retval) {
ext4_std_error(old_dir->i_sb, retval);
goto end_rename;
ext4_dx_add_entry manipulates bh2 and frames[0].bh, which are two buffer_heads
that point to directory blocks assigned to the directory inode. However, the
function calls ext4_handle_dirty_metadata with the inode of the file that's
being added to the directory, not the directory inode itself. Therefore,
correct the code to dirty the directory buffers with the directory inode, not
the file inode.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f8068c7..815c31a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1585,7 +1585,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
dxtrace(dx_show_index("node", frames[1].entries));
dxtrace(dx_show_index("node",
((struct dx_node *) bh2->b_data)->entries));
- err = ext4_handle_dirty_metadata(handle, inode, bh2);
+ err = ext4_handle_dirty_metadata(handle, dir, bh2);
if (err)
goto journal_error;
brelse (bh2);
@@ -1611,7 +1611,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
if (err)
goto journal_error;
}
- err = ext4_handle_dirty_metadata(handle, inode, frames[0].bh);
+ err = ext4_handle_dirty_metadata(handle, dir, frames[0].bh);
if (err) {
ext4_std_error(inode->i_sb, err);
goto cleanup;
Create a new BH_Verified flag to indicate that we've verified all the data in a
buffer_head for correctness. This allows us to bypass expensive verification
steps when they are not necessary without missing them when they are.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 2 ++
fs/ext4/extents.c | 35 ++++++++++++++++++++++++++---------
2 files changed, 28 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e717dfd..ecb86c2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2242,10 +2242,12 @@ extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
enum ext4_state_bits {
BH_Uninit /* blocks are allocated but uninitialized on disk */
= BH_JBDPrivateStart,
+ BH_Verified, /* metadata block has been verified ok */
};
BUFFER_FNS(Uninit, uninit)
TAS_BUFFER_FNS(Uninit, uninit)
+BUFFER_FNS(Verified, verified)
/*
* Add new method to test wether block and inode bitmaps are properly
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 57cf568..4ac4303 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -403,6 +403,26 @@ int ext4_ext_check_inode(struct inode *inode)
return ext4_ext_check(inode, ext_inode_hdr(inode), ext_depth(inode));
}
+static int __ext4_ext_check_block(const char *function, unsigned int line,
+ struct inode *inode,
+ struct ext4_extent_header *eh,
+ int depth,
+ struct buffer_head *bh)
+{
+ int ret;
+
+ if (buffer_verified(bh))
+ return 0;
+ ret = ext4_ext_check(inode, eh, depth);
+ if (ret)
+ return ret;
+ set_buffer_verified(bh);
+ return ret;
+}
+
+#define ext4_ext_check_block(inode, eh, depth, bh) \
+ __ext4_ext_check_block(__func__, __LINE__, inode, eh, depth, bh)
+
#ifdef EXT_DEBUG
static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
{
@@ -659,8 +679,6 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
i = depth;
/* walk through the tree */
while (i) {
- int need_to_validate = 0;
-
ext_debug("depth %d: num %d, max %d\n",
ppos, le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max));
@@ -679,8 +697,6 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
put_bh(bh);
goto err;
}
- /* validate the extent entries */
- need_to_validate = 1;
}
eh = ext_block_hdr(bh);
ppos++;
@@ -694,7 +710,7 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
path[ppos].p_hdr = eh;
i--;
- if (need_to_validate && ext4_ext_check(inode, eh, i))
+ if (ext4_ext_check_block(inode, eh, i, bh))
goto err;
}
@@ -1350,7 +1366,8 @@ got_index:
return -EIO;
eh = ext_block_hdr(bh);
/* subtract from p_depth to get proper eh_depth */
- if (ext4_ext_check(inode, eh, path->p_depth - depth)) {
+ if (ext4_ext_check_block(inode, eh,
+ path->p_depth - depth, bh)) {
put_bh(bh);
return -EIO;
}
@@ -1363,7 +1380,7 @@ got_index:
if (bh == NULL)
return -EIO;
eh = ext_block_hdr(bh);
- if (ext4_ext_check(inode, eh, path->p_depth - depth)) {
+ if (ext4_ext_check_block(inode, eh, path->p_depth - depth, bh)) {
put_bh(bh);
return -EIO;
}
@@ -2591,8 +2608,8 @@ again:
err = -EIO;
break;
}
- if (ext4_ext_check(inode, ext_block_hdr(bh),
- depth - i - 1)) {
+ if (ext4_ext_check_block(inode, ext_block_hdr(bh),
+ depth - i - 1, bh)) {
err = -EIO;
break;
}
ext4_mkdir calls ext4_handle_dirty_metadata with dir_block and the inode "dir".
Unfortunately, dir_block belongs to the newly created directory (which is
"inode"), not the parent directory (which is "dir"). Fix the incorrect
association.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index f778a54..a067835 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -1862,7 +1862,7 @@ retry:
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
inode->i_nlink = 2;
BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, dir, dir_block);
+ err = ext4_handle_dirty_metadata(handle, inode, dir_block);
if (err)
goto out_clear_inode;
err = ext4_mark_inode_dirty(handle, inode);
This patch introduces a rocompat feature flag to signal the presence of
checksums for metadata blocks. It also pulls in CRC32c since we'll be
using that for the new checksums.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/Kconfig | 1 +
fs/ext4/ext4.h | 4 +++-
2 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/fs/ext4/Kconfig b/fs/ext4/Kconfig
index 9ed1bb1..97f4a7d 100644
--- a/fs/ext4/Kconfig
+++ b/fs/ext4/Kconfig
@@ -2,6 +2,7 @@ config EXT4_FS
tristate "The Extended 4 (ext4) filesystem"
select JBD2
select CRC16
+ select LIBCRC32C
help
This is the next generation of the ext3 filesystem.
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ecb86c2..f79ddac 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1359,6 +1359,7 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
#define EXT4_FEATURE_RO_COMPAT_DIR_NLINK 0x0020
#define EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE 0x0040
#define EXT4_FEATURE_RO_COMPAT_QUOTA 0x0100
+#define EXT4_FEATURE_RO_COMPAT_METADATA_CSUM 0x0400
#define EXT4_FEATURE_INCOMPAT_COMPRESSION 0x0001
#define EXT4_FEATURE_INCOMPAT_FILETYPE 0x0002
@@ -1401,7 +1402,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
EXT4_FEATURE_RO_COMPAT_DIR_NLINK | \
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE | \
EXT4_FEATURE_RO_COMPAT_BTREE_DIR |\
- EXT4_FEATURE_RO_COMPAT_HUGE_FILE)
+ EXT4_FEATURE_RO_COMPAT_HUGE_FILE |\
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM)
/*
* Default values for user and/or group using reserved blocks
These helper functions will be used to calculate and verify the block and inode
bitmap checksums.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/bitmap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/ext4.h | 6 ++++++
2 files changed, 50 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/bitmap.c b/fs/ext4/bitmap.c
index fa3af81..5421152 100644
--- a/fs/ext4/bitmap.c
+++ b/fs/ext4/bitmap.c
@@ -9,6 +9,7 @@
#include <linux/buffer_head.h>
#include <linux/jbd2.h>
+#include <linux/crc32c.h>
#include "ext4.h"
#ifdef EXT4FS_DEBUG
@@ -29,3 +30,46 @@ unsigned int ext4_count_free(struct buffer_head *map, unsigned int numchars)
#endif /* EXT4FS_DEBUG */
+__le32 ext4_bitmap_csum(struct super_block *sb, ext4_group_t group,
+ struct buffer_head *bh, int sz)
+{
+ __le32 crc;
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (sbi->s_desc_size < EXT4_MIN_DESC_SIZE_64BIT)
+ return 0;
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 0;
+
+ group = cpu_to_le32(group);
+ crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+ crc = crc32c_le(crc, (__u8 *)&group, sizeof(group));
+ crc = crc32c_le(crc, (__u8 *)bh->b_data, sz);
+
+ return cpu_to_le32(crc);
+}
+
+int ext4_bitmap_csum_verify(struct super_block *sb, ext4_group_t group,
+ __le32 provided, struct buffer_head *bh, int sz)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (sbi->s_desc_size >= EXT4_MIN_DESC_SIZE_64BIT &&
+ EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
+ (provided != ext4_bitmap_csum(sb, group, bh, sz)))
+ return 0;
+ return 1;
+}
+
+void ext4_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
+ __le32 *csum, struct buffer_head *bh, int sz)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+ if (sbi->s_desc_size >= EXT4_MIN_DESC_SIZE_64BIT &&
+ EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ *csum = ext4_bitmap_csum(sb, group, bh, sz);
+}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e2361cc..bc7ace1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1722,6 +1722,12 @@ struct mmpd_data {
/* bitmap.c */
extern unsigned int ext4_count_free(struct buffer_head *, unsigned);
+__le32 ext4_bitmap_csum(struct super_block *sb, ext4_group_t group,
+ struct buffer_head *bh, int sz);
+int ext4_bitmap_csum_verify(struct super_block *sb, ext4_group_t group,
+ __le32 provided, struct buffer_head *bh, int sz);
+void ext4_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
+ __le32 *csum, struct buffer_head *bh, int sz);
/* balloc.c */
extern unsigned int ext4_block_group(struct super_block *sb,
Compute and verify the checksum of the inode bitmap; the checkum is stored in
the block group descriptor.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 3 ++-
fs/ext4/ialloc.c | 33 ++++++++++++++++++++++++++++++---
2 files changed, 32 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bc7ace1..248cbd2 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -279,7 +279,8 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
__le16 bg_used_dirs_count_hi; /* Directories count MSB */
__le16 bg_itable_unused_hi; /* Unused inodes count MSB */
- __u32 bg_reserved2[3];
+ __le32 bg_inode_bitmap_csum; /* crc32c(uuid+group+ibitmap) */
+ __u32 bg_reserved2[2];
};
/*
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 9c63f27..53faffc 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -82,12 +82,18 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
ext4_free_inodes_set(sb, gdp, 0);
ext4_itable_unused_set(sb, gdp, 0);
memset(bh->b_data, 0xff, sb->s_blocksize);
+ ext4_bitmap_csum_set(sb, block_group,
+ &gdp->bg_inode_bitmap_csum, bh,
+ (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
return 0;
}
memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8,
bh->b_data);
+ ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum, bh,
+ (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
return EXT4_INODES_PER_GROUP(sb);
}
@@ -118,12 +124,12 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
return NULL;
}
if (bitmap_uptodate(bh))
- return bh;
+ goto verify;
lock_buffer(bh);
if (bitmap_uptodate(bh)) {
unlock_buffer(bh);
- return bh;
+ goto verify;
}
ext4_lock_group(sb, block_group);
@@ -131,6 +137,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
ext4_init_inode_bitmap(sb, bh, block_group, desc);
set_bitmap_uptodate(bh);
set_buffer_uptodate(bh);
+ set_buffer_verified(bh);
ext4_unlock_group(sb, block_group);
unlock_buffer(bh);
return bh;
@@ -144,7 +151,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
*/
set_bitmap_uptodate(bh);
unlock_buffer(bh);
- return bh;
+ goto verify;
}
/*
* submit the buffer_head for read. We can
@@ -161,6 +168,21 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
block_group, bitmap_blk);
return NULL;
}
+
+verify:
+ ext4_lock_group(sb, block_group);
+ if (!buffer_verified(bh) &&
+ !ext4_bitmap_csum_verify(sb, block_group,
+ desc->bg_inode_bitmap_csum, bh,
+ (EXT4_INODES_PER_GROUP(sb) + 7) / 8)) {
+ ext4_unlock_group(sb, block_group);
+ put_bh(bh);
+ ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
+ "inode_bitmap = %llu", block_group, bitmap_blk);
+ return NULL;
+ }
+ ext4_unlock_group(sb, block_group);
+ set_buffer_verified(bh);
return bh;
}
@@ -265,6 +287,8 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
ext4_used_dirs_set(sb, gdp, count);
percpu_counter_dec(&sbi->s_dirs_counter);
}
+ ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum,
+ bitmap_bh, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);
@@ -784,6 +808,9 @@ static int ext4_claim_inode(struct super_block *sb,
atomic_inc(&sbi->s_flex_groups[f].used_dirs);
}
}
+ ext4_bitmap_csum_set(sb, group, &gdp->bg_inode_bitmap_csum,
+ inode_bitmap_bh,
+ (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
err_ret:
ext4_unlock_group(sb, group);
Compute and verify the checksum of the block bitmap; this checksum is stored in
the block group descriptor.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/balloc.c | 43 ++++++++++++++++++++++++++++++++++---------
fs/ext4/ext4.h | 7 ++++++-
fs/ext4/ialloc.c | 5 +++++
fs/ext4/mballoc.c | 34 ++++++++++++++++++++++++++++++++++
4 files changed, 79 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index f8224ad..36d3020 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -105,6 +105,10 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
ext4_free_inodes_set(sb, gdp, 0);
ext4_itable_unused_set(sb, gdp, 0);
memset(bh->b_data, 0xff, sb->s_blocksize);
+ ext4_bitmap_csum_set(sb, block_group,
+ &gdp->bg_block_bitmap_csum, bh,
+ (EXT4_BLOCKS_PER_GROUP(sb) + 7) /
+ 8);
return 0;
}
memset(bh->b_data, 0, sb->s_blocksize);
@@ -175,6 +179,11 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
*/
ext4_mark_bitmap_end(group_blocks, sb->s_blocksize * 8,
bh->b_data);
+ ext4_bitmap_csum_set(sb, block_group,
+ &gdp->bg_block_bitmap_csum, bh,
+ (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
+ gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group,
+ gdp);
}
return free_blocks - ext4_group_used_meta_blocks(sb, block_group, gdp);
}
@@ -232,10 +241,10 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block *sb,
return desc;
}
-static int ext4_valid_block_bitmap(struct super_block *sb,
- struct ext4_group_desc *desc,
- unsigned int block_group,
- struct buffer_head *bh)
+int ext4_valid_block_bitmap(struct super_block *sb,
+ struct ext4_group_desc *desc,
+ unsigned int block_group,
+ struct buffer_head *bh)
{
ext4_grpblk_t offset;
ext4_grpblk_t next_zero_bit;
@@ -312,12 +321,12 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
}
if (bitmap_uptodate(bh))
- return bh;
+ goto verify;
lock_buffer(bh);
if (bitmap_uptodate(bh)) {
unlock_buffer(bh);
- return bh;
+ goto verify;
}
ext4_lock_group(sb, block_group);
if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
@@ -336,7 +345,7 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
*/
set_bitmap_uptodate(bh);
unlock_buffer(bh);
- return bh;
+ goto verify;
}
/*
* submit the buffer_head for read. We can
@@ -353,11 +362,27 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
block_group, bitmap_blk);
return NULL;
}
- ext4_valid_block_bitmap(sb, desc, block_group, bh);
+
+verify:
+ if (buffer_verified(bh))
+ return bh;
/*
* file system mounted not to panic on error,
- * continue with corrupt bitmap
+ * -EIO with corrupt bitmap
*/
+ ext4_lock_group(sb, block_group);
+ if (!ext4_valid_block_bitmap(sb, desc, block_group, bh) ||
+ !ext4_bitmap_csum_verify(sb, block_group,
+ desc->bg_block_bitmap_csum, bh,
+ (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8)) {
+ ext4_unlock_group(sb, block_group);
+ put_bh(bh);
+ ext4_error(sb, "Corrupt block bitmap - block_group = %u, "
+ "block_bitmap = %llu", block_group, bitmap_blk);
+ return NULL;
+ }
+ ext4_unlock_group(sb, block_group);
+ set_buffer_verified(bh);
return bh;
}
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 248cbd2..df149b3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -269,7 +269,8 @@ struct ext4_group_desc
__le16 bg_free_inodes_count_lo;/* Free inodes count */
__le16 bg_used_dirs_count_lo; /* Directories count */
__le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
- __u32 bg_reserved[2]; /* Likely block/inode bitmap checksum */
+ __u32 bg_reserved[1]; /* unclaimed */
+ __le32 bg_block_bitmap_csum; /* crc32c(uuid+group+bbitmap) */
__le16 bg_itable_unused_lo; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
@@ -1731,6 +1732,10 @@ void ext4_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
__le32 *csum, struct buffer_head *bh, int sz);
/* balloc.c */
+extern int ext4_valid_block_bitmap(struct super_block *sb,
+ struct ext4_group_desc *desc,
+ unsigned int block_group,
+ struct buffer_head *bh);
extern unsigned int ext4_block_group(struct super_block *sb,
ext4_fsblk_t blocknr);
extern ext4_grpblk_t ext4_block_group_offset(struct super_block *sb,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 53faffc..a335d19 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -984,6 +984,11 @@ got:
free = ext4_free_blocks_after_init(sb, group, gdp);
gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
ext4_free_blks_set(sb, gdp, free);
+ ext4_bitmap_csum_set(sb, group,
+ &gdp->bg_block_bitmap_csum,
+ block_bitmap_bh,
+ (EXT4_BLOCKS_PER_GROUP(sb) + 7) /
+ 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
gdp);
}
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 17a5a57..8dc3055 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -895,6 +895,33 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
if (bh[i] && !buffer_uptodate(bh[i]))
goto out;
+ for (i = 0; i < groups_per_page; i++) {
+ struct ext4_group_desc *desc;
+
+ if (!bh[i] || !bh[i]->b_end_io)
+ continue;
+ desc = ext4_get_group_desc(sb, first_group + i, NULL);
+ if (!desc)
+ goto out;
+
+ if (buffer_verified(bh[i]))
+ continue;
+ ext4_lock_group(sb, first_group + i);
+ if (!ext4_valid_block_bitmap(sb, desc, first_group + i,
+ bh[i]) ||
+ !ext4_bitmap_csum_verify(sb, first_group + i,
+ desc->bg_block_bitmap_csum, bh[i],
+ (EXT4_BLOCKS_PER_GROUP(sb) + 7) /
+ 8)) {
+ ext4_unlock_group(sb, first_group + i);
+ ext4_error(sb, "Corrupt block bitmap, group = %u",
+ first_group + i);
+ goto out;
+ }
+ ext4_unlock_group(sb, first_group + i);
+ set_buffer_verified(bh[i]);
+ }
+
err = 0;
first_block = page->index * blocks_per_page;
for (i = 0; i < blocks_per_page; i++) {
@@ -2829,6 +2856,9 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
}
len = ext4_free_blks_count(sb, gdp) - ac->ac_b_ex.fe_len;
ext4_free_blks_set(sb, gdp, len);
+ ext4_bitmap_csum_set(sb, ac->ac_b_ex.fe_group,
+ &gdp->bg_block_bitmap_csum, bitmap_bh,
+ (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);
ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
@@ -4638,6 +4668,8 @@ do_more:
ret = ext4_free_blks_count(sb, gdp) + count;
ext4_free_blks_set(sb, gdp, ret);
+ ext4_bitmap_csum_set(sb, block_group, &gdp->bg_block_bitmap_csum,
+ bitmap_bh, (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeblocks_counter, count);
@@ -4780,6 +4812,8 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
mb_free_blocks(NULL, &e4b, bit, count);
blk_free_count = blocks_freed + ext4_free_blks_count(sb, desc);
ext4_free_blks_set(sb, desc, blk_free_count);
+ ext4_bitmap_csum_set(sb, block_group, &desc->bg_block_bitmap_csum,
+ bitmap_bh, (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
ext4_unlock_group(sb, block_group);
percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);
Calculate and verify the checksum for each extent tree block. The checksum is
located immediately after the last ext4_extent in the block, which is typically
4-8 bytes in size.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4_extents.h | 25 ++++++++++++++++++-
fs/ext4/extents.c | 64 +++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 84 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index 095c36f..24b106a 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -62,10 +62,22 @@
/*
* ext4_inode has i_block array (60 bytes total).
* The first 12 bytes store ext4_extent_header;
- * the remainder stores an array of ext4_extent.
+ * the remainder stores an array of ext4_extent,
+ * followed by ext4_extent_tail.
*/
/*
+ * This is the extent tail on-disk structure.
+ * All other extent structures are 12 bytes long. It turns out that
+ * block_size % 12 >= 4 for all valid block sizes (1k, 2k, 4k).
+ * Therefore, this tail structure can be crammed into the end of the block
+ * without having to rebalance the tree.
+ */
+struct ext4_extent_tail {
+ __le32 et_checksum; /* crc32c(uuid+inum+extent_block) */
+};
+
+/*
* This is the extent on-disk structure.
* It's used at the bottom of the tree.
*/
@@ -101,6 +113,17 @@ struct ext4_extent_header {
#define EXT4_EXT_MAGIC cpu_to_le16(0xf30a)
+#define EXT4_EXTENT_TAIL_OFFSET(hdr) \
+ (sizeof(struct ext4_extent_header) + \
+ (sizeof(struct ext4_extent) * le16_to_cpu((hdr)->eh_max)))
+
+static inline struct ext4_extent_tail *
+find_ext4_extent_tail(struct ext4_extent_header *eh)
+{
+ return (struct ext4_extent_tail *)(((void *)eh) +
+ EXT4_EXTENT_TAIL_OFFSET(eh));
+}
+
/*
* Array of ext4_ext_path contains path to some extent.
* Creation/lookup routines use it for traversal/splitting/etc.
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 4ac4303..94f09ce 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -41,11 +41,57 @@
#include <linux/falloc.h>
#include <asm/uaccess.h>
#include <linux/fiemap.h>
+#include <linux/crc32c.h>
#include "ext4_jbd2.h"
#include "ext4_extents.h"
#include <trace/events/ext4.h>
+static __le32 ext4_extent_block_csum(struct inode *inode,
+ struct ext4_extent_header *eh)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ __le32 inum = cpu_to_le32(inode->i_ino);
+ __u32 crc = 0;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 0;
+
+ crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+ crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
+ crc = crc32c_le(crc, (__u8 *)eh, EXT4_EXTENT_TAIL_OFFSET(eh));
+ return cpu_to_le32(crc);
+}
+
+static int ext4_extent_block_csum_verify(struct inode *inode,
+ struct ext4_extent_header *eh)
+{
+ struct ext4_extent_tail *et;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ et = find_ext4_extent_tail(eh);
+ if (et->et_checksum != ext4_extent_block_csum(inode, eh))
+ return 0;
+ return 1;
+}
+
+static void ext4_extent_block_csum_set(struct inode *inode,
+ struct ext4_extent_header *eh)
+{
+ struct ext4_extent_tail *et;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ et = find_ext4_extent_tail(eh);
+ et->et_checksum = ext4_extent_block_csum(inode, eh);
+}
+
static int ext4_split_extent(handle_t *handle,
struct inode *inode,
struct ext4_ext_path *path,
@@ -101,6 +147,7 @@ static int ext4_ext_dirty(handle_t *handle, struct inode *inode,
{
int err;
if (path->p_bh) {
+ ext4_extent_block_csum_set(inode, ext_block_hdr(path->p_bh));
/* path points to block */
err = ext4_handle_dirty_metadata(handle, inode, path->p_bh);
} else {
@@ -382,6 +429,12 @@ static int __ext4_ext_check(const char *function, unsigned int line,
error_msg = "invalid extent entries";
goto corrupted;
}
+ /* Verify checksum on non-root extent tree nodes */
+ if (ext_depth(inode) != depth &&
+ !ext4_extent_block_csum_verify(inode, eh)) {
+ error_msg = "extent tree corrupted";
+ goto corrupted;
+ }
return 0;
corrupted:
@@ -922,6 +975,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
le16_add_cpu(&neh->eh_entries, m);
}
+ ext4_extent_block_csum_set(inode, neh);
set_buffer_uptodate(bh);
unlock_buffer(bh);
@@ -1000,6 +1054,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
sizeof(struct ext4_extent_idx) * m);
le16_add_cpu(&neh->eh_entries, m);
}
+ ext4_extent_block_csum_set(inode, neh);
set_buffer_uptodate(bh);
unlock_buffer(bh);
@@ -1098,6 +1153,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
else
neh->eh_max = cpu_to_le16(ext4_ext_space_block(inode, 0));
neh->eh_magic = EXT4_EXT_MAGIC;
+ ext4_extent_block_csum_set(inode, neh);
set_buffer_uptodate(bh);
unlock_buffer(bh);
@@ -2458,10 +2514,6 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
if (uninitialized && num)
ext4_ext_mark_uninitialized(ex);
- err = ext4_ext_dirty(handle, inode, path + depth);
- if (err)
- goto out;
-
/*
* If the extent was completely released,
* we need to remove it from the leaf
@@ -2483,6 +2535,10 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
le16_add_cpu(&eh->eh_entries, -1);
}
+ err = ext4_ext_dirty(handle, inode, path + depth);
+ if (err)
+ goto out;
+
ext_debug("new extent: %u:%u:%llu\n", block, num,
ext4_ext_pblock(ex));
ex--;
Calculate and verify the checksums of extended attribute blocks. This only
applies to separate EA blocks that are pointed to by inode->i_file_acl (i.e.
external EA blocks); the checksum lives in the EA header.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/xattr.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++---------
fs/ext4/xattr.h | 3 +-
2 files changed, 76 insertions(+), 15 deletions(-)
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c757adc..2825e59 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -56,6 +56,7 @@
#include <linux/mbcache.h>
#include <linux/quotaops.h>
#include <linux/rwsem.h>
+#include <linux/crc32c.h>
#include "ext4_jbd2.h"
#include "ext4.h"
#include "xattr.h"
@@ -122,6 +123,58 @@ const struct xattr_handler *ext4_xattr_handlers[] = {
NULL
};
+static __le32 ext4_xattr_block_csum(struct inode *inode,
+ sector_t block_nr,
+ struct ext4_xattr_header *hdr)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ int offset = offsetof(struct ext4_xattr_header, h_checksum);
+ __u32 crc = 0;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 0;
+
+ block_nr = cpu_to_le64(block_nr);
+ crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+ crc = crc32c_le(crc, (__u8 *)&block_nr, sizeof(block_nr));
+ crc = crc32c_le(crc, (__u8 *)hdr, offset);
+ offset += sizeof(hdr->h_checksum); /* skip checksum */
+ crc = crc32c_le(crc, (__u8 *)hdr + offset,
+ EXT4_BLOCK_SIZE(inode->i_sb) - offset);
+ return cpu_to_le32(crc);
+}
+
+static int ext4_xattr_block_csum_verify(struct inode *inode,
+ sector_t block_nr,
+ struct ext4_xattr_header *hdr)
+{
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
+ (hdr->h_checksum != ext4_xattr_block_csum(inode, block_nr, hdr)))
+ return 0;
+ return 1;
+}
+
+static void ext4_xattr_block_csum_set(struct inode *inode,
+ sector_t block_nr,
+ struct ext4_xattr_header *hdr)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ hdr->h_checksum = ext4_xattr_block_csum(inode, block_nr, hdr);
+}
+
+static inline int ext4_handle_dirty_xattr_block(handle_t *handle,
+ struct inode *inode,
+ struct buffer_head *bh)
+{
+ ext4_xattr_block_csum_set(inode, bh->b_blocknr, BHDR(bh));
+ return ext4_handle_dirty_metadata(handle, inode, bh);
+}
+
static inline const struct xattr_handler *
ext4_xattr_handler(int name_index)
{
@@ -156,14 +209,21 @@ ext4_xattr_check_names(struct ext4_xattr_entry *entry, void *end)
}
static inline int
-ext4_xattr_check_block(struct buffer_head *bh)
+ext4_xattr_check_block(struct inode *inode, struct buffer_head *bh)
{
int error;
+ if (buffer_verified(bh))
+ return 0;
+
if (BHDR(bh)->h_magic != cpu_to_le32(EXT4_XATTR_MAGIC) ||
BHDR(bh)->h_blocks != cpu_to_le32(1))
return -EIO;
+ if (!ext4_xattr_block_csum_verify(inode, bh->b_blocknr, BHDR(bh)))
+ return -EIO;
error = ext4_xattr_check_names(BFIRST(bh), bh->b_data + bh->b_size);
+ if (!error)
+ set_buffer_verified(bh);
return error;
}
@@ -226,7 +286,7 @@ ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
goto cleanup;
ea_bdebug(bh, "b_count=%d, refcount=%d",
atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
- if (ext4_xattr_check_block(bh)) {
+ if (ext4_xattr_check_block(inode, bh)) {
bad_block:
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
@@ -370,7 +430,7 @@ ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
goto cleanup;
ea_bdebug(bh, "b_count=%d, refcount=%d",
atomic_read(&(bh->b_count)), le32_to_cpu(BHDR(bh)->h_refcount));
- if (ext4_xattr_check_block(bh)) {
+ if (ext4_xattr_check_block(inode, bh)) {
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
error = -EIO;
@@ -489,7 +549,7 @@ ext4_xattr_release_block(handle_t *handle, struct inode *inode,
EXT4_FREE_BLOCKS_FORGET);
} else {
le32_add_cpu(&BHDR(bh)->h_refcount, -1);
- error = ext4_handle_dirty_metadata(handle, inode, bh);
+ error = ext4_handle_dirty_xattr_block(handle, inode, bh);
if (IS_SYNC(inode))
ext4_handle_sync(handle);
dquot_free_block(inode, 1);
@@ -662,7 +722,7 @@ ext4_xattr_block_find(struct inode *inode, struct ext4_xattr_info *i,
ea_bdebug(bs->bh, "b_count=%d, refcount=%d",
atomic_read(&(bs->bh->b_count)),
le32_to_cpu(BHDR(bs->bh)->h_refcount));
- if (ext4_xattr_check_block(bs->bh)) {
+ if (ext4_xattr_check_block(inode, bs->bh)) {
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
error = -EIO;
@@ -725,9 +785,9 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
if (error == -EIO)
goto bad_block;
if (!error)
- error = ext4_handle_dirty_metadata(handle,
- inode,
- bs->bh);
+ error = ext4_handle_dirty_xattr_block(handle,
+ inode,
+ bs->bh);
if (error)
goto cleanup;
goto inserted;
@@ -796,9 +856,9 @@ inserted:
ea_bdebug(new_bh, "reusing; refcount now=%d",
le32_to_cpu(BHDR(new_bh)->h_refcount));
unlock_buffer(new_bh);
- error = ext4_handle_dirty_metadata(handle,
- inode,
- new_bh);
+ error = ext4_handle_dirty_xattr_block(handle,
+ inode,
+ new_bh);
if (error)
goto cleanup_dquot;
}
@@ -848,8 +908,8 @@ getblk_failed:
set_buffer_uptodate(new_bh);
unlock_buffer(new_bh);
ext4_xattr_cache_insert(new_bh);
- error = ext4_handle_dirty_metadata(handle,
- inode, new_bh);
+ error = ext4_handle_dirty_xattr_block(handle,
+ inode, new_bh);
if (error)
goto cleanup;
}
@@ -1190,7 +1250,7 @@ retry:
error = -EIO;
if (!bh)
goto cleanup;
- if (ext4_xattr_check_block(bh)) {
+ if (ext4_xattr_check_block(inode, bh)) {
EXT4_ERROR_INODE(inode, "bad block %llu",
EXT4_I(inode)->i_file_acl);
error = -EIO;
diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
index 25b7387..b2b20af 100644
--- a/fs/ext4/xattr.h
+++ b/fs/ext4/xattr.h
@@ -27,7 +27,8 @@ struct ext4_xattr_header {
__le32 h_refcount; /* reference count */
__le32 h_blocks; /* number of disk blocks used */
__le32 h_hash; /* hash value of all attributes */
- __u32 h_reserved[4]; /* zero right now */
+ __le32 h_checksum; /* crc32c(uuid+inum+xattrblock) */
+ __u32 h_reserved[3]; /* zero right now */
};
struct ext4_xattr_ibody_header {
The CRC32c polynomial provides better error detection and can be hardware
accelerated on certain machines. To that end, add support for it to jbd2.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/jbd2/Kconfig | 1 +
fs/jbd2/commit.c | 6 +++---
fs/jbd2/recovery.c | 20 +++++++++++++++++---
include/linux/jbd2.h | 1 +
4 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/fs/jbd2/Kconfig b/fs/jbd2/Kconfig
index f32f346..40a126b 100644
--- a/fs/jbd2/Kconfig
+++ b/fs/jbd2/Kconfig
@@ -1,6 +1,7 @@
config JBD2
tristate
select CRC32
+ select LIBCRC32C
help
This is a generic journaling layer for block devices that support
both 32-bit and 64-bit block numbers. It is currently used by
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index eef6979..00387be 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -21,7 +21,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/jiffies.h>
-#include <linux/crc32.h>
+#include <linux/crc32c.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/bio.h>
@@ -125,7 +125,7 @@ static int journal_submit_commit_record(journal_t *journal,
if (JBD2_HAS_COMPAT_FEATURE(journal,
JBD2_FEATURE_COMPAT_CHECKSUM)) {
- tmp->h_chksum_type = JBD2_CRC32_CHKSUM;
+ tmp->h_chksum_type = JBD2_CRC32C_CHKSUM;
tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
}
@@ -287,7 +287,7 @@ static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
__u32 checksum;
addr = kmap_atomic(page, KM_USER0);
- checksum = crc32_be(crc32_sum,
+ checksum = crc32c_le(crc32_sum,
(void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
kunmap_atomic(addr, KM_USER0);
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 1cad869..4bab4dd 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -21,6 +21,7 @@
#include <linux/jbd2.h>
#include <linux/errno.h>
#include <linux/crc32.h>
+#include <linux/crc32c.h>
#endif
/*
@@ -323,7 +324,8 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
* descriptor block.
*/
static int calc_chksums(journal_t *journal, struct buffer_head *bh,
- unsigned long *next_log_block, __u32 *crc32_sum)
+ unsigned long *next_log_block, __u32 *crc32_sum,
+ __u32 *crc32c_sum)
{
int i, num_blks, err;
unsigned long io_block;
@@ -332,6 +334,7 @@ static int calc_chksums(journal_t *journal, struct buffer_head *bh,
num_blks = count_tags(journal, bh);
/* Calculate checksum of the descriptor block. */
*crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
+ *crc32c_sum = crc32c_le(*crc32c_sum, (void *)bh->b_data, bh->b_size);
for (i = 0; i < num_blks; i++) {
io_block = (*next_log_block)++;
@@ -344,6 +347,9 @@ static int calc_chksums(journal_t *journal, struct buffer_head *bh,
} else {
*crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
obh->b_size);
+ *crc32c_sum = crc32c_le(*crc32c_sum,
+ (void *)obh->b_data,
+ obh->b_size);
}
put_bh(obh);
}
@@ -363,6 +369,7 @@ static int do_one_pass(journal_t *journal,
int blocktype;
int tag_bytes = journal_tag_bytes(journal);
__u32 crc32_sum = ~0; /* Transactional Checksums */
+ __u32 crc32c_sum = ~0; /* Transactional Checksums */
/*
* First thing is to establish what we expect to find in the log
@@ -459,7 +466,8 @@ static int do_one_pass(journal_t *journal,
!info->end_transaction) {
if (calc_chksums(journal, bh,
&next_log_block,
- &crc32_sum)) {
+ &crc32_sum,
+ &crc32c_sum)) {
put_bh(bh);
break;
}
@@ -617,7 +625,12 @@ static int do_one_pass(journal_t *journal,
cbh->h_chksum_type == JBD2_CRC32_CHKSUM &&
cbh->h_chksum_size ==
JBD2_CRC32_CHKSUM_SIZE)
- chksum_seen = 1;
+ chksum_seen = 1;
+ else if (crc32c_sum == found_chksum &&
+ cbh->h_chksum_type == JBD2_CRC32C_CHKSUM &&
+ cbh->h_chksum_size ==
+ JBD2_CRC32_CHKSUM_SIZE)
+ chksum_seen = 1;
else if (!(cbh->h_chksum_type == 0 &&
cbh->h_chksum_size == 0 &&
found_chksum == 0 &&
@@ -646,6 +659,7 @@ static int do_one_pass(journal_t *journal,
}
}
crc32_sum = ~0;
+ crc32c_sum = ~0;
}
brelse(bh);
next_commit_ID++;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 38f307b..de3ec23 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -147,6 +147,7 @@ typedef struct journal_header_s
#define JBD2_CRC32_CHKSUM 1
#define JBD2_MD5_CHKSUM 2
#define JBD2_SHA1_CHKSUM 3
+#define JBD2_CRC32C_CHKSUM 4
#define JBD2_CRC32_CHKSUM_SIZE 4
Calculate and verify the checksums for directory leaf blocks (i.e. blocks that
only contain actual directory entries). The checksum lives in what looks to be
an unused directory entry with a 0 name_len at the end of the block. This
scheme is not used for internal htree nodes because the mechanism in place
there only costs one dx_entry, whereas the "empty" directory entry would cost
two dx_entries.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/dir.c | 12 +++
fs/ext4/ext4.h | 13 +++
fs/ext4/namei.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 269 insertions(+), 15 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 164c560..bc40c9e 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -180,6 +180,18 @@ static int ext4_readdir(struct file *filp,
continue;
}
+ /* Check the checksum */
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(inode,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_FILE(filp, 0, "directory fails checksum "
+ "at offset %llu",
+ (unsigned long long)filp->f_pos);
+ filp->f_pos += sb->s_blocksize - offset;
+ continue;
+ }
+ set_buffer_verified(bh);
+
revalidate:
/* If the dir block has changed since the last call to
* readdir(2), then we might be pointing to an invalid
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index df149b3..b7aa5b5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1471,6 +1471,17 @@ struct ext4_dir_entry_2 {
};
/*
+ * This is a bogus directory entry at the end of each leaf block that
+ * records checksums.
+ */
+struct ext4_dir_entry_tail {
+ __le32 reserved_zero1; /* Pretend to be unused */
+ __le16 rec_len; /* 12 */
+ __le16 reserved_zero2; /* Zero name length */
+ __le32 checksum; /* crc32c(uuid+inum+dirblock) */
+};
+
+/*
* Ext4 directory file types. Only the low 3 bits are used. The
* other bits are reserved for now.
*/
@@ -1875,6 +1886,8 @@ extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
extern int ext4_ext_migrate(struct inode *);
/* namei.c */
+extern int ext4_dirent_csum_verify(struct inode *inode,
+ struct ext4_dir_entry *dirent);
extern int ext4_orphan_add(handle_t *, struct inode *);
extern int ext4_orphan_del(handle_t *, struct inode *);
extern int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 89797bf..2d0fdb9 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -191,6 +191,104 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
struct inode *inode);
/* checksumming functions */
+static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
+ struct ext4_dir_entry *de)
+{
+ struct ext4_dir_entry *d, *top;
+ struct ext4_dir_entry_tail *t;
+
+ d = de;
+ top = (struct ext4_dir_entry *)(((void *)de) +
+ (EXT4_BLOCK_SIZE(inode->i_sb) -
+ sizeof(struct ext4_dir_entry_tail)));
+ while (d < top && d->rec_len)
+ d = (struct ext4_dir_entry *)(((void *)d) +
+ le16_to_cpu(d->rec_len));
+
+ if (d != top)
+ return NULL;
+
+ t = (struct ext4_dir_entry_tail *)d;
+ if (t->reserved_zero1 ||
+ le16_to_cpu(t->rec_len) != sizeof(struct ext4_dir_entry_tail) ||
+ t->reserved_zero2)
+ return NULL;
+
+ return t;
+}
+
+static __le32 ext4_dirent_csum(struct inode *inode,
+ struct ext4_dir_entry *dirent)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct ext4_dir_entry_tail *t;
+ __le32 inum = cpu_to_le32(inode->i_ino);
+ int size;
+ __u32 crc = 0;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 0;
+
+ t = get_dirent_tail(inode, dirent);
+ if (!t)
+ return 0;
+
+ size = (void *)t - (void *)dirent;
+ crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+ crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
+ crc = crc32c_le(crc, (__u8 *)dirent, size);
+ return cpu_to_le32(crc);
+}
+
+int ext4_dirent_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent)
+{
+ struct ext4_dir_entry_tail *t;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ t = get_dirent_tail(inode, dirent);
+ if (!t) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
+ "leaf for checksum. Please run e2fsck -D.");
+ return 0;
+ }
+
+ if (t->checksum != ext4_dirent_csum(inode, dirent))
+ return 0;
+
+ return 1;
+}
+
+static void ext4_dirent_csum_set(struct inode *inode,
+ struct ext4_dir_entry *dirent)
+{
+ struct ext4_dir_entry_tail *t;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ t = get_dirent_tail(inode, dirent);
+ if (!t) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
+ "leaf for checksum. Please run e2fsck -D.");
+ return;
+ }
+
+ t->checksum = ext4_dirent_csum(inode, dirent);
+}
+
+static inline int ext4_handle_dirty_dirent_node(handle_t *handle,
+ struct inode *inode,
+ struct buffer_head *bh)
+{
+ ext4_dirent_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
+ return ext4_handle_dirty_metadata(handle, inode, bh);
+}
+
static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
struct ext4_dir_entry *dirent,
int *offset)
@@ -748,6 +846,11 @@ static int htree_dirblock_to_tree(struct file *dir_file,
if (!(bh = ext4_bread (NULL, dir, block, 0, &err)))
return err;
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
+ return -EIO;
+ set_buffer_verified(bh);
+
de = (struct ext4_dir_entry_2 *) bh->b_data;
top = (struct ext4_dir_entry_2 *) ((char *) de +
dir->i_sb->s_blocksize -
@@ -1106,6 +1209,15 @@ restart:
brelse(bh);
goto next;
}
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(dir, "checksumming directory "
+ "block %lu", (unsigned long)block);
+ brelse(bh);
+ goto next;
+ }
+ set_buffer_verified(bh);
i = search_dirblock(bh, dir, d_name,
block << EXT4_BLOCK_SIZE_BITS(sb), res_dir);
if (i == 1) {
@@ -1157,6 +1269,16 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir, const struct q
if (!(bh = ext4_bread(NULL, dir, block, 0, err)))
goto errout;
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(dir, "checksumming directory "
+ "block %lu", (unsigned long)block);
+ brelse(bh);
+ *err = -EIO;
+ goto errout;
+ }
+ set_buffer_verified(bh);
retval = search_dirblock(bh, dir, d_name,
block << EXT4_BLOCK_SIZE_BITS(sb),
res_dir);
@@ -1329,8 +1451,14 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
char *data1 = (*bh)->b_data, *data2;
unsigned split, move, size;
struct ext4_dir_entry_2 *de = NULL, *de2;
+ struct ext4_dir_entry_tail *t;
+ int csum_size = 0;
int err = 0, i;
+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
+
bh2 = ext4_append (handle, dir, &newblock, &err);
if (!(bh2)) {
brelse(*bh);
@@ -1377,10 +1505,24 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
/* Fancy dance to stay within two buffers */
de2 = dx_move_dirents(data1, data2, map + split, count - split, blocksize);
de = dx_pack_dirents(data1, blocksize);
- de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
+ de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
+ (char *) de,
blocksize);
- de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2,
+ de2->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) -
+ (char *) de2,
blocksize);
+ if (csum_size) {
+ t = (struct ext4_dir_entry_tail *)(data2 +
+ (blocksize - csum_size));
+ memset(t, 0, csum_size);
+ t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
+
+ t = (struct ext4_dir_entry_tail *)(data1 +
+ (blocksize - csum_size));
+ memset(t, 0, csum_size);
+ t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
+ }
+
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));
@@ -1391,7 +1533,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
de = de2;
}
dx_insert_block(frame, hash2 + continued, newblock);
- err = ext4_handle_dirty_metadata(handle, dir, bh2);
+ err = ext4_handle_dirty_dirent_node(handle, dir, bh2);
if (err)
goto journal_error;
err = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
@@ -1431,11 +1573,16 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
unsigned short reclen;
int nlen, rlen, err;
char *top;
+ int csum_size = 0;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
reclen = EXT4_DIR_REC_LEN(namelen);
if (!de) {
de = (struct ext4_dir_entry_2 *)bh->b_data;
- top = bh->b_data + blocksize - reclen;
+ top = bh->b_data + (blocksize - csum_size) - reclen;
while ((char *) de <= top) {
if (ext4_check_dir_entry(dir, NULL, de, bh, offset))
return -EIO;
@@ -1491,7 +1638,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
dir->i_version++;
ext4_mark_inode_dirty(handle, dir);
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, dir, bh);
+ err = ext4_handle_dirty_dirent_node(handle, dir, bh);
if (err)
ext4_std_error(dir->i_sb, err);
return 0;
@@ -1512,6 +1659,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
struct dx_frame frames[2], *frame;
struct dx_entry *entries;
struct ext4_dir_entry_2 *de, *de2;
+ struct ext4_dir_entry_tail *t;
char *data1, *top;
unsigned len;
int retval;
@@ -1519,6 +1667,11 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
struct dx_hash_info hinfo;
ext4_lblk_t block;
struct fake_dirent *fde;
+ int csum_size = 0;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
blocksize = dir->i_sb->s_blocksize;
dxtrace(printk(KERN_DEBUG "Creating index: inode %lu\n", dir->i_ino));
@@ -1539,7 +1692,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
brelse(bh);
return -EIO;
}
- len = ((char *) root) + blocksize - (char *) de;
+ len = ((char *) root) + (blocksize - csum_size) - (char *) de;
/* Allocate new block for the 0th block's dirents */
bh2 = ext4_append(handle, dir, &block, &retval);
@@ -1555,8 +1708,17 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
top = data1 + len;
while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top)
de = de2;
- de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
+ de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
+ (char *) de,
blocksize);
+
+ if (csum_size) {
+ t = (struct ext4_dir_entry_tail *)(data1 +
+ (blocksize - csum_size));
+ memset(t, 0, csum_size);
+ t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
+ }
+
/* Initialize the root; the dot dirents already exist */
de = (struct ext4_dir_entry_2 *) (&root->dotdot);
de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2),
@@ -1582,7 +1744,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
bh = bh2;
ext4_handle_dirty_dx_node(handle, dir, frame->bh);
- ext4_handle_dirty_metadata(handle, dir, bh);
+ ext4_handle_dirty_dirent_node(handle, dir, bh);
de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
if (!de) {
@@ -1618,11 +1780,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
struct inode *dir = dentry->d_parent->d_inode;
struct buffer_head *bh;
struct ext4_dir_entry_2 *de;
+ struct ext4_dir_entry_tail *t;
struct super_block *sb;
int retval;
int dx_fallback=0;
unsigned blocksize;
ext4_lblk_t block, blocks;
+ int csum_size = 0;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
sb = dir->i_sb;
blocksize = sb->s_blocksize;
@@ -1641,6 +1809,11 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
bh = ext4_bread(handle, dir, block, 0, &retval);
if(!bh)
return retval;
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data))
+ return -EIO;
+ set_buffer_verified(bh);
retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
if (retval != -ENOSPC) {
brelse(bh);
@@ -1657,7 +1830,15 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
return retval;
de = (struct ext4_dir_entry_2 *) bh->b_data;
de->inode = 0;
- de->rec_len = ext4_rec_len_to_disk(blocksize, blocksize);
+ de->rec_len = ext4_rec_len_to_disk(blocksize - csum_size, blocksize);
+
+ if (csum_size) {
+ t = (struct ext4_dir_entry_tail *)(((void *)bh->b_data) +
+ (blocksize - csum_size));
+ memset(t, 0, csum_size);
+ t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
+ }
+
retval = add_dirent_to_buf(handle, dentry, inode, de, bh);
brelse(bh);
if (retval == 0)
@@ -1689,6 +1870,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
if (!(bh = ext4_bread(handle,dir, dx_get_block(frame->at), 0, &err)))
goto cleanup;
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
+ goto journal_error;
+ set_buffer_verified(bh);
+
BUFFER_TRACE(bh, "get_write_access");
err = ext4_journal_get_write_access(handle, bh);
if (err)
@@ -1814,12 +2000,17 @@ static int ext4_delete_entry(handle_t *handle,
{
struct ext4_dir_entry_2 *de, *pde;
unsigned int blocksize = dir->i_sb->s_blocksize;
+ int csum_size = 0;
int i, err;
+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
+
i = 0;
pde = NULL;
de = (struct ext4_dir_entry_2 *) bh->b_data;
- while (i < bh->b_size) {
+ while (i < bh->b_size - csum_size) {
if (ext4_check_dir_entry(dir, NULL, de, bh, i))
return -EIO;
if (de == de_del) {
@@ -1840,7 +2031,7 @@ static int ext4_delete_entry(handle_t *handle,
de->inode = 0;
dir->i_version++;
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, dir, bh);
+ err = ext4_handle_dirty_dirent_node(handle, dir, bh);
if (unlikely(err)) {
ext4_std_error(dir->i_sb, err);
return err;
@@ -1983,9 +2174,15 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, int mode)
struct inode *inode;
struct buffer_head *dir_block = NULL;
struct ext4_dir_entry_2 *de;
+ struct ext4_dir_entry_tail *t;
unsigned int blocksize = dir->i_sb->s_blocksize;
+ int csum_size = 0;
int err, retries = 0;
+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ csum_size = sizeof(struct ext4_dir_entry_tail);
+
if (EXT4_DIR_LINK_MAX(dir))
return -EMLINK;
@@ -2026,16 +2223,26 @@ retry:
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
de = ext4_next_entry(de, blocksize);
de->inode = cpu_to_le32(dir->i_ino);
- de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(1),
+ de->rec_len = ext4_rec_len_to_disk(blocksize -
+ (csum_size + EXT4_DIR_REC_LEN(1)),
blocksize);
de->name_len = 2;
strcpy(de->name, "..");
ext4_set_de_type(dir->i_sb, de, S_IFDIR);
inode->i_nlink = 2;
+
+ if (csum_size) {
+ t = (struct ext4_dir_entry_tail *)(((void *)dir_block->b_data) +
+ (blocksize - csum_size));
+ memset(t, 0, csum_size);
+ t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
+ }
+
BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
- err = ext4_handle_dirty_metadata(handle, inode, dir_block);
+ err = ext4_handle_dirty_dirent_node(handle, inode, dir_block);
if (err)
goto out_clear_inode;
+ set_buffer_verified(dir_block);
err = ext4_mark_inode_dirty(handle, inode);
if (!err)
err = ext4_add_entry(handle, dentry, inode);
@@ -2085,6 +2292,14 @@ static int empty_dir(struct inode *inode)
inode->i_ino);
return 1;
}
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(inode,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(inode, "checksum error reading directory "
+ "lblock 0");
+ return -EIO;
+ }
+ set_buffer_verified(bh);
de = (struct ext4_dir_entry_2 *) bh->b_data;
de1 = ext4_next_entry(de, sb->s_blocksize);
if (le32_to_cpu(de->inode) != inode->i_ino ||
@@ -2116,6 +2331,14 @@ static int empty_dir(struct inode *inode)
offset += sb->s_blocksize;
continue;
}
+ if (!buffer_verified(bh) &&
+ !ext4_dirent_csum_verify(inode,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ EXT4_ERROR_INODE(inode, "checksum error "
+ "reading directory lblock 0");
+ return -EIO;
+ }
+ set_buffer_verified(bh);
de = (struct ext4_dir_entry_2 *) bh->b_data;
}
if (ext4_check_dir_entry(inode, NULL, de, bh, offset)) {
@@ -2616,6 +2839,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
dir_bh = ext4_bread(handle, old_inode, 0, 0, &retval);
if (!dir_bh)
goto end_rename;
+ if (!buffer_verified(dir_bh) &&
+ !ext4_dirent_csum_verify(old_inode,
+ (struct ext4_dir_entry *)dir_bh->b_data))
+ goto end_rename;
+ set_buffer_verified(dir_bh);
if (le32_to_cpu(PARENT_INO(dir_bh->b_data,
old_dir->i_sb->s_blocksize)) != old_dir->i_ino)
goto end_rename;
@@ -2646,7 +2874,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
ext4_current_time(new_dir);
ext4_mark_inode_dirty(handle, new_dir);
BUFFER_TRACE(new_bh, "call ext4_handle_dirty_metadata");
- retval = ext4_handle_dirty_metadata(handle, new_dir, new_bh);
+ retval = ext4_handle_dirty_dirent_node(handle, new_dir, new_bh);
if (unlikely(retval)) {
ext4_std_error(new_dir->i_sb, retval);
goto end_rename;
@@ -2700,7 +2928,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
cpu_to_le32(new_dir->i_ino);
BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
- retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
+ retval = ext4_handle_dirty_dirent_node(handle, old_inode,
+ dir_bh);
if (retval) {
ext4_std_error(old_dir->i_sb, retval);
goto end_rename;
Calculate and verify the superblock checksum. Since the UUID and block group
number are embedded in each copy of the superblock, we need only checksum the
entire block. Refactor some of the code to eliminate open-coding of the
checksum update call.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 8 +++++++-
fs/ext4/ext4_jbd2.c | 9 ++++++++-
fs/ext4/ext4_jbd2.h | 7 +++++--
fs/ext4/inode.c | 3 +--
fs/ext4/namei.c | 4 ++--
fs/ext4/resize.c | 6 +++++-
fs/ext4/super.c | 43 +++++++++++++++++++++++++++++++++++++++++++
7 files changed, 71 insertions(+), 9 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b7aa5b5..1e93410 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1067,7 +1067,9 @@ struct ext4_super_block {
__u8 s_last_error_func[32]; /* function where the error happened */
#define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
__u8 s_mount_opts[64];
- __le32 s_reserved[112]; /* Padding to the end of the block */
+ __u32 s_reserved1[3]; /* Padding */
+ __u32 s_checksum; /* crc32c(superblock) */
+ __le32 s_reserved2[108]; /* Padding to the end of the block */
};
#define EXT4_S_ERR_LEN (EXT4_S_ERR_END - EXT4_S_ERR_START)
@@ -1901,6 +1903,10 @@ extern int ext4_group_extend(struct super_block *sb,
ext4_fsblk_t n_blocks_count);
/* super.c */
+extern int ext4_superblock_csum_verify(struct super_block *sb,
+ struct ext4_super_block *es);
+extern void ext4_superblock_csum_set(struct super_block *sb,
+ struct ext4_super_block *es);
extern void *ext4_kvmalloc(size_t size, gfp_t flags);
extern void *ext4_kvzalloc(size_t size, gfp_t flags);
extern void ext4_kvfree(void *ptr);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index f5240aa..04ddc97 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -136,16 +136,23 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
}
int __ext4_handle_dirty_super(const char *where, unsigned int line,
- handle_t *handle, struct super_block *sb)
+ handle_t *handle, struct super_block *sb,
+ int now)
{
struct buffer_head *bh = EXT4_SB(sb)->s_sbh;
int err = 0;
if (ext4_handle_valid(handle)) {
+ ext4_superblock_csum_set(sb,
+ (struct ext4_super_block *)bh->b_data);
err = jbd2_journal_dirty_metadata(handle, bh);
if (err)
ext4_journal_abort_handle(where, line, __func__,
bh, handle, err);
+ } else if (now) {
+ ext4_superblock_csum_set(sb,
+ (struct ext4_super_block *)bh->b_data);
+ mark_buffer_dirty(bh);
} else
sb->s_dirt = 1;
return err;
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 5802fa1..ed9b78d 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -141,7 +141,8 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
struct buffer_head *bh);
int __ext4_handle_dirty_super(const char *where, unsigned int line,
- handle_t *handle, struct super_block *sb);
+ handle_t *handle, struct super_block *sb,
+ int now);
#define ext4_journal_get_write_access(handle, bh) \
__ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
@@ -153,8 +154,10 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
#define ext4_handle_dirty_metadata(handle, inode, bh) \
__ext4_handle_dirty_metadata(__func__, __LINE__, (handle), (inode), \
(bh))
+#define ext4_handle_dirty_super_now(handle, sb) \
+ __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb), 1)
#define ext4_handle_dirty_super(handle, sb) \
- __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))
+ __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb), 0)
handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks);
int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e24ba98..52e9b67 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3761,8 +3761,7 @@ static int ext4_do_update_inode(handle_t *handle,
EXT4_FEATURE_RO_COMPAT_LARGE_FILE);
sb->s_dirt = 1;
ext4_handle_sync(handle);
- err = ext4_handle_dirty_metadata(handle, NULL,
- EXT4_SB(sb)->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, sb);
}
}
raw_inode->i_generation = cpu_to_le32(inode->i_generation);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2d0fdb9..5ebf281 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2407,7 +2407,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
/* Insert this inode at the head of the on-disk orphan list... */
NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
- err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, sb);
rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
if (!err)
err = rc;
@@ -2480,7 +2480,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
if (err)
goto out_brelse;
sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
- err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, inode->i_sb);
} else {
struct ext4_iloc iloc2;
struct inode *i_prev =
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 707d3f1..2ad7008 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -511,7 +511,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
ext4_kvfree(o_group_desc);
le16_add_cpu(&es->s_reserved_gdt_blocks, -1);
- err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
+ err = ext4_handle_dirty_super_now(handle, sb);
if (err)
ext4_std_error(sb, err);
@@ -682,6 +682,8 @@ static void update_backups(struct super_block *sb,
goto exit_err;
}
+ ext4_superblock_csum_set(sb, (struct ext4_super_block *)data);
+
while ((group = ext4_list_backups(sb, &three, &five, &seven)) < last) {
struct buffer_head *bh;
@@ -925,6 +927,8 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
/* Update the global fs size fields */
sbi->s_groups_count++;
+ ext4_superblock_csum_set(sb,
+ (struct ext4_super_block *)primary->b_data);
err = ext4_handle_dirty_metadata(handle, NULL, primary);
if (unlikely(err)) {
ext4_std_error(sb, err);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 44d0c8d..b254274 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -38,6 +38,7 @@
#include <linux/ctype.h>
#include <linux/log2.h>
#include <linux/crc16.h>
+#include <linux/crc32c.h>
#include <linux/cleancache.h>
#include <asm/uaccess.h>
@@ -110,6 +111,41 @@ static struct file_system_type ext3_fs_type = {
#define IS_EXT3_SB(sb) (0)
#endif
+static __le32 ext4_superblock_csum(struct super_block *sb,
+ struct ext4_super_block *es)
+{
+ int offset = offsetof(struct ext4_super_block, s_checksum);
+ __u32 crc = 0;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 0;
+
+ crc = crc32c_le(~0, (char *)es, offset);
+
+ return cpu_to_le32(crc);
+}
+
+int ext4_superblock_csum_verify(struct super_block *sb,
+ struct ext4_super_block *es)
+{
+ if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
+ (es->s_checksum != ext4_superblock_csum(sb, es)))
+ return 0;
+ return 1;
+}
+
+void ext4_superblock_csum_set(struct super_block *sb,
+ struct ext4_super_block *es)
+{
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ es->s_checksum = ext4_superblock_csum(sb, es);
+}
+
void *ext4_kvmalloc(size_t size, gfp_t flags)
{
void *ret;
@@ -3151,6 +3187,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->s_magic = le16_to_cpu(es->s_magic);
if (sb->s_magic != EXT4_SUPER_MAGIC)
goto cantfind_ext4;
+ if (!ext4_superblock_csum_verify(sb, es)) {
+ ext4_msg(sb, KERN_ERR, "VFS: Found ext4 filesystem with "
+ "invalid superblock checksum. Run e2fsck?");
+ silent = 1;
+ goto cantfind_ext4;
+ }
sbi->s_kbytes_written = le64_to_cpu(es->s_kbytes_written);
/* Set defaults before we parse the mount options */
@@ -4107,6 +4149,7 @@ static int ext4_commit_super(struct super_block *sb, int sync)
&EXT4_SB(sb)->s_freeinodes_counter));
sb->s_dirt = 0;
BUFFER_TRACE(sbh, "marking dirty");
+ ext4_superblock_csum_set(sb, es);
mark_buffer_dirty(sbh);
if (sync) {
error = sync_dirty_buffer(sbh);
Extend the inode checksum to cover the empty space between the end of the
inode's data fields and the end of the space allocated for the inode. This
enables us to cover extended attribute data that might live in the empty space.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/inode.c | 4 +---
1 files changed, 1 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 44a7f88..e24ba98 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -53,7 +53,6 @@
static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- struct ext4_inode_info *ei = EXT4_I(inode);
int offset = offsetof(struct ext4_inode, i_checksum);
__le32 inum = cpu_to_le32(inode->i_ino);
__u32 crc = 0;
@@ -70,8 +69,7 @@ static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
crc = crc32c_le(crc, (__u8 *)raw, offset);
offset += sizeof(raw->i_checksum); /* skip checksum */
crc = crc32c_le(crc, (__u8 *)raw + offset,
- EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize -
- offset);
+ EXT4_INODE_SIZE(inode->i_sb) - offset);
return cpu_to_le32(crc);
}
Calculate and verify the checksum for directory index tree (htree) node blocks.
The checksum is stored in the last 4 bytes of the htree block and requires the
dx_entry array to stop 1 dx_entry short of the end of the block.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/namei.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 175 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a067835..89797bf 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -34,6 +34,7 @@
#include <linux/quotaops.h>
#include <linux/buffer_head.h>
#include <linux/bio.h>
+#include <linux/crc32c.h>
#include "ext4.h"
#include "ext4_jbd2.h"
@@ -145,6 +146,15 @@ struct dx_map_entry
u16 size;
};
+/*
+ * This goes at the end of each htree block. If you want to use the
+ * reserved field, you'll have to update the checksum code to include it.
+ */
+struct dx_tail {
+ u32 reserved;
+ u32 checksum; /* crc32c(uuid+inum+dirblock) */
+};
+
static inline ext4_lblk_t dx_get_block(struct dx_entry *entry);
static void dx_set_block(struct dx_entry *entry, ext4_lblk_t value);
static inline unsigned dx_get_hash(struct dx_entry *entry);
@@ -180,6 +190,130 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
struct inode *inode);
+/* checksumming functions */
+static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
+ struct ext4_dir_entry *dirent,
+ int *offset)
+{
+ struct ext4_dir_entry *dp;
+ struct dx_root_info *root;
+ int count_offset;
+
+ if (le16_to_cpu(dirent->rec_len) == EXT4_BLOCK_SIZE(inode->i_sb))
+ count_offset = 8;
+ else if (le16_to_cpu(dirent->rec_len) == 12) {
+ dp = (struct ext4_dir_entry *)(((void *)dirent) + 12);
+ if (le16_to_cpu(dp->rec_len) !=
+ EXT4_BLOCK_SIZE(inode->i_sb) - 12)
+ return NULL;
+ root = (struct dx_root_info *)(((void *)dp + 12));
+ if (root->reserved_zero ||
+ root->info_length != sizeof(struct dx_root_info))
+ return NULL;
+ count_offset = 32;
+ } else
+ return NULL;
+
+ if (offset)
+ *offset = count_offset;
+ return (struct dx_countlimit *)(((void *)dirent) + count_offset);
+}
+
+static __le32 ext4_dx_csum(struct inode *inode, struct ext4_dir_entry *dirent)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ __le32 inum = cpu_to_le32(inode->i_ino);
+ __u32 crc = 0;
+ int size, count_offset, limit, count;
+ struct dx_countlimit *c;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 0;
+
+ c = get_dx_countlimit(inode, dirent, &count_offset);
+ if (!c)
+ return 0;
+ limit = le16_to_cpu(c->limit);
+ count = le16_to_cpu(c->count);
+ if (count_offset + (limit * sizeof(struct dx_entry)) >
+ EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail))
+ return 0;
+ size = count_offset + (count * sizeof(struct dx_entry));
+
+ crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+ crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
+ crc = crc32c_le(crc, (__u8 *)dirent, size);
+ return cpu_to_le32(crc);
+}
+
+static int ext4_dx_csum_verify(struct inode *inode,
+ struct ext4_dir_entry *dirent)
+{
+ struct dx_countlimit *c;
+ struct dx_tail *t;
+ int count_offset, limit, count;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 1;
+
+ c = get_dx_countlimit(inode, dirent, &count_offset);
+ if (!c) {
+ EXT4_ERROR_INODE(inode, "dir seems corrupt? Run e2fsck -D.");
+ return 1;
+ }
+ limit = le16_to_cpu(c->limit);
+ count = le16_to_cpu(c->count);
+ if (count_offset + (limit * sizeof(struct dx_entry)) >
+ EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail)) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space for "
+ "tree checksum found. Run e2fsck -D.");
+ return 1;
+ }
+ t = (struct dx_tail *)(((struct dx_entry *)c) + limit);
+
+ if (t->checksum != ext4_dx_csum(inode, dirent))
+ return 0;
+ return 1;
+}
+
+static void ext4_dx_csum_set(struct inode *inode, struct ext4_dir_entry *dirent)
+{
+ struct dx_countlimit *c;
+ struct dx_tail *t;
+ int count_offset, limit, count;
+
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ c = get_dx_countlimit(inode, dirent, &count_offset);
+ if (!c) {
+ EXT4_ERROR_INODE(inode, "dir seems corrupt? Run e2fsck -D.");
+ return;
+ }
+ limit = le16_to_cpu(c->limit);
+ count = le16_to_cpu(c->count);
+ if (count_offset + (limit * sizeof(struct dx_entry)) >
+ EXT4_BLOCK_SIZE(inode->i_sb) - sizeof(struct dx_tail)) {
+ EXT4_ERROR_INODE(inode, "metadata_csum set but no space for "
+ "tree checksum. Run e2fsck -D.");
+ return;
+ }
+ t = (struct dx_tail *)(((struct dx_entry *)c) + limit);
+
+ t->checksum = ext4_dx_csum(inode, dirent);
+}
+
+static inline int ext4_handle_dirty_dx_node(handle_t *handle,
+ struct inode *inode,
+ struct buffer_head *bh)
+{
+ ext4_dx_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
+ return ext4_handle_dirty_metadata(handle, inode, bh);
+}
+
/*
* p is at least 6 bytes before the end of page
*/
@@ -239,12 +373,20 @@ static inline unsigned dx_root_limit(struct inode *dir, unsigned infosize)
{
unsigned entry_space = dir->i_sb->s_blocksize - EXT4_DIR_REC_LEN(1) -
EXT4_DIR_REC_LEN(2) - infosize;
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ entry_space -= sizeof(struct dx_tail);
return entry_space / sizeof(struct dx_entry);
}
static inline unsigned dx_node_limit(struct inode *dir)
{
unsigned entry_space = dir->i_sb->s_blocksize - EXT4_DIR_REC_LEN(0);
+
+ if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ entry_space -= sizeof(struct dx_tail);
return entry_space / sizeof(struct dx_entry);
}
@@ -390,6 +532,15 @@ dx_probe(const struct qstr *d_name, struct inode *dir,
goto fail;
}
+ if (!buffer_verified(bh) &&
+ !ext4_dx_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data)) {
+ ext4_warning(dir->i_sb, "Root failed checksum");
+ brelse(bh);
+ *err = ERR_BAD_DX_DIR;
+ goto fail;
+ }
+ set_buffer_verified(bh);
+
entries = (struct dx_entry *) (((char *)&root->info) +
root->info.info_length);
@@ -450,6 +601,17 @@ dx_probe(const struct qstr *d_name, struct inode *dir,
if (!(bh = ext4_bread (NULL,dir, dx_get_block(at), 0, err)))
goto fail2;
at = entries = ((struct dx_node *) bh->b_data)->entries;
+
+ if (!buffer_verified(bh) &&
+ !ext4_dx_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ ext4_warning(dir->i_sb, "Node failed checksum");
+ brelse(bh);
+ *err = ERR_BAD_DX_DIR;
+ goto fail;
+ }
+ set_buffer_verified(bh);
+
if (dx_get_limit(entries) != dx_node_limit (dir)) {
ext4_warning(dir->i_sb,
"dx entry: limit != node limit");
@@ -549,6 +711,15 @@ static int ext4_htree_next_block(struct inode *dir, __u32 hash,
if (!(bh = ext4_bread(NULL, dir, dx_get_block(p->at),
0, &err)))
return err; /* Failure */
+
+ if (!buffer_verified(bh) &&
+ !ext4_dx_csum_verify(dir,
+ (struct ext4_dir_entry *)bh->b_data)) {
+ ext4_warning(dir->i_sb, "Node failed checksum");
+ return -EIO;
+ }
+ set_buffer_verified(bh);
+
p++;
brelse(p->bh);
p->bh = bh;
@@ -1223,7 +1394,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
err = ext4_handle_dirty_metadata(handle, dir, bh2);
if (err)
goto journal_error;
- err = ext4_handle_dirty_metadata(handle, dir, frame->bh);
+ err = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
if (err)
goto journal_error;
brelse(bh2);
@@ -1410,7 +1581,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
frame->bh = bh;
bh = bh2;
- ext4_handle_dirty_metadata(handle, dir, frame->bh);
+ ext4_handle_dirty_dx_node(handle, dir, frame->bh);
ext4_handle_dirty_metadata(handle, dir, bh);
de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
@@ -1585,7 +1756,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
dxtrace(dx_show_index("node", frames[1].entries));
dxtrace(dx_show_index("node",
((struct dx_node *) bh2->b_data)->entries));
- err = ext4_handle_dirty_metadata(handle, dir, bh2);
+ err = ext4_handle_dirty_dx_node(handle, dir, bh2);
if (err)
goto journal_error;
brelse (bh2);
@@ -1611,7 +1782,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
if (err)
goto journal_error;
}
- err = ext4_handle_dirty_metadata(handle, dir, frames[0].bh);
+ err = ext4_handle_dirty_dx_node(handle, dir, frames[0].bh);
if (err) {
ext4_std_error(inode->i_sb, err);
goto cleanup;
This patch introduces to ext4 the ability to calculate and verify inode
checksums. This requires the use of a new ro compatibility flag and some
accompanying e2fsprogs patches to provide the relevant features in tune2fs and
e2fsck.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/ext4.h | 4 ++--
fs/ext4/inode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 64 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f79ddac..e2361cc 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -609,7 +609,7 @@ struct ext4_inode {
__le16 l_i_file_acl_high;
__le16 l_i_uid_high; /* these 2 fields */
__le16 l_i_gid_high; /* were reserved2[0] */
- __u32 l_i_reserved2;
+ __le32 l_i_checksum; /* crc32c(uuid+inum+inode) */
} linux2;
struct {
__le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
@@ -727,7 +727,7 @@ do { \
#define i_gid_low i_gid
#define i_uid_high osd2.linux2.l_i_uid_high
#define i_gid_high osd2.linux2.l_i_gid_high
-#define i_reserved2 osd2.linux2.l_i_reserved2
+#define i_checksum osd2.linux2.l_i_checksum
#elif defined(__GNU__)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c4da98a..44a7f88 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -38,6 +38,7 @@
#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/ratelimit.h>
+#include <linux/crc32c.h>
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -49,6 +50,53 @@
#define MPAGE_DA_EXTENT_TAIL 0x01
+static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ int offset = offsetof(struct ext4_inode, i_checksum);
+ __le32 inum = cpu_to_le32(inode->i_ino);
+ __u32 crc = 0;
+
+ if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
+ cpu_to_le32(EXT4_OS_LINUX))
+ return 0;
+ if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return 0;
+
+ crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
+ crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
+ crc = crc32c_le(crc, (__u8 *)raw, offset);
+ offset += sizeof(raw->i_checksum); /* skip checksum */
+ crc = crc32c_le(crc, (__u8 *)raw + offset,
+ EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize -
+ offset);
+ return cpu_to_le32(crc);
+}
+
+static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw)
+{
+ if (EXT4_SB(inode->i_sb)->s_es->s_creator_os ==
+ cpu_to_le32(EXT4_OS_LINUX) &&
+ EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
+ (raw->i_checksum != ext4_inode_csum(inode, raw)))
+ return 0;
+ return 1;
+}
+
+static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw)
+{
+ if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
+ cpu_to_le32(EXT4_OS_LINUX) ||
+ !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
+ return;
+
+ raw->i_checksum = ext4_inode_csum(inode, raw);
+}
+
static inline int ext4_begin_ordered_truncate(struct inode *inode,
loff_t new_size)
{
@@ -3410,6 +3458,15 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
if (ret < 0)
goto bad_inode;
raw_inode = ext4_raw_inode(&iloc);
+
+ if (!ext4_inode_csum_verify(inode, raw_inode)) {
+ EXT4_ERROR_INODE(inode, "checksum invalid (0x%x != 0x%x)",
+ le32_to_cpu(ext4_inode_csum(inode, raw_inode)),
+ le32_to_cpu(raw_inode->i_checksum));
+ ret = -EIO;
+ goto bad_inode;
+ }
+
inode->i_mode = le16_to_cpu(raw_inode->i_mode);
inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
@@ -3490,6 +3547,9 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
EXT4_INODE_SIZE(inode->i_sb)) {
+ EXT4_ERROR_INODE(inode, "bad extra_isize (%u != %u)",
+ EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize,
+ EXT4_INODE_SIZE(inode->i_sb));
ret = -EIO;
goto bad_inode;
}
@@ -3731,6 +3791,8 @@ static int ext4_do_update_inode(handle_t *handle,
raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
}
+ ext4_inode_csum_set(inode, raw_inode);
+
BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
rc = ext4_handle_dirty_metadata(handle, NULL, bh);
if (!err)
On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> This patch introduces to ext4 the ability to calculate and verify inode
> checksums. This requires the use of a new ro compatibility flag and some
> accompanying e2fsprogs patches to provide the relevant features in tune2fs and
> e2fsck.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/ext4/ext4.h | 4 ++--
> fs/ext4/inode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 64 insertions(+), 2 deletions(-)
>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index f79ddac..e2361cc 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -609,7 +609,7 @@ struct ext4_inode {
> __le16 l_i_file_acl_high;
> __le16 l_i_uid_high; /* these 2 fields */
> __le16 l_i_gid_high; /* were reserved2[0] */
> - __u32 l_i_reserved2;
> + __le32 l_i_checksum; /* crc32c(uuid+inum+inode) */
> } linux2;
> struct {
> __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
> @@ -727,7 +727,7 @@ do { \
> #define i_gid_low i_gid
> #define i_uid_high osd2.linux2.l_i_uid_high
> #define i_gid_high osd2.linux2.l_i_gid_high
> -#define i_reserved2 osd2.linux2.l_i_reserved2
> +#define i_checksum osd2.linux2.l_i_checksum
>
> #elif defined(__GNU__)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c4da98a..44a7f88 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -38,6 +38,7 @@
> #include <linux/printk.h>
> #include <linux/slab.h>
> #include <linux/ratelimit.h>
> +#include <linux/crc32c.h>
>
> #include "ext4_jbd2.h"
> #include "xattr.h"
> @@ -49,6 +50,53 @@
>
> #define MPAGE_DA_EXTENT_TAIL 0x01
>
> +static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + int offset = offsetof(struct ext4_inode, i_checksum);
This could be declared "const int" so that it is not consuming space on
the stack, or just put it inline in the code instead of a stack variable
since it is a compile time constant.
> + __le32 inum = cpu_to_le32(inode->i_ino);
> + __u32 crc = 0;
> +
> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> + cpu_to_le32(EXT4_OS_LINUX))
This can be marked unlikely() I think.
> + return 0;
> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> + return 0;
> +
> + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
> + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
I wonder if it makes sense to pre-compute the crc32c of s_uuid (stored
in sbi) and/or s_uuid+inum (stored in struct ext4_inode_info). I suspect
precomputing the s_uuid checksum is worthwhile, but I'm not sure whether
precomputing the inode checksum is worthwhile unless it doesn't reduce
the number of ext4_inode_info structs per page in the slab.
> + crc = crc32c_le(crc, (__u8 *)raw, offset);
> + offset += sizeof(raw->i_checksum); /* skip checksum */
> + crc = crc32c_le(crc, (__u8 *)raw + offset,
> + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize -
> + offset);
I suspect it would be more efficient to set raw->i_checksum = 0, then
compute the checksum on the whole raw inode buffer, and fill in
raw->i_checksum = cpu_to_le32(crc) at the end. That would mean the
caller ext4_inode_csum_verify() should save the original checksum for
comparison with the returned value.
The one problem with this is that it is racy w.r.t other users
> + return cpu_to_le32(crc);
> +}
> +
> +static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw)
> +{
> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os ==
> + cpu_to_le32(EXT4_OS_LINUX) &&
> + EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
> + (raw->i_checksum != ext4_inode_csum(inode, raw)))
This check can be marked unlikely(), since the rare case of a checksum
failure can cause a stall in the execution pipeline. It might make sense
to put the unlikely() at the lone callsite to move the whole function call
overhead out-of-line.
> + return 0;
> + return 1;
> +}
> +
> +static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw)
> +{
> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> + cpu_to_le32(EXT4_OS_LINUX) ||
> + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> + return;
> +
> + raw->i_checksum = ext4_inode_csum(inode, raw);
> +}
> +
> static inline int ext4_begin_ordered_truncate(struct inode *inode,
> loff_t new_size)
> {
> @@ -3410,6 +3458,15 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> if (ret < 0)
> goto bad_inode;
> raw_inode = ext4_raw_inode(&iloc);
> +
> + if (!ext4_inode_csum_verify(inode, raw_inode)) {
> + EXT4_ERROR_INODE(inode, "checksum invalid (0x%x != 0x%x)",
> + le32_to_cpu(ext4_inode_csum(inode, raw_inode)),
> + le32_to_cpu(raw_inode->i_checksum));
> + ret = -EIO;
> + goto bad_inode;
> + }
> +
> inode->i_mode = le16_to_cpu(raw_inode->i_mode);
> inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
> inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
> @@ -3490,6 +3547,9 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> EXT4_INODE_SIZE(inode->i_sb)) {
> + EXT4_ERROR_INODE(inode, "bad extra_isize (%u != %u)",
> + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize,
> + EXT4_INODE_SIZE(inode->i_sb));
> ret = -EIO;
> goto bad_inode;
> }
> @@ -3731,6 +3791,8 @@ static int ext4_do_update_inode(handle_t *handle,
> raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
> }
>
> + ext4_inode_csum_set(inode, raw_inode);
This might warrant a comment to always be the last function before
submitting the inode to the journal.
> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> if (!err)
Also, rather than just making the checksum be updated at commit time, it
makes more sense to have ext4_do_update_inode() only be called once per
commit, since this is an expensive function.
Cheers, Andreas
On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> Compute and verify the checksum of the inode bitmap; the checkum is stored in
> the block group descriptor.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/ext4/ext4.h | 3 ++-
> fs/ext4/ialloc.c | 33 ++++++++++++++++++++++++++++++---
> 2 files changed, 32 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index bc7ace1..248cbd2 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -279,7 +279,8 @@ struct ext4_group_desc
> __le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
> __le16 bg_used_dirs_count_hi; /* Directories count MSB */
> __le16 bg_itable_unused_hi; /* Unused inodes count MSB */
> - __u32 bg_reserved2[3];
> + __le32 bg_inode_bitmap_csum; /* crc32c(uuid+group+ibitmap) */
> + __u32 bg_reserved2[2];
> };
I would prefer if there was a 16-bit checksum for the (most common)
32-byte group descriptors, and this was extended to a 32-bit checksum
for the (much less common) 64-byte+ group descriptors. For filesystems
that are newly formatted with the 64bit feature it makes no difference,
but virtually all ext3/4 filesystems have only the smaller group descriptors.
Regardless of whether using half of the crc32c is better or worse than
using crc16 for the bitmap blocks, storing _any_ checksum is better than
storing nothing at all. I would propose the following:
struct ext4_group_desc
{
__le32 bg_block_bitmap_lo; /* Blocks bitmap block */
__le32 bg_inode_bitmap_lo; /* Inodes bitmap block */
__le32 bg_inode_table_lo; /* Inodes table block */
__le16 bg_free_blocks_count_lo; /* Free blocks count */
__le16 bg_free_inodes_count_lo; /* Free inodes count */
__le16 bg_used_dirs_count_lo; /* Directories count */
__le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
__le32 bg_exclude_bitmap_lo; /* Exclude bitmap block */
__le16 bg_block_bitmap_csum_lo; /* Block bitmap checksum */
__le16 bg_inode_bitmap_csum_lo; /* Inode bitmap checksum */
__le16 bg_itable_unused_lo; /* Unused inodes count */
__le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
__le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
__le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
__le32 bg_inode_table_hi; /* Inodes table block MSB */
__le16 bg_free_blocks_count_hi; /* Free blocks count MSB */
__le16 bg_free_inodes_count_hi; /* Free inodes count MSB */
__le16 bg_used_dirs_count_hi; /* Directories count MSB */
__le16 bg_itable_unused_hi; /* Unused inodes count MSB */
__le32 bg_exclude_bitmap_hi; /* Exclude bitmap block MSB */
__le16 bg_block_bitmap_csum_hi; /* Blocks bitmap checksum MSB */
__le16 bg_inode_bitmap_csum_hi; /* Inodes bitmap checksum MSB */
__le32 bg_reserved2;
};
This is also different from your layout because it locates the block bitmap
checksum field before the inode bitmap checksum, to more closely match the
order of other fields in this structure.
> /*
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index 9c63f27..53faffc 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -82,12 +82,18 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
> ext4_free_inodes_set(sb, gdp, 0);
> ext4_itable_unused_set(sb, gdp, 0);
> memset(bh->b_data, 0xff, sb->s_blocksize);
> + ext4_bitmap_csum_set(sb, block_group,
> + &gdp->bg_inode_bitmap_csum, bh,
> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
The number of inodes per group is already always a multiple of 8.
> return 0;
> }
>
> memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8,
> bh->b_data);
> + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum, bh,
> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> + gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
>
> return EXT4_INODES_PER_GROUP(sb);
> }
> @@ -118,12 +124,12 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> return NULL;
> }
> if (bitmap_uptodate(bh))
> - return bh;
> + goto verify;
>
> lock_buffer(bh);
> if (bitmap_uptodate(bh)) {
> unlock_buffer(bh);
> - return bh;
> + goto verify;
> }
>
> ext4_lock_group(sb, block_group);
> @@ -131,6 +137,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> ext4_init_inode_bitmap(sb, bh, block_group, desc);
> set_bitmap_uptodate(bh);
> set_buffer_uptodate(bh);
> + set_buffer_verified(bh);
> ext4_unlock_group(sb, block_group);
> unlock_buffer(bh);
> return bh;
> @@ -144,7 +151,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> */
> set_bitmap_uptodate(bh);
> unlock_buffer(bh);
> - return bh;
> + goto verify;
> }
> /*
> * submit the buffer_head for read. We can
> @@ -161,6 +168,21 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> block_group, bitmap_blk);
> return NULL;
> }
> +
> +verify:
> + ext4_lock_group(sb, block_group);
> + if (!buffer_verified(bh) &&
> + !ext4_bitmap_csum_verify(sb, block_group,
> + desc->bg_inode_bitmap_csum, bh,
> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8)) {
> + ext4_unlock_group(sb, block_group);
> + put_bh(bh);
> + ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
> + "inode_bitmap = %llu", block_group, bitmap_blk);
> + return NULL;
At some point we should add a flag like EXT4_BG_INODE_ERROR so that the
group can be marked in error on disk, and skipped for future allocations,
but the whole filesystem does not need to be remounted read-only. That's
for another patch, however.
> + }
> + ext4_unlock_group(sb, block_group);
> + set_buffer_verified(bh);
> return bh;
> }
>
> @@ -265,6 +287,8 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> ext4_used_dirs_set(sb, gdp, count);
> percpu_counter_dec(&sbi->s_dirs_counter);
> }
> + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum,
> + bitmap_bh, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
> ext4_unlock_group(sb, block_group);
>
> @@ -784,6 +808,9 @@ static int ext4_claim_inode(struct super_block *sb,
> atomic_inc(&sbi->s_flex_groups[f].used_dirs);
> }
> }
> + ext4_bitmap_csum_set(sb, group, &gdp->bg_inode_bitmap_csum,
> + inode_bitmap_bh,
> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
> err_ret:
> ext4_unlock_group(sb, group);
>
On Wed, Aug 31, 2011 at 8:30 PM, Darrick J. Wong <[email protected]> wrote:
> Hi all,
>
> This patchset adds crc32c checksums to most of the ext4 metadata objects. ?A
> full design document is on the ext4 wiki[1] but I will summarize that document here.
>
> As much as we wish our storage hardware was totally reliable, it is still
> quite possible for data to be corrupted on disk, corrupted during transfer over
> a wire, or written to the wrong places. ?To protect against this sort of
> non-hostile corruption, it is desirable to store checksums of metadata objects
> on the filesystem to prevent broken metadata from shredding the filesystem.
>
> The crc32c polynomial was chosen for its improved error detection capabilities
> over crc32 and crc16, and because of its hardware acceleration on current and
> upcoming Intel and Sparc chips.
>
> Each type of metadata object has been retrofitted to store a checksum as follows:
>
> - The superblock stores a crc32c of itself.
> - Each inode stores crc32c(fs_uuid + inode_num + inode + slack_space_after_inode)
> - Block and inode bitmaps each get their own crc32c(fs_uuid + group_num +
> ?bitmap), stored in the block group descriptor.
> - Each extent tree block stores a crc32c(fs_uuid + inode_num + extent_entries)
> ?in unused space at the end of the block.
> - Each directory leaf block has an unused-looking directory entry big enough to
> ?store a crc32c(fs_uuid + inode_num + block) at the end of the block.
> - Each directory htree block is shortened to contain a crc32c(fs_uuid +
> ?inode_num + block) at the end of the block.
> - Extended attribute blocks store crc32c(fs_uuid + block_no + ea_block) in the
> ?header.
> - Journal commit blocks can be converted to use crc32c to checksum all blocks
> ?in the transaction, if journal_checksum is given.
>
> The first four patches in the kernel patchset fix existing bugs in ext4 that
> cause incorrect checkums to be written. ?I think Ted already took them, but
> with recent instability I'm resending them to be cautious. ?The subsequent 12
> patches add the necessary code to support checksumming in ext4 and jbd2.
>
> I also have a set of three patches that provide a faster crc32c implementation
> based on Bob Pearson's earlier crc32 patchset. ?This will be sent under
> separate cover to the crypto list and to lkml/linux-ext4.
>
> The patchset for e2fsprogs will be sent under separate cover only to linux-ext4
> as it is quite lengthy (~36 patches).
>
> As far as performance impact goes, I see nearly no change with a standard mail
> server ffsb simulation. ?On a test that involves only file creation and
> deletion and extent tree modifications, I see a drop of about 50 percent with
> the current kernel crc32c implementation; this improves to a drop of about 20
> percent with the enclosed crc32c implementation. ?However, given that metadata
> is usually a small fraction of total IO, it doesn't seem like the cost of
> enabling this feature is unreasonable.
>
> There are of course unresolved issues:
>
> - What to do when the block group descriptor isn't big enough to hold 2 crc32s
> ?(which is the case with 32-bit ext4 filesystems, sadly). ?I'm not quite
> ?convinced that truncating a 32-bit checksum to 16-bits is a safe idea. ?Right
> ?now, one has to enable the 64bit feature to enable any bitmap checksums.
> ?I'm not sure how effective crc16 is at checksumming 32768-bit bitmaps.
>
> - Using the journal commit hooks to delay crc32c calculation until dirty
> ?buffers are actually being written to disk.
>
> - Can we get away with using a (hw accelerated) LE crc32c for jbd2, which
> ?stores its data in BE order?
>
> - Interaction with online resize code. ?Yongqiang seems to be in the process of
> ?rewriting this, so I haven't looked at it very closely yet.
>
> - If block group descriptors can now exceed 32 bytes (when 64bit filesystem
> ?support is enabled), should we use crc32c instead of crc16? ?From what I've
> ?read of the literature, crc16 is not very effective on datasets exceeding 256
> ?bytes.
>
> Please have a look at the design document and patches, and please feel free to
> suggest any changes. ?I will be at LPC next week if anyone wishes to discuss,
> debate, or protest.
>
> --D
>
> [1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
Derrick,
Brainstorming only:
Another thing you might consider is to somehow tie into the data
integrity patches that went into the kernel a couple years ago. Those
are tied to specialized storage devices (typically scsi) that can
actually have the checksum live on the disk, but not in the normal
data area. ie. in the sector header / footer or some other out of
band area.
At a minimum, it may make sense to use the same CRC which that API
does. Then you could calculate the CRC once and put it both in-band
in the inode and out-of-band in the dedicated integrity area of
supporting storage devices.
That if the data is corrupted on the wire as an example, the
controller itself may be able to verify its a bad crc and ask for a
re-read without even involving the kernel.
I believe supporting hardware is rare, but if the kernel is going to
have a data integrity API to support it at all, then code like this is
exactly the kind of code that should layer on top of it.
Greg
On Fri, Sep 02, 2011 at 10:15:27AM -0400, Greg Freemyer wrote:
> On Wed, Aug 31, 2011 at 8:30 PM, Darrick J. Wong <[email protected]> wrote:
> > Hi all,
> >
> > This patchset adds crc32c checksums to most of the ext4 metadata objects. ?A
> > full design document is on the ext4 wiki[1] but I will summarize that document here.
> >
> > As much as we wish our storage hardware was totally reliable, it is still
> > quite possible for data to be corrupted on disk, corrupted during transfer over
> > a wire, or written to the wrong places. ?To protect against this sort of
> > non-hostile corruption, it is desirable to store checksums of metadata objects
> > on the filesystem to prevent broken metadata from shredding the filesystem.
> >
> > The crc32c polynomial was chosen for its improved error detection capabilities
> > over crc32 and crc16, and because of its hardware acceleration on current and
> > upcoming Intel and Sparc chips.
> >
> > Each type of metadata object has been retrofitted to store a checksum as follows:
> >
> > - The superblock stores a crc32c of itself.
> > - Each inode stores crc32c(fs_uuid + inode_num + inode + slack_space_after_inode)
> > - Block and inode bitmaps each get their own crc32c(fs_uuid + group_num +
> > ?bitmap), stored in the block group descriptor.
> > - Each extent tree block stores a crc32c(fs_uuid + inode_num + extent_entries)
> > ?in unused space at the end of the block.
> > - Each directory leaf block has an unused-looking directory entry big enough to
> > ?store a crc32c(fs_uuid + inode_num + block) at the end of the block.
> > - Each directory htree block is shortened to contain a crc32c(fs_uuid +
> > ?inode_num + block) at the end of the block.
> > - Extended attribute blocks store crc32c(fs_uuid + block_no + ea_block) in the
> > ?header.
> > - Journal commit blocks can be converted to use crc32c to checksum all blocks
> > ?in the transaction, if journal_checksum is given.
> >
> > The first four patches in the kernel patchset fix existing bugs in ext4 that
> > cause incorrect checkums to be written. ?I think Ted already took them, but
> > with recent instability I'm resending them to be cautious. ?The subsequent 12
> > patches add the necessary code to support checksumming in ext4 and jbd2.
> >
> > I also have a set of three patches that provide a faster crc32c implementation
> > based on Bob Pearson's earlier crc32 patchset. ?This will be sent under
> > separate cover to the crypto list and to lkml/linux-ext4.
> >
> > The patchset for e2fsprogs will be sent under separate cover only to linux-ext4
> > as it is quite lengthy (~36 patches).
> >
> > As far as performance impact goes, I see nearly no change with a standard mail
> > server ffsb simulation. ?On a test that involves only file creation and
> > deletion and extent tree modifications, I see a drop of about 50 percent with
> > the current kernel crc32c implementation; this improves to a drop of about 20
> > percent with the enclosed crc32c implementation. ?However, given that metadata
> > is usually a small fraction of total IO, it doesn't seem like the cost of
> > enabling this feature is unreasonable.
> >
> > There are of course unresolved issues:
> >
> > - What to do when the block group descriptor isn't big enough to hold 2 crc32s
> > ?(which is the case with 32-bit ext4 filesystems, sadly). ?I'm not quite
> > ?convinced that truncating a 32-bit checksum to 16-bits is a safe idea. ?Right
> > ?now, one has to enable the 64bit feature to enable any bitmap checksums.
> > ?I'm not sure how effective crc16 is at checksumming 32768-bit bitmaps.
> >
> > - Using the journal commit hooks to delay crc32c calculation until dirty
> > ?buffers are actually being written to disk.
> >
> > - Can we get away with using a (hw accelerated) LE crc32c for jbd2, which
> > ?stores its data in BE order?
> >
> > - Interaction with online resize code. ?Yongqiang seems to be in the process of
> > ?rewriting this, so I haven't looked at it very closely yet.
> >
> > - If block group descriptors can now exceed 32 bytes (when 64bit filesystem
> > ?support is enabled), should we use crc32c instead of crc16? ?From what I've
> > ?read of the literature, crc16 is not very effective on datasets exceeding 256
> > ?bytes.
> >
> > Please have a look at the design document and patches, and please feel free to
> > suggest any changes. ?I will be at LPC next week if anyone wishes to discuss,
> > debate, or protest.
> >
> > --D
> >
> > [1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
>
> Derrick,
>
> Brainstorming only:
>
> Another thing you might consider is to somehow tie into the data
> integrity patches that went into the kernel a couple years ago. Those
> are tied to specialized storage devices (typically scsi) that can
> actually have the checksum live on the disk, but not in the normal
> data area. ie. in the sector header / footer or some other out of
> band area.
>
> At a minimum, it may make sense to use the same CRC which that API
> does. Then you could calculate the CRC once and put it both in-band
> in the inode and out-of-band in the dedicated integrity area of
> supporting storage devices.
If you have the necessary DIF/DIX hardware and kernel support then every block
in the FS is already checksummed and you don't need metadata_csum at all. This
patchset was really intended for setups where we don't have DIF/DIX.
Furthermore, the nice thing about the in-filesystem checksum is that we bake in
other things like the FS UUID and the inode number, which gives you a somewhat
better assurance that the data block belongs to the fs and the file that the
code think it belongs to. The DIX interface allows for a 32-bit block number
and a 16-bit application tag ... which is unfortunately small given 64-bit
block numbers and 32-bit inode numbers.
I guess there's also an argument that from a layering perspective it's
desirable to have a FS image whose integrity features remain intact even if you
copy the image to a different device that doesn't support the hardware feature.
As a side note, the crc-t10dif implementation is quite slow -- the hardware
accelerated crc32c is 15x faster, and the sw implementation is usually 3-6x
faster. I suspect somebody will want to fix that before DIF becomes more
widespread...
> That if the data is corrupted on the wire as an example, the
> controller itself may be able to verify its a bad crc and ask for a
> re-read without even involving the kernel.
>
> I believe supporting hardware is rare, but if the kernel is going to
> have a data integrity API to support it at all, then code like this is
> exactly the kind of code that should layer on top of it.
The good news is that if you're really worried about integrity, metadata_csum
and DIF/DIX aren't mutually exclusive features. Rejecting corrupted write
commands at write time seems like a useful feature. :)
--D
On Thu, Sep 01, 2011 at 02:11:26AM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:32 PM, Darrick J. Wong wrote:
> > The CRC32c polynomial provides better error detection and can be hardware
> > accelerated on certain machines. To that end, add support for it to jbd2.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/jbd2/Kconfig | 1 +
> > fs/jbd2/commit.c | 6 +++---
> > fs/jbd2/recovery.c | 20 +++++++++++++++++---
> > include/linux/jbd2.h | 1 +
> > 4 files changed, 22 insertions(+), 6 deletions(-)
> >
> >
> > diff --git a/fs/jbd2/Kconfig b/fs/jbd2/Kconfig
> > index f32f346..40a126b 100644
> > --- a/fs/jbd2/Kconfig
> > +++ b/fs/jbd2/Kconfig
> > @@ -1,6 +1,7 @@
> > config JBD2
> > tristate
> > select CRC32
> > + select LIBCRC32C
> > help
> > This is a generic journaling layer for block devices that support
> > both 32-bit and 64-bit block numbers. It is currently used by
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index eef6979..00387be 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -21,7 +21,7 @@
> > #include <linux/mm.h>
> > #include <linux/pagemap.h>
> > #include <linux/jiffies.h>
> > -#include <linux/crc32.h>
> > +#include <linux/crc32c.h>
> > #include <linux/writeback.h>
> > #include <linux/backing-dev.h>
> > #include <linux/bio.h>
> > @@ -125,7 +125,7 @@ static int journal_submit_commit_record(journal_t *journal,
> >
> > if (JBD2_HAS_COMPAT_FEATURE(journal,
> > JBD2_FEATURE_COMPAT_CHECKSUM)) {
> > - tmp->h_chksum_type = JBD2_CRC32_CHKSUM;
> > + tmp->h_chksum_type = JBD2_CRC32C_CHKSUM;
> > tmp->h_chksum_size = JBD2_CRC32_CHKSUM_SIZE;
> > tmp->h_chksum[0] = cpu_to_be32(crc32_sum);
> > }
> > @@ -287,7 +287,7 @@ static __u32 jbd2_checksum_data(__u32 crc32_sum, struct buffer_head *bh)
> > __u32 checksum;
> >
> > addr = kmap_atomic(page, KM_USER0);
> > - checksum = crc32_be(crc32_sum,
> > + checksum = crc32c_le(crc32_sum,
> > (void *)(addr + offset_in_page(bh->b_data)), bh->b_size);
> > kunmap_atomic(addr, KM_USER0);
> >
> > diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
> > index 1cad869..4bab4dd 100644
> > --- a/fs/jbd2/recovery.c
> > +++ b/fs/jbd2/recovery.c
> > @@ -21,6 +21,7 @@
> > #include <linux/jbd2.h>
> > #include <linux/errno.h>
> > #include <linux/crc32.h>
> > +#include <linux/crc32c.h>
> > #endif
> >
> > /*
> > @@ -323,7 +324,8 @@ static inline unsigned long long read_tag_block(int tag_bytes, journal_block_tag
> > * descriptor block.
> > */
> > static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> > - unsigned long *next_log_block, __u32 *crc32_sum)
> > + unsigned long *next_log_block, __u32 *crc32_sum,
> > + __u32 *crc32c_sum)
> > {
> > int i, num_blks, err;
> > unsigned long io_block;
> > @@ -332,6 +334,7 @@ static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> > num_blks = count_tags(journal, bh);
> > /* Calculate checksum of the descriptor block. */
> > *crc32_sum = crc32_be(*crc32_sum, (void *)bh->b_data, bh->b_size);
> > + *crc32c_sum = crc32c_le(*crc32c_sum, (void *)bh->b_data, bh->b_size);
>
> We definitely do not want to compute both the crc32 and crc32c for every
> block written to the journal. That would be needlessly expensive.
>
> It looks like the missing factor is not knowing the checksum type before
> the commit block is accessed. This can be fixed by storing the checksum
> type in the journal superblock (JBD2_CKSUM_TYPE_CRC32 = 0 for compatibility).
>
> During recovery, the journal superblock s_chksum_type should remain as
> set by the previous kernel (so the existing commit block checksums can
> be verified) and as soon as the journal recovery has completed and a new
> block written to the journal the checksum type can be updated to the latest
> default value.
Okay, that does sound better than my current weird approach. :) I admit to
being a bit timid about messing with jbd2 after proposing a lot of changes to
ext4.
> > for (i = 0; i < num_blks; i++) {
> > io_block = (*next_log_block)++;
> > @@ -344,6 +347,9 @@ static int calc_chksums(journal_t *journal, struct buffer_head *bh,
> > } else {
> > *crc32_sum = crc32_be(*crc32_sum, (void *)obh->b_data,
> > obh->b_size);
> > + *crc32c_sum = crc32c_le(*crc32c_sum,
> > + (void *)obh->b_data,
> > + obh->b_size);
>
> (style) the indenting here is funky. Continued lines should follow the
> '(' on the previous line and not be gratuitously wrapped.
Yeah, I wondered about that.
--D
> > @@ -617,7 +625,12 @@ static int do_one_pass(journal_t *journal,
> > cbh->h_chksum_type == JBD2_CRC32_CHKSUM &&
> > cbh->h_chksum_size ==
> > JBD2_CRC32_CHKSUM_SIZE)
> > - chksum_seen = 1;
> > + chksum_seen = 1;
> > + else if (crc32c_sum == found_chksum &&
> > + cbh->h_chksum_type == JBD2_CRC32C_CHKSUM &&
> > + cbh->h_chksum_size ==
> > + JBD2_CRC32_CHKSUM_SIZE)
> > + chksum_seen = 1;
> > else if (!(cbh->h_chksum_type == 0 &&
> > cbh->h_chksum_size == 0 &&
> > found_chksum == 0 &&
> > @@ -646,6 +659,7 @@ static int do_one_pass(journal_t *journal,
> > }
> > }
> > crc32_sum = ~0;
> > + crc32c_sum = ~0;
> > }
> > brelse(bh);
> > next_commit_ID++;
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 38f307b..de3ec23 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -147,6 +147,7 @@ typedef struct journal_header_s
> > #define JBD2_CRC32_CHKSUM 1
> > #define JBD2_MD5_CHKSUM 2
> > #define JBD2_SHA1_CHKSUM 3
> > +#define JBD2_CRC32C_CHKSUM 4
> >
> > #define JBD2_CRC32_CHKSUM_SIZE 4
> >
> >
>
On Thu, Sep 01, 2011 at 01:52:43AM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:32 PM, Darrick J. Wong wrote:
> > Calculate and verify the superblock checksum. Since the UUID and block group
> > number are embedded in each copy of the superblock, we need only checksum the
> > entire block. Refactor some of the code to eliminate open-coding of the
> > checksum update call.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/ext4.h | 8 +++++++-
> > fs/ext4/ext4_jbd2.c | 9 ++++++++-
> > fs/ext4/ext4_jbd2.h | 7 +++++--
> > fs/ext4/inode.c | 3 +--
> > fs/ext4/namei.c | 4 ++--
> > fs/ext4/resize.c | 6 +++++-
> > fs/ext4/super.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> > 7 files changed, 71 insertions(+), 9 deletions(-)
> >
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index b7aa5b5..1e93410 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1067,7 +1067,9 @@ struct ext4_super_block {
> > __u8 s_last_error_func[32]; /* function where the error happened */
> > #define EXT4_S_ERR_END offsetof(struct ext4_super_block, s_mount_opts)
> > __u8 s_mount_opts[64];
> > - __le32 s_reserved[112]; /* Padding to the end of the block */
> > + __u32 s_reserved1[3]; /* Padding */
>
> Rather than mark these as reserved, it would be better to fill in the
> intended field names so that there is no confusion later on. I believe
> that there is s_usr_quota_inum and s_grp_quota_inum immediately following
> s_mount_opts, but I'm not sure what the 3rd reserved field is for.
"s_overhead_blocks", which I think is somehow related to bigalloc. They
haven't appeared in the kernel yet, which made me wary of adding fields that
aren't used by this patchset.
> > + __u32 s_checksum; /* crc32c(superblock) */
> > + __le32 s_reserved2[108]; /* Padding to the end of the block */
> > };
> >
> > #define EXT4_S_ERR_LEN (EXT4_S_ERR_END - EXT4_S_ERR_START)
> > @@ -1901,6 +1903,10 @@ extern int ext4_group_extend(struct super_block *sb,
> > ext4_fsblk_t n_blocks_count);
> >
> > /* super.c */
> > +extern int ext4_superblock_csum_verify(struct super_block *sb,
> > + struct ext4_super_block *es);
> > +extern void ext4_superblock_csum_set(struct super_block *sb,
> > + struct ext4_super_block *es);
> > extern void *ext4_kvmalloc(size_t size, gfp_t flags);
> > extern void *ext4_kvzalloc(size_t size, gfp_t flags);
> > extern void ext4_kvfree(void *ptr);
> > diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> > index f5240aa..04ddc97 100644
> > --- a/fs/ext4/ext4_jbd2.c
> > +++ b/fs/ext4/ext4_jbd2.c
> > @@ -136,16 +136,23 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
> > }
> >
> > int __ext4_handle_dirty_super(const char *where, unsigned int line,
> > - handle_t *handle, struct super_block *sb)
> > + handle_t *handle, struct super_block *sb,
> > + int now)
> > {
> > struct buffer_head *bh = EXT4_SB(sb)->s_sbh;
> > int err = 0;
> >
> > if (ext4_handle_valid(handle)) {
> > + ext4_superblock_csum_set(sb,
> > + (struct ext4_super_block *)bh->b_data);
> > err = jbd2_journal_dirty_metadata(handle, bh);
> > if (err)
> > ext4_journal_abort_handle(where, line, __func__,
> > bh, handle, err);
> > + } else if (now) {
> > + ext4_superblock_csum_set(sb,
> > + (struct ext4_super_block *)bh->b_data);
> > + mark_buffer_dirty(bh);
> > } else
> > sb->s_dirt = 1;
> > return err;
> > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > index 5802fa1..ed9b78d 100644
> > --- a/fs/ext4/ext4_jbd2.h
> > +++ b/fs/ext4/ext4_jbd2.h
> > @@ -141,7 +141,8 @@ int __ext4_handle_dirty_metadata(const char *where, unsigned int line,
> > struct buffer_head *bh);
> >
> > int __ext4_handle_dirty_super(const char *where, unsigned int line,
> > - handle_t *handle, struct super_block *sb);
> > + handle_t *handle, struct super_block *sb,
> > + int now);
> >
> > #define ext4_journal_get_write_access(handle, bh) \
> > __ext4_journal_get_write_access(__func__, __LINE__, (handle), (bh))
> > @@ -153,8 +154,10 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
> > #define ext4_handle_dirty_metadata(handle, inode, bh) \
> > __ext4_handle_dirty_metadata(__func__, __LINE__, (handle), (inode), \
> > (bh))
> > +#define ext4_handle_dirty_super_now(handle, sb) \
> > + __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb), 1)
> > #define ext4_handle_dirty_super(handle, sb) \
> > - __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb))
> > + __ext4_handle_dirty_super(__func__, __LINE__, (handle), (sb), 0)
> >
> > handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks);
> > int __ext4_journal_stop(const char *where, unsigned int line, handle_t *handle);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index e24ba98..52e9b67 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3761,8 +3761,7 @@ static int ext4_do_update_inode(handle_t *handle,
> > EXT4_FEATURE_RO_COMPAT_LARGE_FILE);
> > sb->s_dirt = 1;
> > ext4_handle_sync(handle);
> > - err = ext4_handle_dirty_metadata(handle, NULL,
> > - EXT4_SB(sb)->s_sbh);
> > + err = ext4_handle_dirty_super_now(handle, sb);
> > }
> > }
> > raw_inode->i_generation = cpu_to_le32(inode->i_generation);
> > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > index 2d0fdb9..5ebf281 100644
> > --- a/fs/ext4/namei.c
> > +++ b/fs/ext4/namei.c
> > @@ -2407,7 +2407,7 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode)
> > /* Insert this inode at the head of the on-disk orphan list... */
> > NEXT_ORPHAN(inode) = le32_to_cpu(EXT4_SB(sb)->s_es->s_last_orphan);
> > EXT4_SB(sb)->s_es->s_last_orphan = cpu_to_le32(inode->i_ino);
> > - err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
> > + err = ext4_handle_dirty_super_now(handle, sb);
> > rc = ext4_mark_iloc_dirty(handle, inode, &iloc);
> > if (!err)
> > err = rc;
> > @@ -2480,7 +2480,7 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode)
> > if (err)
> > goto out_brelse;
> > sbi->s_es->s_last_orphan = cpu_to_le32(ino_next);
> > - err = ext4_handle_dirty_metadata(handle, NULL, sbi->s_sbh);
> > + err = ext4_handle_dirty_super_now(handle, inode->i_sb);
> > } else {
> > struct ext4_iloc iloc2;
> > struct inode *i_prev =
> > diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> > index 707d3f1..2ad7008 100644
> > --- a/fs/ext4/resize.c
> > +++ b/fs/ext4/resize.c
> > @@ -511,7 +511,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
> > ext4_kvfree(o_group_desc);
> >
> > le16_add_cpu(&es->s_reserved_gdt_blocks, -1);
> > - err = ext4_handle_dirty_metadata(handle, NULL, EXT4_SB(sb)->s_sbh);
> > + err = ext4_handle_dirty_super_now(handle, sb);
> > if (err)
> > ext4_std_error(sb, err);
> >
> > @@ -682,6 +682,8 @@ static void update_backups(struct super_block *sb,
> > goto exit_err;
> > }
> >
> > + ext4_superblock_csum_set(sb, (struct ext4_super_block *)data);
> > +
> > while ((group = ext4_list_backups(sb, &three, &five, &seven)) < last) {
> > struct buffer_head *bh;
> >
> > @@ -925,6 +927,8 @@ int ext4_group_add(struct super_block *sb, struct ext4_new_group_data *input)
> > /* Update the global fs size fields */
> > sbi->s_groups_count++;
> >
> > + ext4_superblock_csum_set(sb,
> > + (struct ext4_super_block *)primary->b_data);
> > err = ext4_handle_dirty_metadata(handle, NULL, primary);
> > if (unlikely(err)) {
> > ext4_std_error(sb, err);
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 44d0c8d..b254274 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -38,6 +38,7 @@
> > #include <linux/ctype.h>
> > #include <linux/log2.h>
> > #include <linux/crc16.h>
> > +#include <linux/crc32c.h>
> > #include <linux/cleancache.h>
> > #include <asm/uaccess.h>
> >
> > @@ -110,6 +111,41 @@ static struct file_system_type ext3_fs_type = {
> > #define IS_EXT3_SB(sb) (0)
> > #endif
> >
> > +static __le32 ext4_superblock_csum(struct super_block *sb,
> > + struct ext4_super_block *es)
> > +{
> > + int offset = offsetof(struct ext4_super_block, s_checksum);
> > + __u32 crc = 0;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return 0;
> > +
> > + crc = crc32c_le(~0, (char *)es, offset);
>
> For consistency, shouldn't this also checksum the fields _after_ offset?
> Otherwise, if new fields are used after s_checksum the update to this
> function may easily be missed, and those new fields would not be covered
> by the checksum. It also means the superblock checksum would be different
> based on which kernel is being used (based on which fields it knows about.
Actually, I had pondered putting the checksum at the very end of the 1k
superblock, which makes the checksum cover the padding too. On the other hand,
I thought I might save us ~400b of crc32c computation time. I don't have a big
objection to crc'ing all the padding.
--D
>
> > +
> > + return cpu_to_le32(crc);
> > +}
> > +
> > +int ext4_superblock_csum_verify(struct super_block *sb,
> > + struct ext4_super_block *es)
> > +{
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
> > + (es->s_checksum != ext4_superblock_csum(sb, es)))
> > + return 0;
> > + return 1;
> > +}
> > +
> > +void ext4_superblock_csum_set(struct super_block *sb,
> > + struct ext4_super_block *es)
> > +{
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return;
> > +
> > + es->s_checksum = ext4_superblock_csum(sb, es);
> > +}
> > +
> > void *ext4_kvmalloc(size_t size, gfp_t flags)
> > {
> > void *ret;
> > @@ -3151,6 +3187,12 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> > sb->s_magic = le16_to_cpu(es->s_magic);
> > if (sb->s_magic != EXT4_SUPER_MAGIC)
> > goto cantfind_ext4;
> > + if (!ext4_superblock_csum_verify(sb, es)) {
> > + ext4_msg(sb, KERN_ERR, "VFS: Found ext4 filesystem with "
> > + "invalid superblock checksum. Run e2fsck?");
> > + silent = 1;
> > + goto cantfind_ext4;
> > + }
> > sbi->s_kbytes_written = le64_to_cpu(es->s_kbytes_written);
> >
> > /* Set defaults before we parse the mount options */
> > @@ -4107,6 +4149,7 @@ static int ext4_commit_super(struct super_block *sb, int sync)
> > &EXT4_SB(sb)->s_freeinodes_counter));
> > sb->s_dirt = 0;
> > BUFFER_TRACE(sbh, "marking dirty");
> > + ext4_superblock_csum_set(sb, es);
> > mark_buffer_dirty(sbh);
> > if (sync) {
> > error = sync_dirty_buffer(sbh);
> >
>
On Thu, Sep 01, 2011 at 01:43:53AM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:32 PM, Darrick J. Wong wrote:
> > Extend the inode checksum to cover the empty space between the end of the
> > inode's data fields and the end of the space allocated for the inode. This
> > enables us to cover extended attribute data that might live in the empty space.
>
> I'm not sure that this should be a separate patch from the first inode
> checksum patch, but probably isn't harmful.
The only reason why it's separate is that this patch enables checksumming of
in-inode EA blocks. If someone wants me to reduce the patch count I can merge
them, but ... there's plenty of other things to work on.
--D
>
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/inode.c | 4 +---
> > 1 files changed, 1 insertions(+), 3 deletions(-)
> >
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 44a7f88..e24ba98 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -53,7 +53,6 @@
> > static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
> > {
> > struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > - struct ext4_inode_info *ei = EXT4_I(inode);
> > int offset = offsetof(struct ext4_inode, i_checksum);
> > __le32 inum = cpu_to_le32(inode->i_ino);
> > __u32 crc = 0;
> > @@ -70,8 +69,7 @@ static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
> > crc = crc32c_le(crc, (__u8 *)raw, offset);
> > offset += sizeof(raw->i_checksum); /* skip checksum */
> > crc = crc32c_le(crc, (__u8 *)raw + offset,
> > - EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize -
> > - offset);
> > + EXT4_INODE_SIZE(inode->i_sb) - offset);
> > return cpu_to_le32(crc);
> > }
> >
> >
>
On Thu, Sep 01, 2011 at 01:42:26AM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:32 PM, Darrick J. Wong wrote:
> > Calculate and verify the checksums of extended attribute blocks. This only
> > applies to separate EA blocks that are pointed to by inode->i_file_acl (i.e.
> > external EA blocks); the checksum lives in the EA header.
> >
> > diff --git a/fs/ext4/xattr.h b/fs/ext4/xattr.h
> > index 25b7387..b2b20af 100644
> > --- a/fs/ext4/xattr.h
> > +++ b/fs/ext4/xattr.h
> > @@ -27,7 +27,8 @@ struct ext4_xattr_header {
> > __le32 h_refcount; /* reference count */
> > __le32 h_blocks; /* number of disk blocks used */
> > __le32 h_hash; /* hash value of all attributes */
> > - __u32 h_reserved[4]; /* zero right now */
> > + __le32 h_checksum; /* crc32c(uuid+inum+xattrblock) */
>
> This comment is incorrect - the inum cannot be part of the checksum if
> the block is shared. That said, I wouldn't object if the checksum DID
> include the inum if the block was not shared (h_refcount == 1). Since
> the h_refcount needs to be modified and the checksum recomputed if other
> inodes share this block, this doesn't impose any extra overhead.
Oops, I guess I was typing too fast. "inum" should be "ea_block_num". Though
I suspect that in the common case files don't share EA blocks, so we could do
inum when refcount==1.
--D
>
> Cheers, Andreas
>
On Thu, Sep 01, 2011 at 01:36:50AM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> > Calculate and verify the checksums for directory leaf blocks (i.e. blocks that
> > only contain actual directory entries). The checksum lives in what looks to be
> > an unused directory entry with a 0 name_len at the end of the block. This
> > scheme is not used for internal htree nodes because the mechanism in place
> > there only costs one dx_entry, whereas the "empty" directory entry would cost
> > two dx_entries.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/dir.c | 12 +++
> > fs/ext4/ext4.h | 13 +++
> > fs/ext4/namei.c | 259 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
> > 3 files changed, 269 insertions(+), 15 deletions(-)
> >
> >
> > diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> > index 164c560..bc40c9e 100644
> > --- a/fs/ext4/dir.c
> > +++ b/fs/ext4/dir.c
> > @@ -180,6 +180,18 @@ static int ext4_readdir(struct file *filp,
> > continue;
> > }
> >
> > + /* Check the checksum */
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(inode,
> > + (struct ext4_dir_entry *)bh->b_data)) {
> > + EXT4_ERROR_FILE(filp, 0, "directory fails checksum "
> > + "at offset %llu",
> > + (unsigned long long)filp->f_pos);
> > + filp->f_pos += sb->s_blocksize - offset;
> > + continue;
> > + }
> > + set_buffer_verified(bh);
> > +
> > revalidate:
> > /* If the dir block has changed since the last call to
> > * readdir(2), then we might be pointing to an invalid
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index df149b3..b7aa5b5 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1471,6 +1471,17 @@ struct ext4_dir_entry_2 {
> > };
> >
> > /*
> > + * This is a bogus directory entry at the end of each leaf block that
> > + * records checksums.
> > + */
> > +struct ext4_dir_entry_tail {
> > + __le32 reserved_zero1; /* Pretend to be unused */
> > + __le16 rec_len; /* 12 */
> > + __le16 reserved_zero2; /* Zero name length */
> > + __le32 checksum; /* crc32c(uuid+inum+dirblock) */
> > +};
>
> Since this field is stored inline with existing directory entries, it
> may make sense to also add a magic value to this entry (preferably one
> with non-ASCII values) so that it can be distinguished from an empty
> dirent that happens to be at the end of the block.
I could set the file type to 0xDE since currently there's only 8 file types
defined.
> > +/*
> > * Ext4 directory file types. Only the low 3 bits are used. The
> > * other bits are reserved for now.
> > */
> > @@ -1875,6 +1886,8 @@ extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
> > extern int ext4_ext_migrate(struct inode *);
> >
> > /* namei.c */
> > +extern int ext4_dirent_csum_verify(struct inode *inode,
> > + struct ext4_dir_entry *dirent);
> > extern int ext4_orphan_add(handle_t *, struct inode *);
> > extern int ext4_orphan_del(handle_t *, struct inode *);
> > extern int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
> > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > index 89797bf..2d0fdb9 100644
> > --- a/fs/ext4/namei.c
> > +++ b/fs/ext4/namei.c
> > @@ -191,6 +191,104 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
> > struct inode *inode);
> >
> > /* checksumming functions */
> > +static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
> > + struct ext4_dir_entry *de)
> > +{
> > + struct ext4_dir_entry *d, *top;
> > + struct ext4_dir_entry_tail *t;
> > +
> > + d = de;
> > + top = (struct ext4_dir_entry *)(((void *)de) +
> > + (EXT4_BLOCK_SIZE(inode->i_sb) -
> > + sizeof(struct ext4_dir_entry_tail)));
> > + while (d < top && d->rec_len)
> > + d = (struct ext4_dir_entry *)(((void *)d) +
> > + le16_to_cpu(d->rec_len));
>
> Calling get_dirent_tail() is fairly expensive, because it has to walk
> the whole directory block each time. When filling a block it would
> be O(n^2) for the number of entries in the block.
>
> It would be more efficient to just cast the end of the directory block
> to the ext4_dir_entry_tail and check its validity, which is especially
> easy if there is a magic value in it.
>
> > + if (d != top)
> > + return NULL;
> > +
> > + t = (struct ext4_dir_entry_tail *)d;
> > + if (t->reserved_zero1 ||
> > + le16_to_cpu(t->rec_len) != sizeof(struct ext4_dir_entry_tail) ||
> > + t->reserved_zero2)
>
> I'd prefer these reserved_zero[12] fields be explicitly compared to zero
> instead of treated as boolean values.
Ok.
> > + return NULL;
> > +
> > + return t;
> > +}
> > +
> > +static __le32 ext4_dirent_csum(struct inode *inode,
> > + struct ext4_dir_entry *dirent)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > + struct ext4_dir_entry_tail *t;
> > + __le32 inum = cpu_to_le32(inode->i_ino);
> > + int size;
> > + __u32 crc = 0;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return 0;
> > +
> > + t = get_dirent_tail(inode, dirent);
>
> > + if (!t)
> > + return 0;
> > +
> > + size = (void *)t - (void *)dirent;
> > + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
> > + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
>
> Based on the number of times that the s_uuid+inum checksum is used in this
> code, and since it is constant for the life of the inode, it probably
> makes sense to precompute it and store it in ext4_inode_info.
Agreed.
> Also, now that I think about it, these checksums that contain the inum
> should also contain i_generation, so that there is no confusion with
> accessing old blocks on disk.
i_generation only gets updated when inodes are created or SETVERSION ioctl is
called, correct? I guess it wouldn't be too difficult to rewrite all file
metadata, though it could get a little expensive.
> > + crc = crc32c_le(crc, (__u8 *)dirent, size);
> > + return cpu_to_le32(crc);
> > +}
> > +
> > +int ext4_dirent_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent)
> > +{
> > + struct ext4_dir_entry_tail *t;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return 1;
> > +
> > + t = get_dirent_tail(inode, dirent);
> > + if (!t) {
> > + EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
> > + "leaf for checksum. Please run e2fsck -D.");
> > + return 0;
> > + }
>
> I don't think this should necessarily be considered an error. That
> there is no space in the directory block is not a sign of corruption.
I was trying to steer users towards running fsck, which will notice the lack of
space and rebuild the dir. With a somewhat large mallet. :)
> > +
> > + if (t->checksum != ext4_dirent_csum(inode, dirent))
> > + return 0;
> > +
> > + return 1;
> > +}
> > +
> > +static void ext4_dirent_csum_set(struct inode *inode,
> > + struct ext4_dir_entry *dirent)
> > +{
> > + struct ext4_dir_entry_tail *t;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return;
> > +
> > + t = get_dirent_tail(inode, dirent);
> > + if (!t) {
> > + EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
> > + "leaf for checksum. Please run e2fsck -D.");
> > + return;
> > + }
> > +
> > + t->checksum = ext4_dirent_csum(inode, dirent);
> > +}
> > +
> > +static inline int ext4_handle_dirty_dirent_node(handle_t *handle,
> > + struct inode *inode,
> > + struct buffer_head *bh)
> > +{
> > + ext4_dirent_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
> > + return ext4_handle_dirty_metadata(handle, inode, bh);
> > +}
> > +
> > static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
> > struct ext4_dir_entry *dirent,
> > int *offset)
> > @@ -748,6 +846,11 @@ static int htree_dirblock_to_tree(struct file *dir_file,
> > if (!(bh = ext4_bread (NULL, dir, block, 0, &err)))
> > return err;
> >
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
> > + return -EIO;
> > + set_buffer_verified(bh);
> > +
> > de = (struct ext4_dir_entry_2 *) bh->b_data;
>
> You might as well set de before calling ext4_dirent_csum_verify() to avoid
> having another unsightly cast.
Ok.
> > top = (struct ext4_dir_entry_2 *) ((char *) de +
> > dir->i_sb->s_blocksize -
> > @@ -1106,6 +1209,15 @@ restart:
> > brelse(bh);
> > goto next;
> > }
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(dir,
> > + (struct ext4_dir_entry *)bh->b_data)) {
> > + EXT4_ERROR_INODE(dir, "checksumming directory "
> > + "block %lu", (unsigned long)block);
> > + brelse(bh);
> > + goto next;
> > + }
> > + set_buffer_verified(bh);
> > i = search_dirblock(bh, dir, d_name,
> > block << EXT4_BLOCK_SIZE_BITS(sb), res_dir);
> > if (i == 1) {
> > @@ -1157,6 +1269,16 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir, const struct q
> > if (!(bh = ext4_bread(NULL, dir, block, 0, err)))
> > goto errout;
> >
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(dir,
> > + (struct ext4_dir_entry *)bh->b_data)) {
> > + EXT4_ERROR_INODE(dir, "checksumming directory "
> > + "block %lu", (unsigned long)block);
> > + brelse(bh);
> > + *err = -EIO;
> > + goto errout;
> > + }
> > + set_buffer_verified(bh);
> > retval = search_dirblock(bh, dir, d_name,
> > block << EXT4_BLOCK_SIZE_BITS(sb),
> > res_dir);
> > @@ -1329,8 +1451,14 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> > char *data1 = (*bh)->b_data, *data2;
> > unsigned split, move, size;
> > struct ext4_dir_entry_2 *de = NULL, *de2;
> > + struct ext4_dir_entry_tail *t;
> > + int csum_size = 0;
> > int err = 0, i;
> >
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + csum_size = sizeof(struct ext4_dir_entry_tail);
> > +
> > bh2 = ext4_append (handle, dir, &newblock, &err);
> > if (!(bh2)) {
> > brelse(*bh);
> > @@ -1377,10 +1505,24 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> > /* Fancy dance to stay within two buffers */
> > de2 = dx_move_dirents(data1, data2, map + split, count - split, blocksize);
> > de = dx_pack_dirents(data1, blocksize);
> > - de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
> > + de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
> > + (char *) de,
> > blocksize);
>
> (style) This should be "(char *)de, blocksize);"
>
> > - de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2,
> > + de2->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) -
> > + (char *) de2,
> > blocksize);
>
> (style) likewise
>
> > + if (csum_size) {
> > + t = (struct ext4_dir_entry_tail *)(data2 +
> > + (blocksize - csum_size));
> > + memset(t, 0, csum_size);
> > + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> > +
> > + t = (struct ext4_dir_entry_tail *)(data1 +
> > + (blocksize - csum_size));
> > + memset(t, 0, csum_size);
> > + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> > + }
> > +
> > dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
> > dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));
> >
> > @@ -1391,7 +1533,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> > de = de2;
> > }
> > dx_insert_block(frame, hash2 + continued, newblock);
> > - err = ext4_handle_dirty_metadata(handle, dir, bh2);
> > + err = ext4_handle_dirty_dirent_node(handle, dir, bh2);
> > if (err)
> > goto journal_error;
> > err = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
> > @@ -1431,11 +1573,16 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
> > unsigned short reclen;
> > int nlen, rlen, err;
> > char *top;
> > + int csum_size = 0;
> > +
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + csum_size = sizeof(struct ext4_dir_entry_tail);
> >
> > reclen = EXT4_DIR_REC_LEN(namelen);
> > if (!de) {
> > de = (struct ext4_dir_entry_2 *)bh->b_data;
> > - top = bh->b_data + blocksize - reclen;
> > + top = bh->b_data + (blocksize - csum_size) - reclen;
> > while ((char *) de <= top) {
> > if (ext4_check_dir_entry(dir, NULL, de, bh, offset))
> > return -EIO;
> > @@ -1491,7 +1638,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
> > dir->i_version++;
> > ext4_mark_inode_dirty(handle, dir);
> > BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> > - err = ext4_handle_dirty_metadata(handle, dir, bh);
> > + err = ext4_handle_dirty_dirent_node(handle, dir, bh);
> > if (err)
> > ext4_std_error(dir->i_sb, err);
> > return 0;
> > @@ -1512,6 +1659,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> > struct dx_frame frames[2], *frame;
> > struct dx_entry *entries;
> > struct ext4_dir_entry_2 *de, *de2;
> > + struct ext4_dir_entry_tail *t;
> > char *data1, *top;
> > unsigned len;
> > int retval;
> > @@ -1519,6 +1667,11 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> > struct dx_hash_info hinfo;
> > ext4_lblk_t block;
> > struct fake_dirent *fde;
> > + int csum_size = 0;
> > +
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + csum_size = sizeof(struct ext4_dir_entry_tail);
> >
> > blocksize = dir->i_sb->s_blocksize;
> > dxtrace(printk(KERN_DEBUG "Creating index: inode %lu\n", dir->i_ino));
> > @@ -1539,7 +1692,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> > brelse(bh);
> > return -EIO;
> > }
> > - len = ((char *) root) + blocksize - (char *) de;
> > + len = ((char *) root) + (blocksize - csum_size) - (char *) de;
>
> (style) "(char *)root)" and "(char *)de".
>
> > /* Allocate new block for the 0th block's dirents */
> > bh2 = ext4_append(handle, dir, &block, &retval);
> > @@ -1555,8 +1708,17 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> > top = data1 + len;
> > while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top)
> > de = de2;
> > - de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
> > + de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
> > + (char *) de,
> > blocksize);
>
> Likewise.
Ok, I'll fix the style complaints. Should checkpatch find these things?
--D
> > +
> > + if (csum_size) {
> > + t = (struct ext4_dir_entry_tail *)(data1 +
> > + (blocksize - csum_size));
> > + memset(t, 0, csum_size);
> > + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> > + }
> > +
> > /* Initialize the root; the dot dirents already exist */
> > de = (struct ext4_dir_entry_2 *) (&root->dotdot);
> > de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2),
> > @@ -1582,7 +1744,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> > bh = bh2;
> >
> > ext4_handle_dirty_dx_node(handle, dir, frame->bh);
> > - ext4_handle_dirty_metadata(handle, dir, bh);
> > + ext4_handle_dirty_dirent_node(handle, dir, bh);
> >
> > de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
> > if (!de) {
> > @@ -1618,11 +1780,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
> > struct inode *dir = dentry->d_parent->d_inode;
> > struct buffer_head *bh;
> > struct ext4_dir_entry_2 *de;
> > + struct ext4_dir_entry_tail *t;
> > struct super_block *sb;
> > int retval;
> > int dx_fallback=0;
> > unsigned blocksize;
> > ext4_lblk_t block, blocks;
> > + int csum_size = 0;
> > +
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + csum_size = sizeof(struct ext4_dir_entry_tail);
> >
> > sb = dir->i_sb;
> > blocksize = sb->s_blocksize;
> > @@ -1641,6 +1809,11 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
> > bh = ext4_bread(handle, dir, block, 0, &retval);
> > if(!bh)
> > return retval;
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(dir,
> > + (struct ext4_dir_entry *)bh->b_data))
> > + return -EIO;
> > + set_buffer_verified(bh);
> > retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
> > if (retval != -ENOSPC) {
> > brelse(bh);
> > @@ -1657,7 +1830,15 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
> > return retval;
> > de = (struct ext4_dir_entry_2 *) bh->b_data;
> > de->inode = 0;
> > - de->rec_len = ext4_rec_len_to_disk(blocksize, blocksize);
> > + de->rec_len = ext4_rec_len_to_disk(blocksize - csum_size, blocksize);
> > +
> > + if (csum_size) {
> > + t = (struct ext4_dir_entry_tail *)(((void *)bh->b_data) +
> > + (blocksize - csum_size));
> > + memset(t, 0, csum_size);
> > + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> > + }
> > +
> > retval = add_dirent_to_buf(handle, dentry, inode, de, bh);
> > brelse(bh);
> > if (retval == 0)
> > @@ -1689,6 +1870,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
> > if (!(bh = ext4_bread(handle,dir, dx_get_block(frame->at), 0, &err)))
> > goto cleanup;
> >
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
> > + goto journal_error;
> > + set_buffer_verified(bh);
> > +
> > BUFFER_TRACE(bh, "get_write_access");
> > err = ext4_journal_get_write_access(handle, bh);
> > if (err)
> > @@ -1814,12 +2000,17 @@ static int ext4_delete_entry(handle_t *handle,
> > {
> > struct ext4_dir_entry_2 *de, *pde;
> > unsigned int blocksize = dir->i_sb->s_blocksize;
> > + int csum_size = 0;
> > int i, err;
> >
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + csum_size = sizeof(struct ext4_dir_entry_tail);
> > +
> > i = 0;
> > pde = NULL;
> > de = (struct ext4_dir_entry_2 *) bh->b_data;
> > - while (i < bh->b_size) {
> > + while (i < bh->b_size - csum_size) {
> > if (ext4_check_dir_entry(dir, NULL, de, bh, i))
> > return -EIO;
> > if (de == de_del) {
> > @@ -1840,7 +2031,7 @@ static int ext4_delete_entry(handle_t *handle,
> > de->inode = 0;
> > dir->i_version++;
> > BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> > - err = ext4_handle_dirty_metadata(handle, dir, bh);
> > + err = ext4_handle_dirty_dirent_node(handle, dir, bh);
> > if (unlikely(err)) {
> > ext4_std_error(dir->i_sb, err);
> > return err;
> > @@ -1983,9 +2174,15 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> > struct inode *inode;
> > struct buffer_head *dir_block = NULL;
> > struct ext4_dir_entry_2 *de;
> > + struct ext4_dir_entry_tail *t;
> > unsigned int blocksize = dir->i_sb->s_blocksize;
> > + int csum_size = 0;
> > int err, retries = 0;
> >
> > + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + csum_size = sizeof(struct ext4_dir_entry_tail);
> > +
> > if (EXT4_DIR_LINK_MAX(dir))
> > return -EMLINK;
> >
> > @@ -2026,16 +2223,26 @@ retry:
> > ext4_set_de_type(dir->i_sb, de, S_IFDIR);
> > de = ext4_next_entry(de, blocksize);
> > de->inode = cpu_to_le32(dir->i_ino);
> > - de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(1),
> > + de->rec_len = ext4_rec_len_to_disk(blocksize -
> > + (csum_size + EXT4_DIR_REC_LEN(1)),
> > blocksize);
> > de->name_len = 2;
> > strcpy(de->name, "..");
> > ext4_set_de_type(dir->i_sb, de, S_IFDIR);
> > inode->i_nlink = 2;
> > +
> > + if (csum_size) {
> > + t = (struct ext4_dir_entry_tail *)(((void *)dir_block->b_data) +
> > + (blocksize - csum_size));
> > + memset(t, 0, csum_size);
> > + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> > + }
> > +
> > BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
> > - err = ext4_handle_dirty_metadata(handle, inode, dir_block);
> > + err = ext4_handle_dirty_dirent_node(handle, inode, dir_block);
> > if (err)
> > goto out_clear_inode;
> > + set_buffer_verified(dir_block);
> > err = ext4_mark_inode_dirty(handle, inode);
> > if (!err)
> > err = ext4_add_entry(handle, dentry, inode);
> > @@ -2085,6 +2292,14 @@ static int empty_dir(struct inode *inode)
> > inode->i_ino);
> > return 1;
> > }
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(inode,
> > + (struct ext4_dir_entry *)bh->b_data)) {
> > + EXT4_ERROR_INODE(inode, "checksum error reading directory "
> > + "lblock 0");
> > + return -EIO;
> > + }
> > + set_buffer_verified(bh);
> > de = (struct ext4_dir_entry_2 *) bh->b_data;
> > de1 = ext4_next_entry(de, sb->s_blocksize);
> > if (le32_to_cpu(de->inode) != inode->i_ino ||
> > @@ -2116,6 +2331,14 @@ static int empty_dir(struct inode *inode)
> > offset += sb->s_blocksize;
> > continue;
> > }
> > + if (!buffer_verified(bh) &&
> > + !ext4_dirent_csum_verify(inode,
> > + (struct ext4_dir_entry *)bh->b_data)) {
> > + EXT4_ERROR_INODE(inode, "checksum error "
> > + "reading directory lblock 0");
> > + return -EIO;
> > + }
> > + set_buffer_verified(bh);
> > de = (struct ext4_dir_entry_2 *) bh->b_data;
> > }
> > if (ext4_check_dir_entry(inode, NULL, de, bh, offset)) {
> > @@ -2616,6 +2839,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
> > dir_bh = ext4_bread(handle, old_inode, 0, 0, &retval);
> > if (!dir_bh)
> > goto end_rename;
> > + if (!buffer_verified(dir_bh) &&
> > + !ext4_dirent_csum_verify(old_inode,
> > + (struct ext4_dir_entry *)dir_bh->b_data))
> > + goto end_rename;
> > + set_buffer_verified(dir_bh);
> > if (le32_to_cpu(PARENT_INO(dir_bh->b_data,
> > old_dir->i_sb->s_blocksize)) != old_dir->i_ino)
> > goto end_rename;
> > @@ -2646,7 +2874,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
> > ext4_current_time(new_dir);
> > ext4_mark_inode_dirty(handle, new_dir);
> > BUFFER_TRACE(new_bh, "call ext4_handle_dirty_metadata");
> > - retval = ext4_handle_dirty_metadata(handle, new_dir, new_bh);
> > + retval = ext4_handle_dirty_dirent_node(handle, new_dir, new_bh);
> > if (unlikely(retval)) {
> > ext4_std_error(new_dir->i_sb, retval);
> > goto end_rename;
> > @@ -2700,7 +2928,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
> > PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
> > cpu_to_le32(new_dir->i_ino);
> > BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
> > - retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
> > + retval = ext4_handle_dirty_dirent_node(handle, old_inode,
> > + dir_bh);
> > if (retval) {
> > ext4_std_error(old_dir->i_sb, retval);
> > goto end_rename;
> >
>
On Thu, Sep 01, 2011 at 12:40:06AM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> > Calculate and verify the checksum for each extent tree block. The checksum is
> > located immediately after the last ext4_extent in the block, which is typically
> > 4-8 bytes in size.
>
> It would be more correct to write "... located in the space immediately
> following the last possible ext4_extent in the block."
Agreed.
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/ext4_extents.h | 25 ++++++++++++++++++-
> > fs/ext4/extents.c | 64 +++++++++++++++++++++++++++++++++++++++++++++---
> > 2 files changed, 84 insertions(+), 5 deletions(-)
> >
> >
> > diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
> > index 095c36f..24b106a 100644
> > --- a/fs/ext4/ext4_extents.h
> > +++ b/fs/ext4/ext4_extents.h
> > @@ -62,10 +62,22 @@
> > /*
> > * ext4_inode has i_block array (60 bytes total).
> > * The first 12 bytes store ext4_extent_header;
> > - * the remainder stores an array of ext4_extent.
> > + * the remainder stores an array of ext4_extent,
> > + * followed by ext4_extent_tail.
> > */
> >
> > /*
> > + * This is the extent tail on-disk structure.
> > + * All other extent structures are 12 bytes long. It turns out that
> > + * block_size % 12 >= 4 for all valid block sizes (1k, 2k, 4k).
> > + * Therefore, this tail structure can be crammed into the end of the block
> > + * without having to rebalance the tree.
> > + */
> > +struct ext4_extent_tail {
> > + __le32 et_checksum; /* crc32c(uuid+inum+extent_block) */
> > +};
>
> Did you do any analysis of extent blocks to see whether there is enough
> space in most extents to have a larger extent tail that stores the inode
> number and generation? This would be the same as with some directory
> blocks needing to add a directory entry to hold the checksum.
With 12-byte structures we're guaranteed 4 bytes that can be inserted without
needing to change any on-disk structures.
Most extent blocks aren't full, and provided that you decrease eh_max, you
could free up 8 more bytes for extra info. Of course then you'd have to write
code to rebalance the extent tree whenever you find a full extent block.
It's also possible that we could simply bake i_generation into the checksum
whenever we bake in i_inum.
> > +/*
> > * This is the extent on-disk structure.
> > * It's used at the bottom of the tree.
> > */
> > @@ -101,6 +113,17 @@ struct ext4_extent_header {
> >
> > #define EXT4_EXT_MAGIC cpu_to_le16(0xf30a)
> >
> > +#define EXT4_EXTENT_TAIL_OFFSET(hdr) \
> > + (sizeof(struct ext4_extent_header) + \
> > + (sizeof(struct ext4_extent) * le16_to_cpu((hdr)->eh_max)))
> > +
> > +static inline struct ext4_extent_tail *
> > +find_ext4_extent_tail(struct ext4_extent_header *eh)
>
> I don't really like using "find" in this function name, since it implies
> a search is needed. Maybe a name like ext4_extent_tail_ptr()?
Ok.
--D
> > +{
> > + return (struct ext4_extent_tail *)(((void *)eh) +
> > + EXT4_EXTENT_TAIL_OFFSET(eh));
> > +}
> > +
> > /*
> > * Array of ext4_ext_path contains path to some extent.
> > * Creation/lookup routines use it for traversal/splitting/etc.
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index 4ac4303..94f09ce 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -41,11 +41,57 @@
> > #include <linux/falloc.h>
> > #include <asm/uaccess.h>
> > #include <linux/fiemap.h>
> > +#include <linux/crc32c.h>
> > #include "ext4_jbd2.h"
> > #include "ext4_extents.h"
> >
> > #include <trace/events/ext4.h>
> >
> > +static __le32 ext4_extent_block_csum(struct inode *inode,
> > + struct ext4_extent_header *eh)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > + __le32 inum = cpu_to_le32(inode->i_ino);
> > + __u32 crc = 0;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return 0;
> > +
> > + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
> > + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
> > + crc = crc32c_le(crc, (__u8 *)eh, EXT4_EXTENT_TAIL_OFFSET(eh));
> > + return cpu_to_le32(crc);
> > +}
> > +
> > +static int ext4_extent_block_csum_verify(struct inode *inode,
> > + struct ext4_extent_header *eh)
> > +{
> > + struct ext4_extent_tail *et;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return 1;
> > +
> > + et = find_ext4_extent_tail(eh);
> > + if (et->et_checksum != ext4_extent_block_csum(inode, eh))
> > + return 0;
> > + return 1;
> > +}
> > +
> > +static void ext4_extent_block_csum_set(struct inode *inode,
> > + struct ext4_extent_header *eh)
> > +{
> > + struct ext4_extent_tail *et;
> > +
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return;
> > +
> > + et = find_ext4_extent_tail(eh);
> > + et->et_checksum = ext4_extent_block_csum(inode, eh);
> > +}
> > +
> > static int ext4_split_extent(handle_t *handle,
> > struct inode *inode,
> > struct ext4_ext_path *path,
> > @@ -101,6 +147,7 @@ static int ext4_ext_dirty(handle_t *handle, struct inode *inode,
> > {
> > int err;
> > if (path->p_bh) {
> > + ext4_extent_block_csum_set(inode, ext_block_hdr(path->p_bh));
> > /* path points to block */
> > err = ext4_handle_dirty_metadata(handle, inode, path->p_bh);
> > } else {
> > @@ -382,6 +429,12 @@ static int __ext4_ext_check(const char *function, unsigned int line,
> > error_msg = "invalid extent entries";
> > goto corrupted;
> > }
> > + /* Verify checksum on non-root extent tree nodes */
> > + if (ext_depth(inode) != depth &&
> > + !ext4_extent_block_csum_verify(inode, eh)) {
> > + error_msg = "extent tree corrupted";
> > + goto corrupted;
> > + }
> > return 0;
> >
> > corrupted:
> > @@ -922,6 +975,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
> > le16_add_cpu(&neh->eh_entries, m);
> > }
> >
> > + ext4_extent_block_csum_set(inode, neh);
> > set_buffer_uptodate(bh);
> > unlock_buffer(bh);
> >
> > @@ -1000,6 +1054,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
> > sizeof(struct ext4_extent_idx) * m);
> > le16_add_cpu(&neh->eh_entries, m);
> > }
> > + ext4_extent_block_csum_set(inode, neh);
> > set_buffer_uptodate(bh);
> > unlock_buffer(bh);
> >
> > @@ -1098,6 +1153,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
> > else
> > neh->eh_max = cpu_to_le16(ext4_ext_space_block(inode, 0));
> > neh->eh_magic = EXT4_EXT_MAGIC;
> > + ext4_extent_block_csum_set(inode, neh);
> > set_buffer_uptodate(bh);
> > unlock_buffer(bh);
> >
> > @@ -2458,10 +2514,6 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
> > if (uninitialized && num)
> > ext4_ext_mark_uninitialized(ex);
> >
> > - err = ext4_ext_dirty(handle, inode, path + depth);
> > - if (err)
> > - goto out;
> > -
> > /*
> > * If the extent was completely released,
> > * we need to remove it from the leaf
> > @@ -2483,6 +2535,10 @@ ext4_ext_rm_leaf(handle_t *handle, struct inode *inode,
> > le16_add_cpu(&eh->eh_entries, -1);
> > }
> >
> > + err = ext4_ext_dirty(handle, inode, path + depth);
> > + if (err)
> > + goto out;
> > +
> > ext_debug("new extent: %u:%u:%llu\n", block, num,
> > ext4_ext_pblock(ex));
> > ex--;
> >
>
On Thu, Sep 01, 2011 at 12:08:44AM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> > Compute and verify the checksum of the block bitmap; this checksum is stored in
> > the block group descriptor.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/balloc.c | 43 ++++++++++++++++++++++++++++++++++---------
> > fs/ext4/ext4.h | 7 ++++++-
> > fs/ext4/ialloc.c | 5 +++++
> > fs/ext4/mballoc.c | 34 ++++++++++++++++++++++++++++++++++
> > 4 files changed, 79 insertions(+), 10 deletions(-)
> >
> >
> > diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> > index f8224ad..36d3020 100644
> > --- a/fs/ext4/balloc.c
> > +++ b/fs/ext4/balloc.c
> > @@ -105,6 +105,10 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
> > ext4_free_inodes_set(sb, gdp, 0);
> > ext4_itable_unused_set(sb, gdp, 0);
> > memset(bh->b_data, 0xff, sb->s_blocksize);
> > + ext4_bitmap_csum_set(sb, block_group,
> > + &gdp->bg_block_bitmap_csum, bh,
> > + (EXT4_BLOCKS_PER_GROUP(sb) + 7) /
> > + 8);
> > return 0;
> > }
> > memset(bh->b_data, 0, sb->s_blocksize);
> > @@ -175,6 +179,11 @@ unsigned ext4_init_block_bitmap(struct super_block *sb, struct buffer_head *bh,
> > */
> > ext4_mark_bitmap_end(group_blocks, sb->s_blocksize * 8,
> > bh->b_data);
> > + ext4_bitmap_csum_set(sb, block_group,
> > + &gdp->bg_block_bitmap_csum, bh,
> > + (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
> > + gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group,
> > + gdp);
> > }
> > return free_blocks - ext4_group_used_meta_blocks(sb, block_group, gdp);
> > }
> > @@ -232,10 +241,10 @@ struct ext4_group_desc * ext4_get_group_desc(struct super_block *sb,
> > return desc;
> > }
> >
> > -static int ext4_valid_block_bitmap(struct super_block *sb,
> > - struct ext4_group_desc *desc,
> > - unsigned int block_group,
> > - struct buffer_head *bh)
> > +int ext4_valid_block_bitmap(struct super_block *sb,
> > + struct ext4_group_desc *desc,
> > + unsigned int block_group,
> > + struct buffer_head *bh)
> > {
> > ext4_grpblk_t offset;
> > ext4_grpblk_t next_zero_bit;
> > @@ -312,12 +321,12 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
> > }
> >
> > if (bitmap_uptodate(bh))
> > - return bh;
> > + goto verify;
> >
> > lock_buffer(bh);
> > if (bitmap_uptodate(bh)) {
> > unlock_buffer(bh);
> > - return bh;
> > + goto verify;
> > }
> > ext4_lock_group(sb, block_group);
> > if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> > @@ -336,7 +345,7 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
> > */
> > set_bitmap_uptodate(bh);
> > unlock_buffer(bh);
> > - return bh;
> > + goto verify;
> > }
> > /*
> > * submit the buffer_head for read. We can
> > @@ -353,11 +362,27 @@ ext4_read_block_bitmap(struct super_block *sb, ext4_group_t block_group)
> > block_group, bitmap_blk);
> > return NULL;
> > }
> > - ext4_valid_block_bitmap(sb, desc, block_group, bh);
> > +
> > +verify:
> > + if (buffer_verified(bh))
> > + return bh;
> > /*
> > * file system mounted not to panic on error,
> > - * continue with corrupt bitmap
> > + * -EIO with corrupt bitmap
> > */
> > + ext4_lock_group(sb, block_group);
> > + if (!ext4_valid_block_bitmap(sb, desc, block_group, bh) ||
> > + !ext4_bitmap_csum_verify(sb, block_group,
> > + desc->bg_block_bitmap_csum, bh,
> > + (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8)) {
> > + ext4_unlock_group(sb, block_group);
> > + put_bh(bh);
> > + ext4_error(sb, "Corrupt block bitmap - block_group = %u, "
> > + "block_bitmap = %llu", block_group, bitmap_blk);
> > + return NULL;
> > + }
> > + ext4_unlock_group(sb, block_group);
> > + set_buffer_verified(bh);
> > return bh;
> > }
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 248cbd2..df149b3 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -269,7 +269,8 @@ struct ext4_group_desc
> > __le16 bg_free_inodes_count_lo;/* Free inodes count */
> > __le16 bg_used_dirs_count_lo; /* Directories count */
> > __le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
> > - __u32 bg_reserved[2]; /* Likely block/inode bitmap checksum */
> > + __u32 bg_reserved[1]; /* unclaimed */
> > + __le32 bg_block_bitmap_csum; /* crc32c(uuid+group+bbitmap) */
>
> Same comment as for the inode bitmap checksum - it should be split into
> two __le16 fields, so that we get at least some coverage for the vast
> majority of existing filesystems.
>
> > __le16 bg_itable_unused_lo; /* Unused inodes count */
> > __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
> > __le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
> > @@ -1731,6 +1732,10 @@ void ext4_bitmap_csum_set(struct super_block *sb, ext4_group_t group,
> > __le32 *csum, struct buffer_head *bh, int sz);
> >
> > /* balloc.c */
> > +extern int ext4_valid_block_bitmap(struct super_block *sb,
> > + struct ext4_group_desc *desc,
> > + unsigned int block_group,
> > + struct buffer_head *bh);
> > extern unsigned int ext4_block_group(struct super_block *sb,
> > ext4_fsblk_t blocknr);
> > extern ext4_grpblk_t ext4_block_group_offset(struct super_block *sb,
> > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> > index 53faffc..a335d19 100644
> > --- a/fs/ext4/ialloc.c
> > +++ b/fs/ext4/ialloc.c
> > @@ -984,6 +984,11 @@ got:
> > free = ext4_free_blocks_after_init(sb, group, gdp);
> > gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
> > ext4_free_blks_set(sb, gdp, free);
> > + ext4_bitmap_csum_set(sb, group,
> > + &gdp->bg_block_bitmap_csum,
> > + block_bitmap_bh,
> > + (EXT4_BLOCKS_PER_GROUP(sb) + 7) /
> > + 8);
> > gdp->bg_checksum = ext4_group_desc_csum(sbi, group,
> > gdp);
> > }
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 17a5a57..8dc3055 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -895,6 +895,33 @@ static int ext4_mb_init_cache(struct page *page, char *incore)
> > if (bh[i] && !buffer_uptodate(bh[i]))
> > goto out;
> >
> > + for (i = 0; i < groups_per_page; i++) {
> > + struct ext4_group_desc *desc;
> > +
> > + if (!bh[i] || !bh[i]->b_end_io)
> > + continue;
>
> Please don't treat pointers as boolean values. I'd prefer to see a
> proper comparison like "bh[i] == NULL" here.
>
> Also, it isn't obvious why the check for b_end_io is needed here?
If b_end_io is set then the block is being read in and needs checking.
BH_Verified would work just as well here and be a little clearer. I think I
had this patch written before I added the BH_Verified flag and forgot to update
this patch. Oops. Good catch.
> > + desc = ext4_get_group_desc(sb, first_group + i, NULL);
> > + if (!desc)
> > + goto out;
> > +
> > + if (buffer_verified(bh[i]))
> > + continue;
> > + ext4_lock_group(sb, first_group + i);
> > + if (!ext4_valid_block_bitmap(sb, desc, first_group + i,
> > + bh[i]) ||
> > + !ext4_bitmap_csum_verify(sb, first_group + i,
> > + desc->bg_block_bitmap_csum, bh[i],
> > + (EXT4_BLOCKS_PER_GROUP(sb) + 7) /
> > + 8)) {
> > + ext4_unlock_group(sb, first_group + i);
> > + ext4_error(sb, "Corrupt block bitmap, group = %u",
> > + first_group + i);
> > + goto out;
> > + }
> > + ext4_unlock_group(sb, first_group + i);
> > + set_buffer_verified(bh[i]);
> > + }
>
> Since this is CPU intensive, it might make sense to start computing the
> block bitmap checksums as soon as the buffer is uptodate, instead of
> waiting for all of the buffers to be read and _then_ doing the checksums.
>
> Even better might be to do move all of the above code to do the checksum
> to be in a new the b_end_io callback, so that it can start as soon as each
> buffer is read from disk, to maximize CPU and IO overlap, like:
Good suggestion. I'll put it into the next rev.
--D
> struct ext4_csum_data {
> struct superblock *cd_sb;
> ext4_group_t cd_group;
> };
>
> static void ext4_end_buffered_read_sync_csum(struct buffer_head *bh,
> int uptodate)
> {
> struct superblock *sb = (struct ext4_csum_data *)(bh->b_private)->cd_sb;
> ext4_group_t group = (struct ext4_csum_data *)(bh->b_private)->cd_group;
>
> end_buffered_read_sync(bh, uptodate);
>
> if (uptodate) {
> struct ext4_group_desc *desc;
>
>
> desc = ext4_get_group_desc(sb, group, NULL);
> if (!desc)
> return;
>
> ext4_lock_group(sb, group);
> if (ext4_valid_block_bitmap(sb, desc, group, bh) &&
> ext4_bitmap_csum_verify(sb, group,
> desc->bg_block_bitmap_csum, bh,
> (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8))
> set_buffer_verified(bh);
>
> ext4_unlock_group(rcd->rcd_sb, rcd->rcd_group);
> }
> }
>
> Then later in the code can just check buffer_verified() in the caller:
>
> ext4_read_block_bitmap()
> {
> /* read all groups the page covers into the cache */
> for (i = 0; i < groups_per_page; i++) {
> :
> :
> set_bitmap_uptodate(bh[i]);
> ecd[i].cd_sb = sb;
> ecd[i].cd_group = first_group + i;
> bh[i]->b_end_io = ext4_end_buffer_read_sync_csum;
> submit_bh(READ, bh[i]);
> mb_debug(1, "read bitmap for group %u\n", first_group + i);
> }
>
> err = 0;
> /* always wait for I/O completion before returning */
> for (i = 0; i < groups_per_page; i++) {
> if (bh[i]) {
> wait_on_buffer(bh[i]);
> if (!buffer_uptodate(bh[i]) ||
> !buffer_verified(bh[i]))
> err = -EIO;
> }
> }
>
>
> > err = 0;
> > first_block = page->index * blocks_per_page;
> > for (i = 0; i < blocks_per_page; i++) {
> > @@ -2829,6 +2856,9 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
> > }
> > len = ext4_free_blks_count(sb, gdp) - ac->ac_b_ex.fe_len;
> > ext4_free_blks_set(sb, gdp, len);
> > + ext4_bitmap_csum_set(sb, ac->ac_b_ex.fe_group,
> > + &gdp->bg_block_bitmap_csum, bitmap_bh,
> > + (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
> > gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);
> >
> > ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
> > @@ -4638,6 +4668,8 @@ do_more:
> >
> > ret = ext4_free_blks_count(sb, gdp) + count;
> > ext4_free_blks_set(sb, gdp, ret);
> > + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_block_bitmap_csum,
> > + bitmap_bh, (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
> > gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
> > ext4_unlock_group(sb, block_group);
> > percpu_counter_add(&sbi->s_freeblocks_counter, count);
> > @@ -4780,6 +4812,8 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
> > mb_free_blocks(NULL, &e4b, bit, count);
> > blk_free_count = blocks_freed + ext4_free_blks_count(sb, desc);
> > ext4_free_blks_set(sb, desc, blk_free_count);
> > + ext4_bitmap_csum_set(sb, block_group, &desc->bg_block_bitmap_csum,
> > + bitmap_bh, (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
> > desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
> > ext4_unlock_group(sb, block_group);
> > percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);
> >
>
On Wed, Aug 31, 2011 at 10:49:05PM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> > Compute and verify the checksum of the inode bitmap; the checkum is stored in
> > the block group descriptor.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/ext4.h | 3 ++-
> > fs/ext4/ialloc.c | 33 ++++++++++++++++++++++++++++++---
> > 2 files changed, 32 insertions(+), 4 deletions(-)
> >
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index bc7ace1..248cbd2 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -279,7 +279,8 @@ struct ext4_group_desc
> > __le16 bg_free_inodes_count_hi;/* Free inodes count MSB */
> > __le16 bg_used_dirs_count_hi; /* Directories count MSB */
> > __le16 bg_itable_unused_hi; /* Unused inodes count MSB */
> > - __u32 bg_reserved2[3];
> > + __le32 bg_inode_bitmap_csum; /* crc32c(uuid+group+ibitmap) */
> > + __u32 bg_reserved2[2];
> > };
>
> I would prefer if there was a 16-bit checksum for the (most common)
> 32-byte group descriptors, and this was extended to a 32-bit checksum
> for the (much less common) 64-byte+ group descriptors. For filesystems
> that are newly formatted with the 64bit feature it makes no difference,
> but virtually all ext3/4 filesystems have only the smaller group descriptors.
>
> Regardless of whether using half of the crc32c is better or worse than
> using crc16 for the bitmap blocks, storing _any_ checksum is better than
> storing nothing at all. I would propose the following:
That's an interesting reframing of the argument that I hadn't considered. I'd
fallen into the idea of needing crc32c because of its bit error guarantees (all
corruptions of odd numbers of bits and all corruptions of fewer than ...4?
bits) that I hadn't quite realized that even if crc16 can't guarantee to find
any corruption at all, it still _might_, and that's better than nothing.
Ok, let's split the 32-bit fields and use crc16 for the case of 32-byte block
group descriptors.
> struct ext4_group_desc
> {
> __le32 bg_block_bitmap_lo; /* Blocks bitmap block */
> __le32 bg_inode_bitmap_lo; /* Inodes bitmap block */
> __le32 bg_inode_table_lo; /* Inodes table block */
> __le16 bg_free_blocks_count_lo; /* Free blocks count */
> __le16 bg_free_inodes_count_lo; /* Free inodes count */
> __le16 bg_used_dirs_count_lo; /* Directories count */
> __le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
> __le32 bg_exclude_bitmap_lo; /* Exclude bitmap block */
> __le16 bg_block_bitmap_csum_lo; /* Block bitmap checksum */
> __le16 bg_inode_bitmap_csum_lo; /* Inode bitmap checksum */
> __le16 bg_itable_unused_lo; /* Unused inodes count */
> __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
> __le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
> __le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
> __le32 bg_inode_table_hi; /* Inodes table block MSB */
> __le16 bg_free_blocks_count_hi; /* Free blocks count MSB */
> __le16 bg_free_inodes_count_hi; /* Free inodes count MSB */
> __le16 bg_used_dirs_count_hi; /* Directories count MSB */
> __le16 bg_itable_unused_hi; /* Unused inodes count MSB */
> __le32 bg_exclude_bitmap_hi; /* Exclude bitmap block MSB */
> __le16 bg_block_bitmap_csum_hi; /* Blocks bitmap checksum MSB */
> __le16 bg_inode_bitmap_csum_hi; /* Inodes bitmap checksum MSB */
> __le32 bg_reserved2;
> };
>
> This is also different from your layout because it locates the block bitmap
> checksum field before the inode bitmap checksum, to more closely match the
> order of other fields in this structure.
Er.. I reversed the order in the structure definition just prior to publishing,
and forgot to update the wiki page. Well I guess I'm about to update it again.
:)
> > /*
> > diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> > index 9c63f27..53faffc 100644
> > --- a/fs/ext4/ialloc.c
> > +++ b/fs/ext4/ialloc.c
> > @@ -82,12 +82,18 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
> > ext4_free_inodes_set(sb, gdp, 0);
> > ext4_itable_unused_set(sb, gdp, 0);
> > memset(bh->b_data, 0xff, sb->s_blocksize);
> > + ext4_bitmap_csum_set(sb, block_group,
> > + &gdp->bg_inode_bitmap_csum, bh,
> > + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
>
> The number of inodes per group is already always a multiple of 8.
Ok. I suppose we can fix that in the lines below too.
> > return 0;
> > }
> >
> > memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> > ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8,
> > bh->b_data);
> > + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum, bh,
> > + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> > + gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
> >
> > return EXT4_INODES_PER_GROUP(sb);
> > }
> > @@ -118,12 +124,12 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> > return NULL;
> > }
> > if (bitmap_uptodate(bh))
> > - return bh;
> > + goto verify;
> >
> > lock_buffer(bh);
> > if (bitmap_uptodate(bh)) {
> > unlock_buffer(bh);
> > - return bh;
> > + goto verify;
> > }
> >
> > ext4_lock_group(sb, block_group);
> > @@ -131,6 +137,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> > ext4_init_inode_bitmap(sb, bh, block_group, desc);
> > set_bitmap_uptodate(bh);
> > set_buffer_uptodate(bh);
> > + set_buffer_verified(bh);
> > ext4_unlock_group(sb, block_group);
> > unlock_buffer(bh);
> > return bh;
> > @@ -144,7 +151,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> > */
> > set_bitmap_uptodate(bh);
> > unlock_buffer(bh);
> > - return bh;
> > + goto verify;
> > }
> > /*
> > * submit the buffer_head for read. We can
> > @@ -161,6 +168,21 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> > block_group, bitmap_blk);
> > return NULL;
> > }
> > +
> > +verify:
> > + ext4_lock_group(sb, block_group);
> > + if (!buffer_verified(bh) &&
> > + !ext4_bitmap_csum_verify(sb, block_group,
> > + desc->bg_inode_bitmap_csum, bh,
> > + (EXT4_INODES_PER_GROUP(sb) + 7) / 8)) {
> > + ext4_unlock_group(sb, block_group);
> > + put_bh(bh);
> > + ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
> > + "inode_bitmap = %llu", block_group, bitmap_blk);
> > + return NULL;
>
> At some point we should add a flag like EXT4_BG_INODE_ERROR so that the
> group can be marked in error on disk, and skipped for future allocations,
> but the whole filesystem does not need to be remounted read-only. That's
> for another patch, however.
Agreed. :)
--D
> > + }
> > + ext4_unlock_group(sb, block_group);
> > + set_buffer_verified(bh);
> > return bh;
> > }
> >
> > @@ -265,6 +287,8 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> > ext4_used_dirs_set(sb, gdp, count);
> > percpu_counter_dec(&sbi->s_dirs_counter);
> > }
> > + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum,
> > + bitmap_bh, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> > gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
> > ext4_unlock_group(sb, block_group);
> >
> > @@ -784,6 +808,9 @@ static int ext4_claim_inode(struct super_block *sb,
> > atomic_inc(&sbi->s_flex_groups[f].used_dirs);
> > }
> > }
> > + ext4_bitmap_csum_set(sb, group, &gdp->bg_inode_bitmap_csum,
> > + inode_bitmap_bh,
> > + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> > gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
> > err_ret:
> > ext4_unlock_group(sb, group);
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Aug 31, 2011 at 08:30:25PM -0600, Andreas Dilger wrote:
> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> > This patch introduces to ext4 the ability to calculate and verify inode
> > checksums. This requires the use of a new ro compatibility flag and some
> > accompanying e2fsprogs patches to provide the relevant features in tune2fs and
> > e2fsck.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ext4/ext4.h | 4 ++--
> > fs/ext4/inode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 64 insertions(+), 2 deletions(-)
> >
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index f79ddac..e2361cc 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -609,7 +609,7 @@ struct ext4_inode {
> > __le16 l_i_file_acl_high;
> > __le16 l_i_uid_high; /* these 2 fields */
> > __le16 l_i_gid_high; /* were reserved2[0] */
> > - __u32 l_i_reserved2;
> > + __le32 l_i_checksum; /* crc32c(uuid+inum+inode) */
> > } linux2;
> > struct {
> > __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */
> > @@ -727,7 +727,7 @@ do { \
> > #define i_gid_low i_gid
> > #define i_uid_high osd2.linux2.l_i_uid_high
> > #define i_gid_high osd2.linux2.l_i_gid_high
> > -#define i_reserved2 osd2.linux2.l_i_reserved2
> > +#define i_checksum osd2.linux2.l_i_checksum
> >
> > #elif defined(__GNU__)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index c4da98a..44a7f88 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -38,6 +38,7 @@
> > #include <linux/printk.h>
> > #include <linux/slab.h>
> > #include <linux/ratelimit.h>
> > +#include <linux/crc32c.h>
> >
> > #include "ext4_jbd2.h"
> > #include "xattr.h"
> > @@ -49,6 +50,53 @@
> >
> > #define MPAGE_DA_EXTENT_TAIL 0x01
> >
> > +static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
> > +{
> > + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > + struct ext4_inode_info *ei = EXT4_I(inode);
> > + int offset = offsetof(struct ext4_inode, i_checksum);
>
> This could be declared "const int" so that it is not consuming space on
> the stack, or just put it inline in the code instead of a stack variable
> since it is a compile time constant.
>
> > + __le32 inum = cpu_to_le32(inode->i_ino);
> > + __u32 crc = 0;
> > +
> > + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> > + cpu_to_le32(EXT4_OS_LINUX))
>
> This can be marked unlikely() I think.
Ok.
> > + return 0;
> > + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return 0;
> > +
> > + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
> > + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
>
> I wonder if it makes sense to pre-compute the crc32c of s_uuid (stored
> in sbi) and/or s_uuid+inum (stored in struct ext4_inode_info). I suspect
> precomputing the s_uuid checksum is worthwhile, but I'm not sure whether
> precomputing the inode checksum is worthwhile unless it doesn't reduce
> the number of ext4_inode_info structs per page in the slab.
Sounds like a good idea, I'll look into it.
> > + crc = crc32c_le(crc, (__u8 *)raw, offset);
> > + offset += sizeof(raw->i_checksum); /* skip checksum */
> > + crc = crc32c_le(crc, (__u8 *)raw + offset,
> > + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize -
> > + offset);
>
> I suspect it would be more efficient to set raw->i_checksum = 0, then
> compute the checksum on the whole raw inode buffer, and fill in
> raw->i_checksum = cpu_to_le32(crc) at the end. That would mean the
> caller ext4_inode_csum_verify() should save the original checksum for
> comparison with the returned value.
You mean to avoid the overhead of the add/store and the second function call?
> The one problem with this is that it is racy w.r.t other users
Yeah, I was thinking that if I move the *_csum_set() calls to a jbd2 callback
(for journal mode, obviously) then this might clash with that. Maybe a better
approach would be to calculate/verify an entire block's worth of inodes at a
time. Then again, if you only want to touch /one/ inode out of a whole block,
that's a lot of unnecessary work.
> > + return cpu_to_le32(crc);
> > +}
> > +
> > +static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw)
> > +{
> > + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os ==
> > + cpu_to_le32(EXT4_OS_LINUX) &&
> > + EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
> > + (raw->i_checksum != ext4_inode_csum(inode, raw)))
>
> This check can be marked unlikely(), since the rare case of a checksum
> failure can cause a stall in the execution pipeline. It might make sense
> to put the unlikely() at the lone callsite to move the whole function call
> overhead out-of-line.
I suppose so, both for this and for all the other _verify() functions.
> > + return 0;
> > + return 1;
> > +}
> > +
> > +static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw)
> > +{
> > + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> > + cpu_to_le32(EXT4_OS_LINUX) ||
> > + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> > + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> > + return;
> > +
> > + raw->i_checksum = ext4_inode_csum(inode, raw);
> > +}
> > +
> > static inline int ext4_begin_ordered_truncate(struct inode *inode,
> > loff_t new_size)
> > {
> > @@ -3410,6 +3458,15 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> > if (ret < 0)
> > goto bad_inode;
> > raw_inode = ext4_raw_inode(&iloc);
> > +
> > + if (!ext4_inode_csum_verify(inode, raw_inode)) {
> > + EXT4_ERROR_INODE(inode, "checksum invalid (0x%x != 0x%x)",
> > + le32_to_cpu(ext4_inode_csum(inode, raw_inode)),
> > + le32_to_cpu(raw_inode->i_checksum));
> > + ret = -EIO;
> > + goto bad_inode;
> > + }
> > +
> > inode->i_mode = le16_to_cpu(raw_inode->i_mode);
> > inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
> > inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
> > @@ -3490,6 +3547,9 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> > ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> > if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> > EXT4_INODE_SIZE(inode->i_sb)) {
> > + EXT4_ERROR_INODE(inode, "bad extra_isize (%u != %u)",
> > + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize,
> > + EXT4_INODE_SIZE(inode->i_sb));
> > ret = -EIO;
> > goto bad_inode;
> > }
> > @@ -3731,6 +3791,8 @@ static int ext4_do_update_inode(handle_t *handle,
> > raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
> > }
> >
> > + ext4_inode_csum_set(inode, raw_inode);
>
> This might warrant a comment to always be the last function before
> submitting the inode to the journal.
Ok.
> > BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> > rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> > if (!err)
>
> Also, rather than just making the checksum be updated at commit time, it
> makes more sense to have ext4_do_update_inode() only be called once per
> commit, since this is an expensive function.
If I made jbd2 responsible for calling back into ext4 to apply checksums just
prior to submit_bh()ing metadata blocks, I think that would take care of this.
--D
>
> Cheers, Andreas
>
On 2011-09-02, at 12:57 PM, Darrick J. Wong wrote:
> On Thu, Sep 01, 2011 at 01:36:50AM -0600, Andreas Dilger wrote:
>> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
>>> /*
>>> + * This is a bogus directory entry at the end of each leaf block that
>>> + * records checksums.
>>> + */
>>> +struct ext4_dir_entry_tail {
>>> + __le32 reserved_zero1; /* Pretend to be unused */
>>> + __le16 rec_len; /* 12 */
>>> + __le16 reserved_zero2; /* Zero name length */
>>> + __le32 checksum; /* crc32c(uuid+inum+dirblock) */
>>> +};
>>
>> Since this field is stored inline with existing directory entries, it
>> may make sense to also add a magic value to this entry (preferably one
>> with non-ASCII values) so that it can be distinguished from an empty
>> dirent that happens to be at the end of the block.
>
> I could set the file type to 0xDE since currently there's only 8 file types
> defined.
This seems possible, since the dirent is empty the value stored in
file_type is largely irrelevant.
It also definitely makes sense to declare this as an "ext4_dir_entry_2"
style structure, since this is the only dirent that is used in the ext4
code. I'd be happy if you also deleted the ext4_dir_entry structure
definition from ext4.h, since it is unused and only serves to potentially
cause confusion if used accidentally.
You could do the whole sanity check for this tail dirent by treating
it as a 64-bit magic number:
struct ext4_dirent_tail {
union {
struct {
__le32 inode_zero; /* Pretend to be unused */
__le16 rec_len; /* 12 */
__u8 name_zero; /* Zero name length */
__u8 file_type; /* 0xde */
};
__le64 det_magic;
};
__le32 det_checksum;
};
#define EXT4_DIRENT_TAIL_MAGIC 0xde000c000000
That said, looking at this magic number doesn't give me a world of
confidence that it will not be accidentally duplicated, though at
the same time consecutive NUL bytes do not happen in filenames, so
maybe it is OK.
>>> /* checksumming functions */
>>> +static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
>>> + struct ext4_dir_entry *de)
>>> +{
>>> + struct ext4_dir_entry *d, *top;
>>> + struct ext4_dir_entry_tail *t;
>>> +
>>> + d = de;
>>> + top = (struct ext4_dir_entry *)(((void *)de) +
>>> + (EXT4_BLOCK_SIZE(inode->i_sb) -
>>> + sizeof(struct ext4_dir_entry_tail)));
>>> + while (d < top && d->rec_len)
>>> + d = (struct ext4_dir_entry *)(((void *)d) +
>>> + le16_to_cpu(d->rec_len));
>>
>> Calling get_dirent_tail() is fairly expensive, because it has to walk
>> the whole directory block each time. When filling a block it would
>> be O(n^2) for the number of entries in the block.
>>
>> It would be more efficient to just cast the end of the directory block
>> to the ext4_dir_entry_tail and check its validity, which is especially
>> easy if there is a magic value in it.
>>
>>> + if (d != top)
>>> + return NULL;
>>> +
>>> + t = (struct ext4_dir_entry_tail *)d;
>>> + if (t->reserved_zero1 ||
>>> + le16_to_cpu(t->rec_len) != sizeof(struct ext4_dir_entry_tail) ||
>>> + t->reserved_zero2)
>>
>> I'd prefer these reserved_zero[12] fields be explicitly compared to zero
>> instead of treated as boolean values.
>
> Ok.
>
>>> + return NULL;
>>> +
>>> + return t;
>>> +}
>>> +
>>> +static __le32 ext4_dirent_csum(struct inode *inode,
>>> + struct ext4_dir_entry *dirent)
>>> +{
>>> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>>> + struct ext4_dir_entry_tail *t;
>>> + __le32 inum = cpu_to_le32(inode->i_ino);
>>> + int size;
>>> + __u32 crc = 0;
>>> +
>>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + return 0;
>>> +
>>> + t = get_dirent_tail(inode, dirent);
>>
>>> + if (!t)
>>> + return 0;
>>> +
>>> + size = (void *)t - (void *)dirent;
>>> + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
>>> + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
>>
>> Based on the number of times that the s_uuid+inum checksum is used in this
>> code, and since it is constant for the life of the inode, it probably
>> makes sense to precompute it and store it in ext4_inode_info.
>
> Agreed.
>
>> Also, now that I think about it, these checksums that contain the inum
>> should also contain i_generation, so that there is no confusion with
>> accessing old blocks on disk.
>
> i_generation only gets updated when inodes are created or SETVERSION ioctl is
> called, correct? I guess it wouldn't be too difficult to rewrite all file
> metadata, though it could get a little expensive.
I don't think SETVERSION is used in any regular cases (at least I'm not
aware of any applications/tools that use it). Note that, despite the
name, this sets the i_generation field and not the NFSv4 i_version field,
so it should be constant for the life of the inode. In the short term you
could just disable SETVERSION on an inode if checksums are enabled, and
see if anyone complains about it at all.
Cheers, Andreas
>>> + crc = crc32c_le(crc, (__u8 *)dirent, size);
>>> + return cpu_to_le32(crc);
>>> +}
>>> +
>>> +int ext4_dirent_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent)
>>> +{
>>> + struct ext4_dir_entry_tail *t;
>>> +
>>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + return 1;
>>> +
>>> + t = get_dirent_tail(inode, dirent);
>>> + if (!t) {
>>> + EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
>>> + "leaf for checksum. Please run e2fsck -D.");
>>> + return 0;
>>> + }
>>
>> I don't think this should necessarily be considered an error. That
>> there is no space in the directory block is not a sign of corruption.
>
> I was trying to steer users towards running fsck, which will notice the
> lack of space and rebuild the dir. With a somewhat large mallet. :)
Yes, but this would cause a service interruption because the filesystem
will be mounted read-only and/or panic the kernel, which is not justified
for a situation which is legitimately possible and does not indicate data
corruption of the filesystem.
>>> +
>>> + if (t->checksum != ext4_dirent_csum(inode, dirent))
>>> + return 0;
>>> +
>>> + return 1;
>>> +}
>>> +
>>> +static void ext4_dirent_csum_set(struct inode *inode,
>>> + struct ext4_dir_entry *dirent)
>>> +{
>>> + struct ext4_dir_entry_tail *t;
>>> +
>>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + return;
>>> +
>>> + t = get_dirent_tail(inode, dirent);
>>> + if (!t) {
>>> + EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
>>> + "leaf for checksum. Please run e2fsck -D.");
>>> + return;
>>> + }
>>> +
>>> + t->checksum = ext4_dirent_csum(inode, dirent);
>>> +}
>>> +
>>> +static inline int ext4_handle_dirty_dirent_node(handle_t *handle,
>>> + struct inode *inode,
>>> + struct buffer_head *bh)
>>> +{
>>> + ext4_dirent_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
>>> + return ext4_handle_dirty_metadata(handle, inode, bh);
>>> +}
>>> +
>>> static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
>>> struct ext4_dir_entry *dirent,
>>> int *offset)
>>> @@ -748,6 +846,11 @@ static int htree_dirblock_to_tree(struct file *dir_file,
>>> if (!(bh = ext4_bread (NULL, dir, block, 0, &err)))
>>> return err;
>>>
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
>>> + return -EIO;
>>> + set_buffer_verified(bh);
>>> +
>>> de = (struct ext4_dir_entry_2 *) bh->b_data;
>>
>> You might as well set de before calling ext4_dirent_csum_verify() to avoid
>> having another unsightly cast.
>
> Ok.
>
>>> top = (struct ext4_dir_entry_2 *) ((char *) de +
>>> dir->i_sb->s_blocksize -
>>> @@ -1106,6 +1209,15 @@ restart:
>>> brelse(bh);
>>> goto next;
>>> }
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_dirent_csum_verify(dir,
>>> + (struct ext4_dir_entry *)bh->b_data)) {
>>> + EXT4_ERROR_INODE(dir, "checksumming directory "
>>> + "block %lu", (unsigned long)block);
>>> + brelse(bh);
>>> + goto next;
>>> + }
>>> + set_buffer_verified(bh);
>>> i = search_dirblock(bh, dir, d_name,
>>> block << EXT4_BLOCK_SIZE_BITS(sb), res_dir);
>>> if (i == 1) {
>>> @@ -1157,6 +1269,16 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir, const struct q
>>> if (!(bh = ext4_bread(NULL, dir, block, 0, err)))
>>> goto errout;
>>>
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_dirent_csum_verify(dir,
>>> + (struct ext4_dir_entry *)bh->b_data)) {
>>> + EXT4_ERROR_INODE(dir, "checksumming directory "
>>> + "block %lu", (unsigned long)block);
>>> + brelse(bh);
>>> + *err = -EIO;
>>> + goto errout;
>>> + }
>>> + set_buffer_verified(bh);
>>> retval = search_dirblock(bh, dir, d_name,
>>> block << EXT4_BLOCK_SIZE_BITS(sb),
>>> res_dir);
>>> @@ -1329,8 +1451,14 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
>>> char *data1 = (*bh)->b_data, *data2;
>>> unsigned split, move, size;
>>> struct ext4_dir_entry_2 *de = NULL, *de2;
>>> + struct ext4_dir_entry_tail *t;
>>> + int csum_size = 0;
>>> int err = 0, i;
>>>
>>> + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + csum_size = sizeof(struct ext4_dir_entry_tail);
>>> +
>>> bh2 = ext4_append (handle, dir, &newblock, &err);
>>> if (!(bh2)) {
>>> brelse(*bh);
>>> @@ -1377,10 +1505,24 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
>>> /* Fancy dance to stay within two buffers */
>>> de2 = dx_move_dirents(data1, data2, map + split, count - split, blocksize);
>>> de = dx_pack_dirents(data1, blocksize);
>>> - de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
>>> + de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
>>> + (char *) de,
>>> blocksize);
>>
>> (style) This should be "(char *)de, blocksize);"
>>
>>> - de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2,
>>> + de2->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) -
>>> + (char *) de2,
>>> blocksize);
>>
>> (style) likewise
>>
>>> + if (csum_size) {
>>> + t = (struct ext4_dir_entry_tail *)(data2 +
>>> + (blocksize - csum_size));
>>> + memset(t, 0, csum_size);
>>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
>>> +
>>> + t = (struct ext4_dir_entry_tail *)(data1 +
>>> + (blocksize - csum_size));
>>> + memset(t, 0, csum_size);
>>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
>>> + }
>>> +
>>> dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
>>> dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));
>>>
>>> @@ -1391,7 +1533,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
>>> de = de2;
>>> }
>>> dx_insert_block(frame, hash2 + continued, newblock);
>>> - err = ext4_handle_dirty_metadata(handle, dir, bh2);
>>> + err = ext4_handle_dirty_dirent_node(handle, dir, bh2);
>>> if (err)
>>> goto journal_error;
>>> err = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
>>> @@ -1431,11 +1573,16 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
>>> unsigned short reclen;
>>> int nlen, rlen, err;
>>> char *top;
>>> + int csum_size = 0;
>>> +
>>> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + csum_size = sizeof(struct ext4_dir_entry_tail);
>>>
>>> reclen = EXT4_DIR_REC_LEN(namelen);
>>> if (!de) {
>>> de = (struct ext4_dir_entry_2 *)bh->b_data;
>>> - top = bh->b_data + blocksize - reclen;
>>> + top = bh->b_data + (blocksize - csum_size) - reclen;
>>> while ((char *) de <= top) {
>>> if (ext4_check_dir_entry(dir, NULL, de, bh, offset))
>>> return -EIO;
>>> @@ -1491,7 +1638,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
>>> dir->i_version++;
>>> ext4_mark_inode_dirty(handle, dir);
>>> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>>> - err = ext4_handle_dirty_metadata(handle, dir, bh);
>>> + err = ext4_handle_dirty_dirent_node(handle, dir, bh);
>>> if (err)
>>> ext4_std_error(dir->i_sb, err);
>>> return 0;
>>> @@ -1512,6 +1659,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
>>> struct dx_frame frames[2], *frame;
>>> struct dx_entry *entries;
>>> struct ext4_dir_entry_2 *de, *de2;
>>> + struct ext4_dir_entry_tail *t;
>>> char *data1, *top;
>>> unsigned len;
>>> int retval;
>>> @@ -1519,6 +1667,11 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
>>> struct dx_hash_info hinfo;
>>> ext4_lblk_t block;
>>> struct fake_dirent *fde;
>>> + int csum_size = 0;
>>> +
>>> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + csum_size = sizeof(struct ext4_dir_entry_tail);
>>>
>>> blocksize = dir->i_sb->s_blocksize;
>>> dxtrace(printk(KERN_DEBUG "Creating index: inode %lu\n", dir->i_ino));
>>> @@ -1539,7 +1692,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
>>> brelse(bh);
>>> return -EIO;
>>> }
>>> - len = ((char *) root) + blocksize - (char *) de;
>>> + len = ((char *) root) + (blocksize - csum_size) - (char *) de;
>>
>> (style) "(char *)root)" and "(char *)de".
>>
>>> /* Allocate new block for the 0th block's dirents */
>>> bh2 = ext4_append(handle, dir, &block, &retval);
>>> @@ -1555,8 +1708,17 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
>>> top = data1 + len;
>>> while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top)
>>> de = de2;
>>> - de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
>>> + de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
>>> + (char *) de,
>>> blocksize);
>>
>> Likewise.
>
> Ok, I'll fix the style complaints. Should checkpatch find these things?
>
> --D
>
>>> +
>>> + if (csum_size) {
>>> + t = (struct ext4_dir_entry_tail *)(data1 +
>>> + (blocksize - csum_size));
>>> + memset(t, 0, csum_size);
>>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
>>> + }
>>> +
>>> /* Initialize the root; the dot dirents already exist */
>>> de = (struct ext4_dir_entry_2 *) (&root->dotdot);
>>> de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2),
>>> @@ -1582,7 +1744,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
>>> bh = bh2;
>>>
>>> ext4_handle_dirty_dx_node(handle, dir, frame->bh);
>>> - ext4_handle_dirty_metadata(handle, dir, bh);
>>> + ext4_handle_dirty_dirent_node(handle, dir, bh);
>>>
>>> de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
>>> if (!de) {
>>> @@ -1618,11 +1780,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
>>> struct inode *dir = dentry->d_parent->d_inode;
>>> struct buffer_head *bh;
>>> struct ext4_dir_entry_2 *de;
>>> + struct ext4_dir_entry_tail *t;
>>> struct super_block *sb;
>>> int retval;
>>> int dx_fallback=0;
>>> unsigned blocksize;
>>> ext4_lblk_t block, blocks;
>>> + int csum_size = 0;
>>> +
>>> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + csum_size = sizeof(struct ext4_dir_entry_tail);
>>>
>>> sb = dir->i_sb;
>>> blocksize = sb->s_blocksize;
>>> @@ -1641,6 +1809,11 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
>>> bh = ext4_bread(handle, dir, block, 0, &retval);
>>> if(!bh)
>>> return retval;
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_dirent_csum_verify(dir,
>>> + (struct ext4_dir_entry *)bh->b_data))
>>> + return -EIO;
>>> + set_buffer_verified(bh);
>>> retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
>>> if (retval != -ENOSPC) {
>>> brelse(bh);
>>> @@ -1657,7 +1830,15 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
>>> return retval;
>>> de = (struct ext4_dir_entry_2 *) bh->b_data;
>>> de->inode = 0;
>>> - de->rec_len = ext4_rec_len_to_disk(blocksize, blocksize);
>>> + de->rec_len = ext4_rec_len_to_disk(blocksize - csum_size, blocksize);
>>> +
>>> + if (csum_size) {
>>> + t = (struct ext4_dir_entry_tail *)(((void *)bh->b_data) +
>>> + (blocksize - csum_size));
>>> + memset(t, 0, csum_size);
>>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
>>> + }
>>> +
>>> retval = add_dirent_to_buf(handle, dentry, inode, de, bh);
>>> brelse(bh);
>>> if (retval == 0)
>>> @@ -1689,6 +1870,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
>>> if (!(bh = ext4_bread(handle,dir, dx_get_block(frame->at), 0, &err)))
>>> goto cleanup;
>>>
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
>>> + goto journal_error;
>>> + set_buffer_verified(bh);
>>> +
>>> BUFFER_TRACE(bh, "get_write_access");
>>> err = ext4_journal_get_write_access(handle, bh);
>>> if (err)
>>> @@ -1814,12 +2000,17 @@ static int ext4_delete_entry(handle_t *handle,
>>> {
>>> struct ext4_dir_entry_2 *de, *pde;
>>> unsigned int blocksize = dir->i_sb->s_blocksize;
>>> + int csum_size = 0;
>>> int i, err;
>>>
>>> + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + csum_size = sizeof(struct ext4_dir_entry_tail);
>>> +
>>> i = 0;
>>> pde = NULL;
>>> de = (struct ext4_dir_entry_2 *) bh->b_data;
>>> - while (i < bh->b_size) {
>>> + while (i < bh->b_size - csum_size) {
>>> if (ext4_check_dir_entry(dir, NULL, de, bh, i))
>>> return -EIO;
>>> if (de == de_del) {
>>> @@ -1840,7 +2031,7 @@ static int ext4_delete_entry(handle_t *handle,
>>> de->inode = 0;
>>> dir->i_version++;
>>> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>>> - err = ext4_handle_dirty_metadata(handle, dir, bh);
>>> + err = ext4_handle_dirty_dirent_node(handle, dir, bh);
>>> if (unlikely(err)) {
>>> ext4_std_error(dir->i_sb, err);
>>> return err;
>>> @@ -1983,9 +2174,15 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, int mode)
>>> struct inode *inode;
>>> struct buffer_head *dir_block = NULL;
>>> struct ext4_dir_entry_2 *de;
>>> + struct ext4_dir_entry_tail *t;
>>> unsigned int blocksize = dir->i_sb->s_blocksize;
>>> + int csum_size = 0;
>>> int err, retries = 0;
>>>
>>> + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + csum_size = sizeof(struct ext4_dir_entry_tail);
>>> +
>>> if (EXT4_DIR_LINK_MAX(dir))
>>> return -EMLINK;
>>>
>>> @@ -2026,16 +2223,26 @@ retry:
>>> ext4_set_de_type(dir->i_sb, de, S_IFDIR);
>>> de = ext4_next_entry(de, blocksize);
>>> de->inode = cpu_to_le32(dir->i_ino);
>>> - de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(1),
>>> + de->rec_len = ext4_rec_len_to_disk(blocksize -
>>> + (csum_size + EXT4_DIR_REC_LEN(1)),
>>> blocksize);
>>> de->name_len = 2;
>>> strcpy(de->name, "..");
>>> ext4_set_de_type(dir->i_sb, de, S_IFDIR);
>>> inode->i_nlink = 2;
>>> +
>>> + if (csum_size) {
>>> + t = (struct ext4_dir_entry_tail *)(((void *)dir_block->b_data) +
>>> + (blocksize - csum_size));
>>> + memset(t, 0, csum_size);
>>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
>>> + }
>>> +
>>> BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
>>> - err = ext4_handle_dirty_metadata(handle, inode, dir_block);
>>> + err = ext4_handle_dirty_dirent_node(handle, inode, dir_block);
>>> if (err)
>>> goto out_clear_inode;
>>> + set_buffer_verified(dir_block);
>>> err = ext4_mark_inode_dirty(handle, inode);
>>> if (!err)
>>> err = ext4_add_entry(handle, dentry, inode);
>>> @@ -2085,6 +2292,14 @@ static int empty_dir(struct inode *inode)
>>> inode->i_ino);
>>> return 1;
>>> }
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_dirent_csum_verify(inode,
>>> + (struct ext4_dir_entry *)bh->b_data)) {
>>> + EXT4_ERROR_INODE(inode, "checksum error reading directory "
>>> + "lblock 0");
>>> + return -EIO;
>>> + }
>>> + set_buffer_verified(bh);
>>> de = (struct ext4_dir_entry_2 *) bh->b_data;
>>> de1 = ext4_next_entry(de, sb->s_blocksize);
>>> if (le32_to_cpu(de->inode) != inode->i_ino ||
>>> @@ -2116,6 +2331,14 @@ static int empty_dir(struct inode *inode)
>>> offset += sb->s_blocksize;
>>> continue;
>>> }
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_dirent_csum_verify(inode,
>>> + (struct ext4_dir_entry *)bh->b_data)) {
>>> + EXT4_ERROR_INODE(inode, "checksum error "
>>> + "reading directory lblock 0");
>>> + return -EIO;
>>> + }
>>> + set_buffer_verified(bh);
>>> de = (struct ext4_dir_entry_2 *) bh->b_data;
>>> }
>>> if (ext4_check_dir_entry(inode, NULL, de, bh, offset)) {
>>> @@ -2616,6 +2839,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
>>> dir_bh = ext4_bread(handle, old_inode, 0, 0, &retval);
>>> if (!dir_bh)
>>> goto end_rename;
>>> + if (!buffer_verified(dir_bh) &&
>>> + !ext4_dirent_csum_verify(old_inode,
>>> + (struct ext4_dir_entry *)dir_bh->b_data))
>>> + goto end_rename;
>>> + set_buffer_verified(dir_bh);
>>> if (le32_to_cpu(PARENT_INO(dir_bh->b_data,
>>> old_dir->i_sb->s_blocksize)) != old_dir->i_ino)
>>> goto end_rename;
>>> @@ -2646,7 +2874,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
>>> ext4_current_time(new_dir);
>>> ext4_mark_inode_dirty(handle, new_dir);
>>> BUFFER_TRACE(new_bh, "call ext4_handle_dirty_metadata");
>>> - retval = ext4_handle_dirty_metadata(handle, new_dir, new_bh);
>>> + retval = ext4_handle_dirty_dirent_node(handle, new_dir, new_bh);
>>> if (unlikely(retval)) {
>>> ext4_std_error(new_dir->i_sb, retval);
>>> goto end_rename;
>>> @@ -2700,7 +2928,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
>>> PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
>>> cpu_to_le32(new_dir->i_ino);
>>> BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
>>> - retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
>>> + retval = ext4_handle_dirty_dirent_node(handle, old_inode,
>>> + dir_bh);
>>> if (retval) {
>>> ext4_std_error(old_dir->i_sb, retval);
>>> goto end_rename;
>>>
>>
On 2011-09-02, at 1:08 PM, Darrick J. Wong wrote:
> On Thu, Sep 01, 2011 at 12:08:44AM -0600, Andreas Dilger wrote:
>> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
>>> Compute and verify the checksum of the block bitmap; this checksum is stored in the block group descriptor.
>>
>> Since this is CPU intensive, it might make sense to start computing the
>> block bitmap checksums as soon as the buffer is uptodate, instead of
>> waiting for all of the buffers to be read and _then_ doing the checksums.
>>
>> Even better might be to do move all of the above code to do the checksum
>> to be in a new the b_end_io callback, so that it can start as soon as each
>> buffer is read from disk, to maximize CPU and IO overlap, like:
>
> Good suggestion. I'll put it into the next rev.
>
> --D
>
>> struct ext4_csum_data {
>> struct superblock *cd_sb;
>> ext4_group_t cd_group;
>> };
>>
>> static void ext4_end_buffered_read_sync_csum(struct buffer_head *bh,
>> int uptodate)
>> {
>> struct superblock *sb = (struct ext4_csum_data *)(bh->b_private)->cd_sb;
>> ext4_group_t group = (struct ext4_csum_data *)(bh->b_private)->cd_group;
>>
>> end_buffer_read_sync(bh, uptodate);
Actually, the call to end_buffer_read_sync() should go _after_ the
checksum is calculated, so that no other thread can start using
the buffer before the checksum is verified (i.e. any checks on
buffer_uptodate() could incorrectly succeed before the checksum
is computed and set_buffer_verified(bh) is called, and it would
incorrectly return an IO error thinking the checksum failed).
Also, end_buffer_sync_read() drops the reference to bh, but this
code is accessing bh->b_data, so another reason to move it after
the checksum is computed.
Cheers, Andreas
>> if (uptodate) {
>> struct ext4_group_desc *desc;
>>
>> desc = ext4_get_group_desc(sb, group, NULL);
>> if (!desc)
>> return;
>>
>> ext4_lock_group(sb, group);
>> if (ext4_valid_block_bitmap(sb, desc, group, bh) &&
>> ext4_bitmap_csum_verify(sb, group,
>> desc->bg_block_bitmap_csum, bh,
>> (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8))
>> set_buffer_verified(bh);
>>
>> ext4_unlock_group(rcd->rcd_sb, rcd->rcd_group);
>> }
>> }
>>
>> Then later in the code can just check buffer_verified() in the caller:
>>
>> ext4_read_block_bitmap()
>> {
>> /* read all groups the page covers into the cache */
>> for (i = 0; i < groups_per_page; i++) {
>> :
>> :
>> set_bitmap_uptodate(bh[i]);
>> ecd[i].cd_sb = sb;
>> ecd[i].cd_group = first_group + i;
>> bh[i]->b_end_io = ext4_end_buffer_read_sync_csum;
>> submit_bh(READ, bh[i]);
>> mb_debug(1, "read bitmap for group %u\n", first_group + i);
>> }
>>
>> err = 0;
>> /* always wait for I/O completion before returning */
>> for (i = 0; i < groups_per_page; i++) {
>> if (bh[i]) {
>> wait_on_buffer(bh[i]);
>> if (!buffer_uptodate(bh[i]) ||
>> !buffer_verified(bh[i]))
>> err = -EIO;
>> }
>> }
>>
>>
>>> err = 0;
>>> first_block = page->index * blocks_per_page;
>>> for (i = 0; i < blocks_per_page; i++) {
>>> @@ -2829,6 +2856,9 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
>>> }
>>> len = ext4_free_blks_count(sb, gdp) - ac->ac_b_ex.fe_len;
>>> ext4_free_blks_set(sb, gdp, len);
>>> + ext4_bitmap_csum_set(sb, ac->ac_b_ex.fe_group,
>>> + &gdp->bg_block_bitmap_csum, bitmap_bh,
>>> + (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
>>> gdp->bg_checksum = ext4_group_desc_csum(sbi, ac->ac_b_ex.fe_group, gdp);
>>>
>>> ext4_unlock_group(sb, ac->ac_b_ex.fe_group);
>>> @@ -4638,6 +4668,8 @@ do_more:
>>>
>>> ret = ext4_free_blks_count(sb, gdp) + count;
>>> ext4_free_blks_set(sb, gdp, ret);
>>> + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_block_bitmap_csum,
>>> + bitmap_bh, (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
>>> gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
>>> ext4_unlock_group(sb, block_group);
>>> percpu_counter_add(&sbi->s_freeblocks_counter, count);
>>> @@ -4780,6 +4812,8 @@ int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
>>> mb_free_blocks(NULL, &e4b, bit, count);
>>> blk_free_count = blocks_freed + ext4_free_blks_count(sb, desc);
>>> ext4_free_blks_set(sb, desc, blk_free_count);
>>> + ext4_bitmap_csum_set(sb, block_group, &desc->bg_block_bitmap_csum,
>>> + bitmap_bh, (EXT4_BLOCKS_PER_GROUP(sb) + 7) / 8);
>>> desc->bg_checksum = ext4_group_desc_csum(sbi, block_group, desc);
>>> ext4_unlock_group(sb, block_group);
>>> percpu_counter_add(&sbi->s_freeblocks_counter, blocks_freed);
>>>
>>
On 2011-09-02, at 1:18 PM, Darrick J. Wong wrote:
> On Wed, Aug 31, 2011 at 10:49:05PM -0600, Andreas Dilger wrote:
>> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
>>> Compute and verify the checksum of the inode bitmap; the checkum is stored in
>>> the block group descriptor.
>>
>> I would prefer if there was a 16-bit checksum for the (most common)
>> 32-byte group descriptors, and this was extended to a 32-bit checksum
>> for the (much less common) 64-byte+ group descriptors. For filesystems
>> that are newly formatted with the 64bit feature it makes no difference,
>> but virtually all ext3/4 filesystems have only the smaller group descriptors.
>>
>> Regardless of whether using half of the crc32c is better or worse than
>> using crc16 for the bitmap blocks, storing _any_ checksum is better than
>> storing nothing at all. I would propose the following:
>
> That's an interesting reframing of the argument that I hadn't considered.
> I'd fallen into the idea of needing crc32c because of its bit error
> guarantees (all corruptions of odd numbers of bits and all corruptions of
> fewer than ...4? bits) that I hadn't quite realized that even if crc16
> can't guarantee to find any corruption at all, it still _might_, and that's
> better than nothing.
>
> Ok, let's split the 32-bit fields and use crc16 for the case of 32-byte block
> group descriptors.
I noticed the crc16 calculation is actually _slower_ than crc32c,
probably because the CPU cannot use 32-bit values when computing the
result, so it has to do a lot of word masking, per your table at
https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums.
Also, there is the question of whether computing two different
checksums is needlessly complicating the code, or if it is easier
to just compute crc32c all the time and only make the storing of
the high 16 bits conditional.
What I'm suggesting is always computing the crc32c, but for filesystems
that are not formatted with the 64bit option just store the low 16 bits
of the crc32c value into bg_{block,inode}_bitmap_csum_lo. This is much
better than not computing a checksum here at all. The only open question
is whether 1/2 of crc32c is substantially worse at detecting errors than
crc16 or not?
I was also thinking whether the EXT4_FEATURE_RO_COMPAT_METADATA_CSUM
feature should also cause the bg_checksum to do the same (store only
low 16 bits of crc32c) just for the improved speed?
It might be interesting to redo the table that you computed, but
using a loop that is only computing the checksums for small blocks
of data (e.g. 32 bytes and 4096 bytes in a loop for a total of 512MB)
to see what the overhead of the cryptoapi and hardware calls are.
>> struct ext4_group_desc
>> {
>> __le32 bg_block_bitmap_lo; /* Blocks bitmap block */
>> __le32 bg_inode_bitmap_lo; /* Inodes bitmap block */
>> __le32 bg_inode_table_lo; /* Inodes table block */
>> __le16 bg_free_blocks_count_lo; /* Free blocks count */
>> __le16 bg_free_inodes_count_lo; /* Free inodes count */
>> __le16 bg_used_dirs_count_lo; /* Directories count */
>> __le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
>> __le32 bg_exclude_bitmap_lo; /* Exclude bitmap block */
>> __le16 bg_block_bitmap_csum_lo; /* Block bitmap checksum */
>> __le16 bg_inode_bitmap_csum_lo; /* Inode bitmap checksum */
>> __le16 bg_itable_unused_lo; /* Unused inodes count */
>> __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
>> __le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
>> __le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
>> __le32 bg_inode_table_hi; /* Inodes table block MSB */
>> __le16 bg_free_blocks_count_hi; /* Free blocks count MSB */
>> __le16 bg_free_inodes_count_hi; /* Free inodes count MSB */
>> __le16 bg_used_dirs_count_hi; /* Directories count MSB */
>> __le16 bg_itable_unused_hi; /* Unused inodes count MSB */
>> __le32 bg_exclude_bitmap_hi; /* Exclude bitmap block MSB */
>> __le16 bg_block_bitmap_csum_hi; /* Blocks bitmap checksum MSB */
>> __le16 bg_inode_bitmap_csum_hi; /* Inodes bitmap checksum MSB */
>> __le32 bg_reserved2;
>> };
>>
>> This is also different from your layout because it locates the block bitmap
>> checksum field before the inode bitmap checksum, to more closely match the
>> order of other fields in this structure.
>
> Er.. I reversed the order in the structure definition just prior to publishing,
> and forgot to update the wiki page. Well I guess I'm about to update it again.
> :)
>
>>> /*
>>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>>> index 9c63f27..53faffc 100644
>>> --- a/fs/ext4/ialloc.c
>>> +++ b/fs/ext4/ialloc.c
>>> @@ -82,12 +82,18 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
>>> ext4_free_inodes_set(sb, gdp, 0);
>>> ext4_itable_unused_set(sb, gdp, 0);
>>> memset(bh->b_data, 0xff, sb->s_blocksize);
>>> + ext4_bitmap_csum_set(sb, block_group,
>>> + &gdp->bg_inode_bitmap_csum, bh,
>>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
>>
>> The number of inodes per group is already always a multiple of 8.
>
> Ok. I suppose we can fix that in the lines below too.
>
>>> return 0;
>>> }
>>>
>>> memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
>>> ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8,
>>> bh->b_data);
>>> + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum, bh,
>>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
>>> + gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
>>>
>>> return EXT4_INODES_PER_GROUP(sb);
>>> }
>>> @@ -118,12 +124,12 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
>>> return NULL;
>>> }
>>> if (bitmap_uptodate(bh))
>>> - return bh;
>>> + goto verify;
>>>
>>> lock_buffer(bh);
>>> if (bitmap_uptodate(bh)) {
>>> unlock_buffer(bh);
>>> - return bh;
>>> + goto verify;
>>> }
>>>
>>> ext4_lock_group(sb, block_group);
>>> @@ -131,6 +137,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
>>> ext4_init_inode_bitmap(sb, bh, block_group, desc);
>>> set_bitmap_uptodate(bh);
>>> set_buffer_uptodate(bh);
>>> + set_buffer_verified(bh);
>>> ext4_unlock_group(sb, block_group);
>>> unlock_buffer(bh);
>>> return bh;
>>> @@ -144,7 +151,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
>>> */
>>> set_bitmap_uptodate(bh);
>>> unlock_buffer(bh);
>>> - return bh;
>>> + goto verify;
>>> }
>>> /*
>>> * submit the buffer_head for read. We can
>>> @@ -161,6 +168,21 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
>>> block_group, bitmap_blk);
>>> return NULL;
>>> }
>>> +
>>> +verify:
>>> + ext4_lock_group(sb, block_group);
>>> + if (!buffer_verified(bh) &&
>>> + !ext4_bitmap_csum_verify(sb, block_group,
>>> + desc->bg_inode_bitmap_csum, bh,
>>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8)) {
>>> + ext4_unlock_group(sb, block_group);
>>> + put_bh(bh);
>>> + ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
>>> + "inode_bitmap = %llu", block_group, bitmap_blk);
>>> + return NULL;
>>
>> At some point we should add a flag like EXT4_BG_INODE_ERROR so that the
>> group can be marked in error on disk, and skipped for future allocations,
>> but the whole filesystem does not need to be remounted read-only. That's
>> for another patch, however.
>
> Agreed. :)
>
> --D
>
>>> + }
>>> + ext4_unlock_group(sb, block_group);
>>> + set_buffer_verified(bh);
>>> return bh;
>>> }
>>>
>>> @@ -265,6 +287,8 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
>>> ext4_used_dirs_set(sb, gdp, count);
>>> percpu_counter_dec(&sbi->s_dirs_counter);
>>> }
>>> + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum,
>>> + bitmap_bh, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
>>> gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
>>> ext4_unlock_group(sb, block_group);
>>>
>>> @@ -784,6 +808,9 @@ static int ext4_claim_inode(struct super_block *sb,
>>> atomic_inc(&sbi->s_flex_groups[f].used_dirs);
>>> }
>>> }
>>> + ext4_bitmap_csum_set(sb, group, &gdp->bg_inode_bitmap_csum,
>>> + inode_bitmap_bh,
>>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
>>> gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
>>> err_ret:
>>> ext4_unlock_group(sb, group);
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2011-09-02, at 1:32 PM, Darrick J. Wong wrote:
> On Wed, Aug 31, 2011 at 08:30:25PM -0600, Andreas Dilger wrote:
>> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
>>> This patch introduces to ext4 the ability to calculate and verify inode
>>> checksums. This requires the use of a new ro compatibility flag and some
>>> accompanying e2fsprogs patches to provide the relevant features in tune2fs and e2fsck.
>>>
>>>
>>> +static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
>>> +{
>>> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>>> + struct ext4_inode_info *ei = EXT4_I(inode);
>>> + int offset = offsetof(struct ext4_inode, i_checksum);
>>
>> This could be declared "const int" so that it is not consuming space on
>> the stack, or just put it inline in the code instead of a stack variable
>> since it is a compile time constant.
>>
>>> + __le32 inum = cpu_to_le32(inode->i_ino);
>>> + __u32 crc = 0;
>>> +
>>> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
>>> + cpu_to_le32(EXT4_OS_LINUX))
>>
>> This can be marked unlikely() I think.
>
> Ok.
>
>>> + return 0;
>>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + return 0;
>>> +
>>> + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
>>> + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
>>
>> I wonder if it makes sense to pre-compute the crc32c of s_uuid (stored
>> in sbi) and/or s_uuid+inum (stored in struct ext4_inode_info). I suspect
>> precomputing the s_uuid checksum is worthwhile, but I'm not sure whether
>> precomputing the inode checksum is worthwhile unless it doesn't reduce
>> the number of ext4_inode_info structs per page in the slab.
>
> Sounds like a good idea, I'll look into it.
Looking more closely at the cryptoapi code, I'm fairly confident that
storing the partial crc32c for the uuid+inum+generation into the inode
is going to be worthwhile, compared to calling crc32c_le() 3 extra times.
>>> + crc = crc32c_le(crc, (__u8 *)raw, offset);
>>> + offset += sizeof(raw->i_checksum); /* skip checksum */
>>> + crc = crc32c_le(crc, (__u8 *)raw + offset,
>>> + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize -
>>> + offset);
>>
>> I suspect it would be more efficient to set raw->i_checksum = 0, then
>> compute the checksum on the whole raw inode buffer, and fill in
>> raw->i_checksum = cpu_to_le32(crc) at the end. That would mean the
>> caller ext4_inode_csum_verify() should save the original checksum for
>> comparison with the returned value.
>
> You mean to avoid the overhead of the add/store and the second function call?
Mostly the overhead of the extra calls into crc32c_le() and the cryptoapi.
There are a lot of extra pointer indirections in that code, and calling
into cryptoapi for 4-byte values adds (vaguely) 60-100 operations per word
on top of the actual checksum operations, unless it all disappears at
compile time (hard to see at first glance).
>> The one problem with this is that it is racy w.r.t other users
>
> Yeah, I was thinking that if I move the *_csum_set() calls to a jbd2 callback
> (for journal mode, obviously) then this might clash with that. Maybe a better
> approach would be to calculate/verify an entire block's worth of inodes at a
> time. Then again, if you only want to touch /one/ inode out of a whole block,
> that's a lot of unnecessary work.
However, if you are doing that from the jbd2 callback, the code also has
exclusive control over the buffer at that time, so computing the checksum
on the zeroed bytes in a single pass is not racy, and would definitely be
less overhead for such a small number of bytes.
>>> + return cpu_to_le32(crc);
>>> +}
>>> +
>>> +static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw)
>>> +{
>>> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os ==
>>> + cpu_to_le32(EXT4_OS_LINUX) &&
>>> + EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
>>> + (raw->i_checksum != ext4_inode_csum(inode, raw)))
>>
>> This check can be marked unlikely(), since the rare case of a checksum
>> failure can cause a stall in the execution pipeline. It might make sense
>> to put the unlikely() at the lone callsite to move the whole function call
>> overhead out-of-line.
>
> I suppose so, both for this and for all the other _verify() functions.
Right.
>>> + return 0;
>>> + return 1;
>>> +}
>>> +
>>> +static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw)
>>> +{
>>> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
>>> + cpu_to_le32(EXT4_OS_LINUX) ||
>>> + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
>>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
>>> + return;
>>> +
>>> + raw->i_checksum = ext4_inode_csum(inode, raw);
>>> +}
>>> +
>>> static inline int ext4_begin_ordered_truncate(struct inode *inode,
>>> loff_t new_size)
>>> {
>>> @@ -3410,6 +3458,15 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>>> if (ret < 0)
>>> goto bad_inode;
>>> raw_inode = ext4_raw_inode(&iloc);
>>> +
>>> + if (!ext4_inode_csum_verify(inode, raw_inode)) {
>>> + EXT4_ERROR_INODE(inode, "checksum invalid (0x%x != 0x%x)",
>>> + le32_to_cpu(ext4_inode_csum(inode, raw_inode)),
>>> + le32_to_cpu(raw_inode->i_checksum));
>>> + ret = -EIO;
>>> + goto bad_inode;
>>> + }
>>> +
>>> inode->i_mode = le16_to_cpu(raw_inode->i_mode);
>>> inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
>>> inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
>>> @@ -3490,6 +3547,9 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>>> ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
>>> if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
>>> EXT4_INODE_SIZE(inode->i_sb)) {
>>> + EXT4_ERROR_INODE(inode, "bad extra_isize (%u != %u)",
>>> + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize,
>>> + EXT4_INODE_SIZE(inode->i_sb));
>>> ret = -EIO;
>>> goto bad_inode;
>>> }
>>> @@ -3731,6 +3791,8 @@ static int ext4_do_update_inode(handle_t *handle,
>>> raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
>>> }
>>>
>>> + ext4_inode_csum_set(inode, raw_inode);
>>
>> This might warrant a comment to always be the last function before
>> submitting the inode to the journal.
>
> Ok.
>
>>> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>>> rc = ext4_handle_dirty_metadata(handle, NULL, bh);
>>> if (!err)
>>
>> Also, rather than just making the checksum be updated at commit time, it
>> makes more sense to have ext4_do_update_inode() only be called once per
>> commit, since this is an expensive function.
>
> If I made jbd2 responsible for calling back into ext4 to apply checksums just
> prior to submit_bh()ing metadata blocks, I think that would take care of this.
Yes, that would be the most desirable case, but it also means that the
journal code needs to pin all of these inodes in memory until after it
commits. Possibly the new ext4 ordered journal mode already does this,
but not sure about other journal modes.
I definitely like the idea of using the jbd2 pre-commit callbacks, but
I don't think it is necessarily needed for the first version of the
patches. Better to get the "simple" implementation working correctly
(so that we are sure it is doing the right thing), and then migrate it
over to using the commit callbacks so that we can verify it is still
correct.
Cheers, Andreas
>>>>> "Darrick" == Darrick J Wong <[email protected]> writes:
Darrick,
Darrick> Furthermore, the nice thing about the in-filesystem checksum is
Darrick> that we bake in other things like the FS UUID and the inode
Darrick> number, which gives you a somewhat better assurance that the
Darrick> data block belongs to the fs and the file that the code think
Darrick> it belongs to.
Yeah, I view DIF/DIX mostly as in-flight protection for writes. Whereas
FS metadata checksumming is great for problem detection at read time.
Another problem with using the DIF app tag to store filesystem metadata
is that many array vendors use it internally and thus only disk drives
are likely to provide the app tag space.
Darrick> The DIX interface allows for a 32-bit block number and a 16-bit
Darrick> application tag ... which is unfortunately small given 64-bit
Darrick> block numbers and 32-bit inode numbers.
I never understood the 32-bit ref tag. Seems silly to have a check that
wraps at the exact boundary where problems are most likely to occur.
I advocated for a DIF Type with 16-bit guard tag and 48-bit ref tag but
that never went anywhere. Too bad - would have been easy for the storage
vendors to implement.
Darrick> As a side note, the crc-t10dif implementation is quite slow --
Darrick> the hardware accelerated crc32c is 15x faster, and the sw
Darrick> implementation is usually 3-6x faster. I suspect somebody will
Darrick> want to fix that before DIF becomes more widespread...
The CRC32C op on Nehalem and beyond is really, really fast. It's
essentially free except for pulling the data through the cache. So it's
not entirely fair to use that as baseline for a pure software
implementation. What is the faster sw implementation are you referring
to, btw.?
lib/crc-t10dif is a regular 256-entry table-based CRC implementation. It
is done pretty much like all our other software CRCs. I seem to recall
attempting a bigger table but that yielded worse real life results due
to cache pollution.
On Westmere and beyond it is possible to accelerate generic CRC
calculation using the PCLMULQDQ operation. There are many of our CRC
functions that could benefit from this. However, so far intel have not
been willing to contribute the relevant code to Linux.
Darrick> The good news is that if you're really worried about integrity,
Darrick> metadata_csum and DIF/DIX aren't mutually exclusive features.
Darrick> Rejecting corrupted write commands at write time seems like a
Darrick> useful feature. :)
Yup!
--
Martin K. Petersen Oracle Linux Engineering
> On Westmere and beyond it is possible to accelerate generic CRC
> calculation using the PCLMULQDQ operation. There are many of our CRC
Faster than the lookup table? That's hard to believe.
-Andi
>>>>> "Andi" == Andi Kleen <[email protected]> writes:
>> On Westmere and beyond it is possible to accelerate generic CRC
>> calculation using the PCLMULQDQ operation. There are many of our CRC
Andi> Faster than the lookup table? That's hard to believe.
Using PCLMULQDQ you can parallelize the calculation. You can even boost
hw CRC32C performance that way.
http://download.intel.com/design/intarch/papers/323405.pdf
http://download.intel.com/design/intarch/papers/323102.pdf
--
Martin K. Petersen Oracle Linux Engineering
>>>>>> "Andi" == Andi Kleen <[email protected]> writes:
>
>>> On Westmere and beyond it is possible to accelerate generic CRC
>>> calculation using the PCLMULQDQ operation. There are many of our CRC
>
> Andi> Faster than the lookup table? That's hard to believe.
>
> Using PCLMULQDQ you can parallelize the calculation. You can even boost
> hw CRC32C performance that way.
>
> http://download.intel.com/design/intarch/papers/323405.pdf
>
> http://download.intel.com/design/intarch/papers/323102.pdf
Doesn't have any performance numbers.
You need to keep in mind that PCLMULQDQ uses FPU state, so any
speedup for the kernel must be large enough to amortize the cost of
saving the FPU state.
Typically that only works out for quite large buffers, but
kernel buffers are relatively small.
For the ext4 metadata a better approach is probably some sort of
incremental CRC, or possibly separate CRCs for very commonly
changed fields. When I looked at this most changes were only for
small fields.
-Andi
>>>>> "Andi" == Andi Kleen <[email protected]> writes:
Andi> Doesn't have any performance numbers.
It's been a while since I read them. I thought they had some compelling
numbers. Anyway, made a big difference in real life testing here. For
sustained I/O we're talking an order of magnitude.
Andi> You need to keep in mind that PCLMULQDQ uses FPU state, so any
Andi> speedup for the kernel must be large enough to amortize the cost
Andi> of saving the FPU state.
Yeah, my test cases were for bulk database I/O, not for writing a
handful of fs metadata blocks. Plus for the DB tests the CRC was
generated in userland.
I seem to recall Joel picking something other than the hw-accelerated
CRC32C for ocfs2 metadata and that didn't cause any problems.
That said, I do see a difference between IP checksum and CRC on normal
FS workloads with DIX enabled here.
Andi> Typically that only works out for quite large buffers, but kernel
Andi> buffers are relatively small.
*nod*
--
Martin K. Petersen Oracle Linux Engineering
On Fri, Sep 02, 2011 at 04:02:17PM -0600, Andreas Dilger wrote:
> On 2011-09-02, at 1:32 PM, Darrick J. Wong wrote:
> > On Wed, Aug 31, 2011 at 08:30:25PM -0600, Andreas Dilger wrote:
> >> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> >>> This patch introduces to ext4 the ability to calculate and verify inode
> >>> checksums. This requires the use of a new ro compatibility flag and some
> >>> accompanying e2fsprogs patches to provide the relevant features in tune2fs and e2fsck.
> >>>
> >>>
> >>> +static __le32 ext4_inode_csum(struct inode *inode, struct ext4_inode *raw)
> >>> +{
> >>> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> >>> + struct ext4_inode_info *ei = EXT4_I(inode);
> >>> + int offset = offsetof(struct ext4_inode, i_checksum);
> >>
> >> This could be declared "const int" so that it is not consuming space on
> >> the stack, or just put it inline in the code instead of a stack variable
> >> since it is a compile time constant.
> >>
> >>> + __le32 inum = cpu_to_le32(inode->i_ino);
> >>> + __u32 crc = 0;
> >>> +
> >>> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> >>> + cpu_to_le32(EXT4_OS_LINUX))
> >>
> >> This can be marked unlikely() I think.
> >
> > Ok.
> >
> >>> + return 0;
> >>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + return 0;
> >>> +
> >>> + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
> >>> + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
> >>
> >> I wonder if it makes sense to pre-compute the crc32c of s_uuid (stored
> >> in sbi) and/or s_uuid+inum (stored in struct ext4_inode_info). I suspect
> >> precomputing the s_uuid checksum is worthwhile, but I'm not sure whether
> >> precomputing the inode checksum is worthwhile unless it doesn't reduce
> >> the number of ext4_inode_info structs per page in the slab.
> >
> > Sounds like a good idea, I'll look into it.
>
> Looking more closely at the cryptoapi code, I'm fairly confident that
> storing the partial crc32c for the uuid+inum+generation into the inode
> is going to be worthwhile, compared to calling crc32c_le() 3 extra times.
Hmm, can the FS UUID change while the FS is mounted? Or, to look at this from
the other side, does anyone mind if tune2fs -U can tell you to umount before
changing UUID?
I think we need that anyway, to prevent races between tune2fs checksum rewrite
and kernel writing stuff.
I found a bug where if you mount a fs and write to it, then dumpe2fs -h will
complain about superblock checksum errors. Will have to look into that...
> >>> + crc = crc32c_le(crc, (__u8 *)raw, offset);
> >>> + offset += sizeof(raw->i_checksum); /* skip checksum */
> >>> + crc = crc32c_le(crc, (__u8 *)raw + offset,
> >>> + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize -
> >>> + offset);
> >>
> >> I suspect it would be more efficient to set raw->i_checksum = 0, then
> >> compute the checksum on the whole raw inode buffer, and fill in
> >> raw->i_checksum = cpu_to_le32(crc) at the end. That would mean the
> >> caller ext4_inode_csum_verify() should save the original checksum for
> >> comparison with the returned value.
> >
> > You mean to avoid the overhead of the add/store and the second function call?
>
> Mostly the overhead of the extra calls into crc32c_le() and the cryptoapi.
> There are a lot of extra pointer indirections in that code, and calling
> into cryptoapi for 4-byte values adds (vaguely) 60-100 operations per word
> on top of the actual checksum operations, unless it all disappears at
> compile time (hard to see at first glance).
>
> >> The one problem with this is that it is racy w.r.t other users
> >
> > Yeah, I was thinking that if I move the *_csum_set() calls to a jbd2 callback
> > (for journal mode, obviously) then this might clash with that. Maybe a better
> > approach would be to calculate/verify an entire block's worth of inodes at a
> > time. Then again, if you only want to touch /one/ inode out of a whole block,
> > that's a lot of unnecessary work.
>
> However, if you are doing that from the jbd2 callback, the code also has
> exclusive control over the buffer at that time, so computing the checksum
> on the zeroed bytes in a single pass is not racy, and would definitely be
> less overhead for such a small number of bytes.
>
> >>> + return cpu_to_le32(crc);
> >>> +}
> >>> +
> >>> +static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw)
> >>> +{
> >>> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os ==
> >>> + cpu_to_le32(EXT4_OS_LINUX) &&
> >>> + EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) &&
> >>> + (raw->i_checksum != ext4_inode_csum(inode, raw)))
> >>
> >> This check can be marked unlikely(), since the rare case of a checksum
> >> failure can cause a stall in the execution pipeline. It might make sense
> >> to put the unlikely() at the lone callsite to move the whole function call
> >> overhead out-of-line.
> >
> > I suppose so, both for this and for all the other _verify() functions.
>
> Right.
>
> >>> + return 0;
> >>> + return 1;
> >>> +}
> >>> +
> >>> +static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw)
> >>> +{
> >>> + if (EXT4_SB(inode->i_sb)->s_es->s_creator_os !=
> >>> + cpu_to_le32(EXT4_OS_LINUX) ||
> >>> + !EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + return;
> >>> +
> >>> + raw->i_checksum = ext4_inode_csum(inode, raw);
> >>> +}
> >>> +
> >>> static inline int ext4_begin_ordered_truncate(struct inode *inode,
> >>> loff_t new_size)
> >>> {
> >>> @@ -3410,6 +3458,15 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> >>> if (ret < 0)
> >>> goto bad_inode;
> >>> raw_inode = ext4_raw_inode(&iloc);
> >>> +
> >>> + if (!ext4_inode_csum_verify(inode, raw_inode)) {
> >>> + EXT4_ERROR_INODE(inode, "checksum invalid (0x%x != 0x%x)",
> >>> + le32_to_cpu(ext4_inode_csum(inode, raw_inode)),
> >>> + le32_to_cpu(raw_inode->i_checksum));
> >>> + ret = -EIO;
> >>> + goto bad_inode;
> >>> + }
> >>> +
> >>> inode->i_mode = le16_to_cpu(raw_inode->i_mode);
> >>> inode->i_uid = (uid_t)le16_to_cpu(raw_inode->i_uid_low);
> >>> inode->i_gid = (gid_t)le16_to_cpu(raw_inode->i_gid_low);
> >>> @@ -3490,6 +3547,9 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
> >>> ei->i_extra_isize = le16_to_cpu(raw_inode->i_extra_isize);
> >>> if (EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize >
> >>> EXT4_INODE_SIZE(inode->i_sb)) {
> >>> + EXT4_ERROR_INODE(inode, "bad extra_isize (%u != %u)",
> >>> + EXT4_GOOD_OLD_INODE_SIZE + ei->i_extra_isize,
> >>> + EXT4_INODE_SIZE(inode->i_sb));
> >>> ret = -EIO;
> >>> goto bad_inode;
> >>> }
> >>> @@ -3731,6 +3791,8 @@ static int ext4_do_update_inode(handle_t *handle,
> >>> raw_inode->i_extra_isize = cpu_to_le16(ei->i_extra_isize);
> >>> }
> >>>
> >>> + ext4_inode_csum_set(inode, raw_inode);
> >>
> >> This might warrant a comment to always be the last function before
> >> submitting the inode to the journal.
> >
> > Ok.
> >
> >>> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> >>> rc = ext4_handle_dirty_metadata(handle, NULL, bh);
> >>> if (!err)
> >>
> >> Also, rather than just making the checksum be updated at commit time, it
> >> makes more sense to have ext4_do_update_inode() only be called once per
> >> commit, since this is an expensive function.
> >
> > If I made jbd2 responsible for calling back into ext4 to apply checksums just
> > prior to submit_bh()ing metadata blocks, I think that would take care of this.
>
> Yes, that would be the most desirable case, but it also means that the
> journal code needs to pin all of these inodes in memory until after it
> commits. Possibly the new ext4 ordered journal mode already does this,
> but not sure about other journal modes.
>
> I definitely like the idea of using the jbd2 pre-commit callbacks, but
> I don't think it is necessarily needed for the first version of the
> patches. Better to get the "simple" implementation working correctly
> (so that we are sure it is doing the right thing), and then migrate it
> over to using the commit callbacks so that we can verify it is still
> correct.
I wasn't planning to start on this optimization until I finish addressing all
the other comments/complaints.
--D
>
> Cheers, Andreas--
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Fri, Sep 02, 2011 at 03:27:21PM -0600, Andreas Dilger wrote:
> On 2011-09-02, at 1:18 PM, Darrick J. Wong wrote:
> > On Wed, Aug 31, 2011 at 10:49:05PM -0600, Andreas Dilger wrote:
> >> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> >>> Compute and verify the checksum of the inode bitmap; the checkum is stored in
> >>> the block group descriptor.
> >>
> >> I would prefer if there was a 16-bit checksum for the (most common)
> >> 32-byte group descriptors, and this was extended to a 32-bit checksum
> >> for the (much less common) 64-byte+ group descriptors. For filesystems
> >> that are newly formatted with the 64bit feature it makes no difference,
> >> but virtually all ext3/4 filesystems have only the smaller group descriptors.
> >>
> >> Regardless of whether using half of the crc32c is better or worse than
> >> using crc16 for the bitmap blocks, storing _any_ checksum is better than
> >> storing nothing at all. I would propose the following:
> >
> > That's an interesting reframing of the argument that I hadn't considered.
> > I'd fallen into the idea of needing crc32c because of its bit error
> > guarantees (all corruptions of odd numbers of bits and all corruptions of
> > fewer than ...4? bits) that I hadn't quite realized that even if crc16
> > can't guarantee to find any corruption at all, it still _might_, and that's
> > better than nothing.
> >
> > Ok, let's split the 32-bit fields and use crc16 for the case of 32-byte block
> > group descriptors.
>
> I noticed the crc16 calculation is actually _slower_ than crc32c,
> probably because the CPU cannot use 32-bit values when computing the
> result, so it has to do a lot of word masking, per your table at
> https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums.
> Also, there is the question of whether computing two different
> checksums is needlessly complicating the code, or if it is easier
> to just compute crc32c all the time and only make the storing of
> the high 16 bits conditional.
>
> What I'm suggesting is always computing the crc32c, but for filesystems
> that are not formatted with the 64bit option just store the low 16 bits
> of the crc32c value into bg_{block,inode}_bitmap_csum_lo. This is much
> better than not computing a checksum here at all. The only open question
> is whether 1/2 of crc32c is substantially worse at detecting errors than
> crc16 or not?
All the literature I've read has suggested that crc16 can't guarantee any error
detection capability at all with data buffers longer than 256 bytes. So far in
my simulations I haven't seen that truncated-crc32c is particularly worse than
crc16, though they both seem to miss a lot of errors that crc32c would catch.
I guess we might as well use the fast one, it'll at least make the code
cleaner.
> I was also thinking whether the EXT4_FEATURE_RO_COMPAT_METADATA_CSUM
> feature should also cause the bg_checksum to do the same (store only
> low 16 bits of crc32c) just for the improved speed?
That mostly depends on how much overhead cryptoapi has over crc16 for small
blob sizes.
Or, if we imagine that bg descriptors might some day grow beyond 256 bytes,
maybe we allocate an extra 16 bits to store the entire crc32c? This isn't as
clear-cut as inodes where a user can specify a huge size at mkfs time.
Perhaps we ought to declare a "CRC type" field in the sb (where 0 = crc16 and 1
= crc32c) just in case there's ever a desire to change the checksum algorithm?
...or, for the bg tables, we could (ab)use the structure definition a bit.
crc32c the entire block, and store the low/high 16-bits of the checksum in the
first and second descriptors' checksum fields. This of course would result in
a loss of granularity (128 bgs go bad at once instead of just 1), so I don't
know if it's really needed for a small structure that can quadruple in size
before we need a longer CRC (and hasn't grown much in 15 years).
> It might be interesting to redo the table that you computed, but
> using a loop that is only computing the checksums for small blocks
> of data (e.g. 32 bytes and 4096 bytes in a loop for a total of 512MB)
> to see what the overhead of the cryptoapi and hardware calls are.
Yep.
--D
> >> struct ext4_group_desc
> >> {
> >> __le32 bg_block_bitmap_lo; /* Blocks bitmap block */
> >> __le32 bg_inode_bitmap_lo; /* Inodes bitmap block */
> >> __le32 bg_inode_table_lo; /* Inodes table block */
> >> __le16 bg_free_blocks_count_lo; /* Free blocks count */
> >> __le16 bg_free_inodes_count_lo; /* Free inodes count */
> >> __le16 bg_used_dirs_count_lo; /* Directories count */
> >> __le16 bg_flags; /* EXT4_BG_flags (INODE_UNINIT, etc) */
> >> __le32 bg_exclude_bitmap_lo; /* Exclude bitmap block */
> >> __le16 bg_block_bitmap_csum_lo; /* Block bitmap checksum */
> >> __le16 bg_inode_bitmap_csum_lo; /* Inode bitmap checksum */
> >> __le16 bg_itable_unused_lo; /* Unused inodes count */
> >> __le16 bg_checksum; /* crc16(sb_uuid+group+desc) */
> >> __le32 bg_block_bitmap_hi; /* Blocks bitmap block MSB */
> >> __le32 bg_inode_bitmap_hi; /* Inodes bitmap block MSB */
> >> __le32 bg_inode_table_hi; /* Inodes table block MSB */
> >> __le16 bg_free_blocks_count_hi; /* Free blocks count MSB */
> >> __le16 bg_free_inodes_count_hi; /* Free inodes count MSB */
> >> __le16 bg_used_dirs_count_hi; /* Directories count MSB */
> >> __le16 bg_itable_unused_hi; /* Unused inodes count MSB */
> >> __le32 bg_exclude_bitmap_hi; /* Exclude bitmap block MSB */
> >> __le16 bg_block_bitmap_csum_hi; /* Blocks bitmap checksum MSB */
> >> __le16 bg_inode_bitmap_csum_hi; /* Inodes bitmap checksum MSB */
> >> __le32 bg_reserved2;
> >> };
> >>
> >> This is also different from your layout because it locates the block bitmap
> >> checksum field before the inode bitmap checksum, to more closely match the
> >> order of other fields in this structure.
> >
> > Er.. I reversed the order in the structure definition just prior to publishing,
> > and forgot to update the wiki page. Well I guess I'm about to update it again.
> > :)
> >
> >>> /*
> >>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> >>> index 9c63f27..53faffc 100644
> >>> --- a/fs/ext4/ialloc.c
> >>> +++ b/fs/ext4/ialloc.c
> >>> @@ -82,12 +82,18 @@ static unsigned ext4_init_inode_bitmap(struct super_block *sb,
> >>> ext4_free_inodes_set(sb, gdp, 0);
> >>> ext4_itable_unused_set(sb, gdp, 0);
> >>> memset(bh->b_data, 0xff, sb->s_blocksize);
> >>> + ext4_bitmap_csum_set(sb, block_group,
> >>> + &gdp->bg_inode_bitmap_csum, bh,
> >>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> >>
> >> The number of inodes per group is already always a multiple of 8.
> >
> > Ok. I suppose we can fix that in the lines below too.
> >
> >>> return 0;
> >>> }
> >>>
> >>> memset(bh->b_data, 0, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> >>> ext4_mark_bitmap_end(EXT4_INODES_PER_GROUP(sb), sb->s_blocksize * 8,
> >>> bh->b_data);
> >>> + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum, bh,
> >>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> >>> + gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
> >>>
> >>> return EXT4_INODES_PER_GROUP(sb);
> >>> }
> >>> @@ -118,12 +124,12 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> >>> return NULL;
> >>> }
> >>> if (bitmap_uptodate(bh))
> >>> - return bh;
> >>> + goto verify;
> >>>
> >>> lock_buffer(bh);
> >>> if (bitmap_uptodate(bh)) {
> >>> unlock_buffer(bh);
> >>> - return bh;
> >>> + goto verify;
> >>> }
> >>>
> >>> ext4_lock_group(sb, block_group);
> >>> @@ -131,6 +137,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> >>> ext4_init_inode_bitmap(sb, bh, block_group, desc);
> >>> set_bitmap_uptodate(bh);
> >>> set_buffer_uptodate(bh);
> >>> + set_buffer_verified(bh);
> >>> ext4_unlock_group(sb, block_group);
> >>> unlock_buffer(bh);
> >>> return bh;
> >>> @@ -144,7 +151,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> >>> */
> >>> set_bitmap_uptodate(bh);
> >>> unlock_buffer(bh);
> >>> - return bh;
> >>> + goto verify;
> >>> }
> >>> /*
> >>> * submit the buffer_head for read. We can
> >>> @@ -161,6 +168,21 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
> >>> block_group, bitmap_blk);
> >>> return NULL;
> >>> }
> >>> +
> >>> +verify:
> >>> + ext4_lock_group(sb, block_group);
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_bitmap_csum_verify(sb, block_group,
> >>> + desc->bg_inode_bitmap_csum, bh,
> >>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8)) {
> >>> + ext4_unlock_group(sb, block_group);
> >>> + put_bh(bh);
> >>> + ext4_error(sb, "Corrupt inode bitmap - block_group = %u, "
> >>> + "inode_bitmap = %llu", block_group, bitmap_blk);
> >>> + return NULL;
> >>
> >> At some point we should add a flag like EXT4_BG_INODE_ERROR so that the
> >> group can be marked in error on disk, and skipped for future allocations,
> >> but the whole filesystem does not need to be remounted read-only. That's
> >> for another patch, however.
> >
> > Agreed. :)
> >
> > --D
> >
> >>> + }
> >>> + ext4_unlock_group(sb, block_group);
> >>> + set_buffer_verified(bh);
> >>> return bh;
> >>> }
> >>>
> >>> @@ -265,6 +287,8 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
> >>> ext4_used_dirs_set(sb, gdp, count);
> >>> percpu_counter_dec(&sbi->s_dirs_counter);
> >>> }
> >>> + ext4_bitmap_csum_set(sb, block_group, &gdp->bg_inode_bitmap_csum,
> >>> + bitmap_bh, (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> >>> gdp->bg_checksum = ext4_group_desc_csum(sbi, block_group, gdp);
> >>> ext4_unlock_group(sb, block_group);
> >>>
> >>> @@ -784,6 +808,9 @@ static int ext4_claim_inode(struct super_block *sb,
> >>> atomic_inc(&sbi->s_flex_groups[f].used_dirs);
> >>> }
> >>> }
> >>> + ext4_bitmap_csum_set(sb, group, &gdp->bg_inode_bitmap_csum,
> >>> + inode_bitmap_bh,
> >>> + (EXT4_INODES_PER_GROUP(sb) + 7) / 8);
> >>> gdp->bg_checksum = ext4_group_desc_csum(sbi, group, gdp);
> >>> err_ret:
> >>> ext4_unlock_group(sb, group);
> >>>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
One very interesting optimization would be to profile the metadata workload
where you got the slowdown and try to figure out which fields
were most commonly updated.
If it's some superblock fields or similar maybe those could get their separate
crcs that could be separatedly computed at much lower cost.
-Andi
On Fri, Sep 02, 2011 at 02:52:32PM -0600, Andreas Dilger wrote:
> On 2011-09-02, at 12:57 PM, Darrick J. Wong wrote:
> > On Thu, Sep 01, 2011 at 01:36:50AM -0600, Andreas Dilger wrote:
> >> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> >>> /*
> >>> + * This is a bogus directory entry at the end of each leaf block that
> >>> + * records checksums.
> >>> + */
> >>> +struct ext4_dir_entry_tail {
> >>> + __le32 reserved_zero1; /* Pretend to be unused */
> >>> + __le16 rec_len; /* 12 */
> >>> + __le16 reserved_zero2; /* Zero name length */
> >>> + __le32 checksum; /* crc32c(uuid+inum+dirblock) */
> >>> +};
> >>
> >> Since this field is stored inline with existing directory entries, it
> >> may make sense to also add a magic value to this entry (preferably one
> >> with non-ASCII values) so that it can be distinguished from an empty
> >> dirent that happens to be at the end of the block.
> >
> > I could set the file type to 0xDE since currently there's only 8 file types
> > defined.
>
> This seems possible, since the dirent is empty the value stored in
> file_type is largely irrelevant.
>
> It also definitely makes sense to declare this as an "ext4_dir_entry_2"
> style structure, since this is the only dirent that is used in the ext4
> code. I'd be happy if you also deleted the ext4_dir_entry structure
> definition from ext4.h, since it is unused and only serves to potentially
> cause confusion if used accidentally.
Perhaps, but as a separate patchset.
> You could do the whole sanity check for this tail dirent by treating
> it as a 64-bit magic number:
>
> struct ext4_dirent_tail {
> union {
> struct {
> __le32 inode_zero; /* Pretend to be unused */
> __le16 rec_len; /* 12 */
> __u8 name_zero; /* Zero name length */
> __u8 file_type; /* 0xde */
> };
> __le64 det_magic;
> };
> __le32 det_checksum;
> };
>
> #define EXT4_DIRENT_TAIL_MAGIC 0xde000c000000
>
> That said, looking at this magic number doesn't give me a world of
> confidence that it will not be accidentally duplicated, though at
> the same time consecutive NUL bytes do not happen in filenames, so
> maybe it is OK.
det_checksum sits where the filename usually goes, so that won't be the case.
On the other hand, there are very few zero-length directory entries, so we're
probably ok.
--D
>
> >>> /* checksumming functions */
> >>> +static struct ext4_dir_entry_tail *get_dirent_tail(struct inode *inode,
> >>> + struct ext4_dir_entry *de)
> >>> +{
> >>> + struct ext4_dir_entry *d, *top;
> >>> + struct ext4_dir_entry_tail *t;
> >>> +
> >>> + d = de;
> >>> + top = (struct ext4_dir_entry *)(((void *)de) +
> >>> + (EXT4_BLOCK_SIZE(inode->i_sb) -
> >>> + sizeof(struct ext4_dir_entry_tail)));
> >>> + while (d < top && d->rec_len)
> >>> + d = (struct ext4_dir_entry *)(((void *)d) +
> >>> + le16_to_cpu(d->rec_len));
> >>
> >> Calling get_dirent_tail() is fairly expensive, because it has to walk
> >> the whole directory block each time. When filling a block it would
> >> be O(n^2) for the number of entries in the block.
> >>
> >> It would be more efficient to just cast the end of the directory block
> >> to the ext4_dir_entry_tail and check its validity, which is especially
> >> easy if there is a magic value in it.
> >>
> >>> + if (d != top)
> >>> + return NULL;
> >>> +
> >>> + t = (struct ext4_dir_entry_tail *)d;
> >>> + if (t->reserved_zero1 ||
> >>> + le16_to_cpu(t->rec_len) != sizeof(struct ext4_dir_entry_tail) ||
> >>> + t->reserved_zero2)
> >>
> >> I'd prefer these reserved_zero[12] fields be explicitly compared to zero
> >> instead of treated as boolean values.
> >
> > Ok.
> >
> >>> + return NULL;
> >>> +
> >>> + return t;
> >>> +}
> >>> +
> >>> +static __le32 ext4_dirent_csum(struct inode *inode,
> >>> + struct ext4_dir_entry *dirent)
> >>> +{
> >>> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> >>> + struct ext4_dir_entry_tail *t;
> >>> + __le32 inum = cpu_to_le32(inode->i_ino);
> >>> + int size;
> >>> + __u32 crc = 0;
> >>> +
> >>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + return 0;
> >>> +
> >>> + t = get_dirent_tail(inode, dirent);
> >>
> >>> + if (!t)
> >>> + return 0;
> >>> +
> >>> + size = (void *)t - (void *)dirent;
> >>> + crc = crc32c_le(~0, sbi->s_es->s_uuid, sizeof(sbi->s_es->s_uuid));
> >>> + crc = crc32c_le(crc, (__u8 *)&inum, sizeof(inum));
> >>
> >> Based on the number of times that the s_uuid+inum checksum is used in this
> >> code, and since it is constant for the life of the inode, it probably
> >> makes sense to precompute it and store it in ext4_inode_info.
> >
> > Agreed.
> >
> >> Also, now that I think about it, these checksums that contain the inum
> >> should also contain i_generation, so that there is no confusion with
> >> accessing old blocks on disk.
> >
> > i_generation only gets updated when inodes are created or SETVERSION ioctl is
> > called, correct? I guess it wouldn't be too difficult to rewrite all file
> > metadata, though it could get a little expensive.
>
> I don't think SETVERSION is used in any regular cases (at least I'm not
> aware of any applications/tools that use it). Note that, despite the
> name, this sets the i_generation field and not the NFSv4 i_version field,
> so it should be constant for the life of the inode. In the short term you
> could just disable SETVERSION on an inode if checksums are enabled, and
> see if anyone complains about it at all.
>
> Cheers, Andreas
>
> >>> + crc = crc32c_le(crc, (__u8 *)dirent, size);
> >>> + return cpu_to_le32(crc);
> >>> +}
> >>> +
> >>> +int ext4_dirent_csum_verify(struct inode *inode, struct ext4_dir_entry *dirent)
> >>> +{
> >>> + struct ext4_dir_entry_tail *t;
> >>> +
> >>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + return 1;
> >>> +
> >>> + t = get_dirent_tail(inode, dirent);
> >>> + if (!t) {
> >>> + EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
> >>> + "leaf for checksum. Please run e2fsck -D.");
> >>> + return 0;
> >>> + }
> >>
> >> I don't think this should necessarily be considered an error. That
> >> there is no space in the directory block is not a sign of corruption.
> >
> > I was trying to steer users towards running fsck, which will notice the
> > lack of space and rebuild the dir. With a somewhat large mallet. :)
>
> Yes, but this would cause a service interruption because the filesystem
> will be mounted read-only and/or panic the kernel, which is not justified
> for a situation which is legitimately possible and does not indicate data
> corruption of the filesystem.
>
> >>> +
> >>> + if (t->checksum != ext4_dirent_csum(inode, dirent))
> >>> + return 0;
> >>> +
> >>> + return 1;
> >>> +}
> >>> +
> >>> +static void ext4_dirent_csum_set(struct inode *inode,
> >>> + struct ext4_dir_entry *dirent)
> >>> +{
> >>> + struct ext4_dir_entry_tail *t;
> >>> +
> >>> + if (!EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + return;
> >>> +
> >>> + t = get_dirent_tail(inode, dirent);
> >>> + if (!t) {
> >>> + EXT4_ERROR_INODE(inode, "metadata_csum set but no space in dir "
> >>> + "leaf for checksum. Please run e2fsck -D.");
> >>> + return;
> >>> + }
> >>> +
> >>> + t->checksum = ext4_dirent_csum(inode, dirent);
> >>> +}
> >>> +
> >>> +static inline int ext4_handle_dirty_dirent_node(handle_t *handle,
> >>> + struct inode *inode,
> >>> + struct buffer_head *bh)
> >>> +{
> >>> + ext4_dirent_csum_set(inode, (struct ext4_dir_entry *)bh->b_data);
> >>> + return ext4_handle_dirty_metadata(handle, inode, bh);
> >>> +}
> >>> +
> >>> static struct dx_countlimit *get_dx_countlimit(struct inode *inode,
> >>> struct ext4_dir_entry *dirent,
> >>> int *offset)
> >>> @@ -748,6 +846,11 @@ static int htree_dirblock_to_tree(struct file *dir_file,
> >>> if (!(bh = ext4_bread (NULL, dir, block, 0, &err)))
> >>> return err;
> >>>
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
> >>> + return -EIO;
> >>> + set_buffer_verified(bh);
> >>> +
> >>> de = (struct ext4_dir_entry_2 *) bh->b_data;
> >>
> >> You might as well set de before calling ext4_dirent_csum_verify() to avoid
> >> having another unsightly cast.
> >
> > Ok.
> >
> >>> top = (struct ext4_dir_entry_2 *) ((char *) de +
> >>> dir->i_sb->s_blocksize -
> >>> @@ -1106,6 +1209,15 @@ restart:
> >>> brelse(bh);
> >>> goto next;
> >>> }
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_dirent_csum_verify(dir,
> >>> + (struct ext4_dir_entry *)bh->b_data)) {
> >>> + EXT4_ERROR_INODE(dir, "checksumming directory "
> >>> + "block %lu", (unsigned long)block);
> >>> + brelse(bh);
> >>> + goto next;
> >>> + }
> >>> + set_buffer_verified(bh);
> >>> i = search_dirblock(bh, dir, d_name,
> >>> block << EXT4_BLOCK_SIZE_BITS(sb), res_dir);
> >>> if (i == 1) {
> >>> @@ -1157,6 +1269,16 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir, const struct q
> >>> if (!(bh = ext4_bread(NULL, dir, block, 0, err)))
> >>> goto errout;
> >>>
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_dirent_csum_verify(dir,
> >>> + (struct ext4_dir_entry *)bh->b_data)) {
> >>> + EXT4_ERROR_INODE(dir, "checksumming directory "
> >>> + "block %lu", (unsigned long)block);
> >>> + brelse(bh);
> >>> + *err = -EIO;
> >>> + goto errout;
> >>> + }
> >>> + set_buffer_verified(bh);
> >>> retval = search_dirblock(bh, dir, d_name,
> >>> block << EXT4_BLOCK_SIZE_BITS(sb),
> >>> res_dir);
> >>> @@ -1329,8 +1451,14 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> >>> char *data1 = (*bh)->b_data, *data2;
> >>> unsigned split, move, size;
> >>> struct ext4_dir_entry_2 *de = NULL, *de2;
> >>> + struct ext4_dir_entry_tail *t;
> >>> + int csum_size = 0;
> >>> int err = 0, i;
> >>>
> >>> + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + csum_size = sizeof(struct ext4_dir_entry_tail);
> >>> +
> >>> bh2 = ext4_append (handle, dir, &newblock, &err);
> >>> if (!(bh2)) {
> >>> brelse(*bh);
> >>> @@ -1377,10 +1505,24 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> >>> /* Fancy dance to stay within two buffers */
> >>> de2 = dx_move_dirents(data1, data2, map + split, count - split, blocksize);
> >>> de = dx_pack_dirents(data1, blocksize);
> >>> - de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
> >>> + de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
> >>> + (char *) de,
> >>> blocksize);
> >>
> >> (style) This should be "(char *)de, blocksize);"
> >>
> >>> - de2->rec_len = ext4_rec_len_to_disk(data2 + blocksize - (char *) de2,
> >>> + de2->rec_len = ext4_rec_len_to_disk(data2 + (blocksize - csum_size) -
> >>> + (char *) de2,
> >>> blocksize);
> >>
> >> (style) likewise
> >>
> >>> + if (csum_size) {
> >>> + t = (struct ext4_dir_entry_tail *)(data2 +
> >>> + (blocksize - csum_size));
> >>> + memset(t, 0, csum_size);
> >>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> >>> +
> >>> + t = (struct ext4_dir_entry_tail *)(data1 +
> >>> + (blocksize - csum_size));
> >>> + memset(t, 0, csum_size);
> >>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> >>> + }
> >>> +
> >>> dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data1, blocksize, 1));
> >>> dxtrace(dx_show_leaf (hinfo, (struct ext4_dir_entry_2 *) data2, blocksize, 1));
> >>>
> >>> @@ -1391,7 +1533,7 @@ static struct ext4_dir_entry_2 *do_split(handle_t *handle, struct inode *dir,
> >>> de = de2;
> >>> }
> >>> dx_insert_block(frame, hash2 + continued, newblock);
> >>> - err = ext4_handle_dirty_metadata(handle, dir, bh2);
> >>> + err = ext4_handle_dirty_dirent_node(handle, dir, bh2);
> >>> if (err)
> >>> goto journal_error;
> >>> err = ext4_handle_dirty_dx_node(handle, dir, frame->bh);
> >>> @@ -1431,11 +1573,16 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
> >>> unsigned short reclen;
> >>> int nlen, rlen, err;
> >>> char *top;
> >>> + int csum_size = 0;
> >>> +
> >>> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + csum_size = sizeof(struct ext4_dir_entry_tail);
> >>>
> >>> reclen = EXT4_DIR_REC_LEN(namelen);
> >>> if (!de) {
> >>> de = (struct ext4_dir_entry_2 *)bh->b_data;
> >>> - top = bh->b_data + blocksize - reclen;
> >>> + top = bh->b_data + (blocksize - csum_size) - reclen;
> >>> while ((char *) de <= top) {
> >>> if (ext4_check_dir_entry(dir, NULL, de, bh, offset))
> >>> return -EIO;
> >>> @@ -1491,7 +1638,7 @@ static int add_dirent_to_buf(handle_t *handle, struct dentry *dentry,
> >>> dir->i_version++;
> >>> ext4_mark_inode_dirty(handle, dir);
> >>> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> >>> - err = ext4_handle_dirty_metadata(handle, dir, bh);
> >>> + err = ext4_handle_dirty_dirent_node(handle, dir, bh);
> >>> if (err)
> >>> ext4_std_error(dir->i_sb, err);
> >>> return 0;
> >>> @@ -1512,6 +1659,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> >>> struct dx_frame frames[2], *frame;
> >>> struct dx_entry *entries;
> >>> struct ext4_dir_entry_2 *de, *de2;
> >>> + struct ext4_dir_entry_tail *t;
> >>> char *data1, *top;
> >>> unsigned len;
> >>> int retval;
> >>> @@ -1519,6 +1667,11 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> >>> struct dx_hash_info hinfo;
> >>> ext4_lblk_t block;
> >>> struct fake_dirent *fde;
> >>> + int csum_size = 0;
> >>> +
> >>> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + csum_size = sizeof(struct ext4_dir_entry_tail);
> >>>
> >>> blocksize = dir->i_sb->s_blocksize;
> >>> dxtrace(printk(KERN_DEBUG "Creating index: inode %lu\n", dir->i_ino));
> >>> @@ -1539,7 +1692,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> >>> brelse(bh);
> >>> return -EIO;
> >>> }
> >>> - len = ((char *) root) + blocksize - (char *) de;
> >>> + len = ((char *) root) + (blocksize - csum_size) - (char *) de;
> >>
> >> (style) "(char *)root)" and "(char *)de".
> >>
> >>> /* Allocate new block for the 0th block's dirents */
> >>> bh2 = ext4_append(handle, dir, &block, &retval);
> >>> @@ -1555,8 +1708,17 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> >>> top = data1 + len;
> >>> while ((char *)(de2 = ext4_next_entry(de, blocksize)) < top)
> >>> de = de2;
> >>> - de->rec_len = ext4_rec_len_to_disk(data1 + blocksize - (char *) de,
> >>> + de->rec_len = ext4_rec_len_to_disk(data1 + (blocksize - csum_size) -
> >>> + (char *) de,
> >>> blocksize);
> >>
> >> Likewise.
> >
> > Ok, I'll fix the style complaints. Should checkpatch find these things?
> >
> > --D
> >
> >>> +
> >>> + if (csum_size) {
> >>> + t = (struct ext4_dir_entry_tail *)(data1 +
> >>> + (blocksize - csum_size));
> >>> + memset(t, 0, csum_size);
> >>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> >>> + }
> >>> +
> >>> /* Initialize the root; the dot dirents already exist */
> >>> de = (struct ext4_dir_entry_2 *) (&root->dotdot);
> >>> de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(2),
> >>> @@ -1582,7 +1744,7 @@ static int make_indexed_dir(handle_t *handle, struct dentry *dentry,
> >>> bh = bh2;
> >>>
> >>> ext4_handle_dirty_dx_node(handle, dir, frame->bh);
> >>> - ext4_handle_dirty_metadata(handle, dir, bh);
> >>> + ext4_handle_dirty_dirent_node(handle, dir, bh);
> >>>
> >>> de = do_split(handle,dir, &bh, frame, &hinfo, &retval);
> >>> if (!de) {
> >>> @@ -1618,11 +1780,17 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
> >>> struct inode *dir = dentry->d_parent->d_inode;
> >>> struct buffer_head *bh;
> >>> struct ext4_dir_entry_2 *de;
> >>> + struct ext4_dir_entry_tail *t;
> >>> struct super_block *sb;
> >>> int retval;
> >>> int dx_fallback=0;
> >>> unsigned blocksize;
> >>> ext4_lblk_t block, blocks;
> >>> + int csum_size = 0;
> >>> +
> >>> + if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + csum_size = sizeof(struct ext4_dir_entry_tail);
> >>>
> >>> sb = dir->i_sb;
> >>> blocksize = sb->s_blocksize;
> >>> @@ -1641,6 +1809,11 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
> >>> bh = ext4_bread(handle, dir, block, 0, &retval);
> >>> if(!bh)
> >>> return retval;
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_dirent_csum_verify(dir,
> >>> + (struct ext4_dir_entry *)bh->b_data))
> >>> + return -EIO;
> >>> + set_buffer_verified(bh);
> >>> retval = add_dirent_to_buf(handle, dentry, inode, NULL, bh);
> >>> if (retval != -ENOSPC) {
> >>> brelse(bh);
> >>> @@ -1657,7 +1830,15 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
> >>> return retval;
> >>> de = (struct ext4_dir_entry_2 *) bh->b_data;
> >>> de->inode = 0;
> >>> - de->rec_len = ext4_rec_len_to_disk(blocksize, blocksize);
> >>> + de->rec_len = ext4_rec_len_to_disk(blocksize - csum_size, blocksize);
> >>> +
> >>> + if (csum_size) {
> >>> + t = (struct ext4_dir_entry_tail *)(((void *)bh->b_data) +
> >>> + (blocksize - csum_size));
> >>> + memset(t, 0, csum_size);
> >>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> >>> + }
> >>> +
> >>> retval = add_dirent_to_buf(handle, dentry, inode, de, bh);
> >>> brelse(bh);
> >>> if (retval == 0)
> >>> @@ -1689,6 +1870,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct dentry *dentry,
> >>> if (!(bh = ext4_bread(handle,dir, dx_get_block(frame->at), 0, &err)))
> >>> goto cleanup;
> >>>
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_dirent_csum_verify(dir, (struct ext4_dir_entry *)bh->b_data))
> >>> + goto journal_error;
> >>> + set_buffer_verified(bh);
> >>> +
> >>> BUFFER_TRACE(bh, "get_write_access");
> >>> err = ext4_journal_get_write_access(handle, bh);
> >>> if (err)
> >>> @@ -1814,12 +2000,17 @@ static int ext4_delete_entry(handle_t *handle,
> >>> {
> >>> struct ext4_dir_entry_2 *de, *pde;
> >>> unsigned int blocksize = dir->i_sb->s_blocksize;
> >>> + int csum_size = 0;
> >>> int i, err;
> >>>
> >>> + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + csum_size = sizeof(struct ext4_dir_entry_tail);
> >>> +
> >>> i = 0;
> >>> pde = NULL;
> >>> de = (struct ext4_dir_entry_2 *) bh->b_data;
> >>> - while (i < bh->b_size) {
> >>> + while (i < bh->b_size - csum_size) {
> >>> if (ext4_check_dir_entry(dir, NULL, de, bh, i))
> >>> return -EIO;
> >>> if (de == de_del) {
> >>> @@ -1840,7 +2031,7 @@ static int ext4_delete_entry(handle_t *handle,
> >>> de->inode = 0;
> >>> dir->i_version++;
> >>> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
> >>> - err = ext4_handle_dirty_metadata(handle, dir, bh);
> >>> + err = ext4_handle_dirty_dirent_node(handle, dir, bh);
> >>> if (unlikely(err)) {
> >>> ext4_std_error(dir->i_sb, err);
> >>> return err;
> >>> @@ -1983,9 +2174,15 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, int mode)
> >>> struct inode *inode;
> >>> struct buffer_head *dir_block = NULL;
> >>> struct ext4_dir_entry_2 *de;
> >>> + struct ext4_dir_entry_tail *t;
> >>> unsigned int blocksize = dir->i_sb->s_blocksize;
> >>> + int csum_size = 0;
> >>> int err, retries = 0;
> >>>
> >>> + if (EXT4_HAS_RO_COMPAT_FEATURE(dir->i_sb,
> >>> + EXT4_FEATURE_RO_COMPAT_METADATA_CSUM))
> >>> + csum_size = sizeof(struct ext4_dir_entry_tail);
> >>> +
> >>> if (EXT4_DIR_LINK_MAX(dir))
> >>> return -EMLINK;
> >>>
> >>> @@ -2026,16 +2223,26 @@ retry:
> >>> ext4_set_de_type(dir->i_sb, de, S_IFDIR);
> >>> de = ext4_next_entry(de, blocksize);
> >>> de->inode = cpu_to_le32(dir->i_ino);
> >>> - de->rec_len = ext4_rec_len_to_disk(blocksize - EXT4_DIR_REC_LEN(1),
> >>> + de->rec_len = ext4_rec_len_to_disk(blocksize -
> >>> + (csum_size + EXT4_DIR_REC_LEN(1)),
> >>> blocksize);
> >>> de->name_len = 2;
> >>> strcpy(de->name, "..");
> >>> ext4_set_de_type(dir->i_sb, de, S_IFDIR);
> >>> inode->i_nlink = 2;
> >>> +
> >>> + if (csum_size) {
> >>> + t = (struct ext4_dir_entry_tail *)(((void *)dir_block->b_data) +
> >>> + (blocksize - csum_size));
> >>> + memset(t, 0, csum_size);
> >>> + t->rec_len = ext4_rec_len_to_disk(csum_size, blocksize);
> >>> + }
> >>> +
> >>> BUFFER_TRACE(dir_block, "call ext4_handle_dirty_metadata");
> >>> - err = ext4_handle_dirty_metadata(handle, inode, dir_block);
> >>> + err = ext4_handle_dirty_dirent_node(handle, inode, dir_block);
> >>> if (err)
> >>> goto out_clear_inode;
> >>> + set_buffer_verified(dir_block);
> >>> err = ext4_mark_inode_dirty(handle, inode);
> >>> if (!err)
> >>> err = ext4_add_entry(handle, dentry, inode);
> >>> @@ -2085,6 +2292,14 @@ static int empty_dir(struct inode *inode)
> >>> inode->i_ino);
> >>> return 1;
> >>> }
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_dirent_csum_verify(inode,
> >>> + (struct ext4_dir_entry *)bh->b_data)) {
> >>> + EXT4_ERROR_INODE(inode, "checksum error reading directory "
> >>> + "lblock 0");
> >>> + return -EIO;
> >>> + }
> >>> + set_buffer_verified(bh);
> >>> de = (struct ext4_dir_entry_2 *) bh->b_data;
> >>> de1 = ext4_next_entry(de, sb->s_blocksize);
> >>> if (le32_to_cpu(de->inode) != inode->i_ino ||
> >>> @@ -2116,6 +2331,14 @@ static int empty_dir(struct inode *inode)
> >>> offset += sb->s_blocksize;
> >>> continue;
> >>> }
> >>> + if (!buffer_verified(bh) &&
> >>> + !ext4_dirent_csum_verify(inode,
> >>> + (struct ext4_dir_entry *)bh->b_data)) {
> >>> + EXT4_ERROR_INODE(inode, "checksum error "
> >>> + "reading directory lblock 0");
> >>> + return -EIO;
> >>> + }
> >>> + set_buffer_verified(bh);
> >>> de = (struct ext4_dir_entry_2 *) bh->b_data;
> >>> }
> >>> if (ext4_check_dir_entry(inode, NULL, de, bh, offset)) {
> >>> @@ -2616,6 +2839,11 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
> >>> dir_bh = ext4_bread(handle, old_inode, 0, 0, &retval);
> >>> if (!dir_bh)
> >>> goto end_rename;
> >>> + if (!buffer_verified(dir_bh) &&
> >>> + !ext4_dirent_csum_verify(old_inode,
> >>> + (struct ext4_dir_entry *)dir_bh->b_data))
> >>> + goto end_rename;
> >>> + set_buffer_verified(dir_bh);
> >>> if (le32_to_cpu(PARENT_INO(dir_bh->b_data,
> >>> old_dir->i_sb->s_blocksize)) != old_dir->i_ino)
> >>> goto end_rename;
> >>> @@ -2646,7 +2874,7 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
> >>> ext4_current_time(new_dir);
> >>> ext4_mark_inode_dirty(handle, new_dir);
> >>> BUFFER_TRACE(new_bh, "call ext4_handle_dirty_metadata");
> >>> - retval = ext4_handle_dirty_metadata(handle, new_dir, new_bh);
> >>> + retval = ext4_handle_dirty_dirent_node(handle, new_dir, new_bh);
> >>> if (unlikely(retval)) {
> >>> ext4_std_error(new_dir->i_sb, retval);
> >>> goto end_rename;
> >>> @@ -2700,7 +2928,8 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
> >>> PARENT_INO(dir_bh->b_data, new_dir->i_sb->s_blocksize) =
> >>> cpu_to_le32(new_dir->i_ino);
> >>> BUFFER_TRACE(dir_bh, "call ext4_handle_dirty_metadata");
> >>> - retval = ext4_handle_dirty_metadata(handle, old_inode, dir_bh);
> >>> + retval = ext4_handle_dirty_dirent_node(handle, old_inode,
> >>> + dir_bh);
> >>> if (retval) {
> >>> ext4_std_error(old_dir->i_sb, retval);
> >>> goto end_rename;
> >>>
> >>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 04, 2011 at 07:41:03AM -0400, Martin K. Petersen wrote:
> >>>>> "Darrick" == Darrick J Wong <[email protected]> writes:
>
> Darrick,
>
> Darrick> Furthermore, the nice thing about the in-filesystem checksum is
> Darrick> that we bake in other things like the FS UUID and the inode
> Darrick> number, which gives you a somewhat better assurance that the
> Darrick> data block belongs to the fs and the file that the code think
> Darrick> it belongs to.
>
> Yeah, I view DIF/DIX mostly as in-flight protection for writes. Whereas
> FS metadata checksumming is great for problem detection at read time.
>
> Another problem with using the DIF app tag to store filesystem metadata
> is that many array vendors use it internally and thus only disk drives
> are likely to provide the app tag space.
>
>
> Darrick> The DIX interface allows for a 32-bit block number and a 16-bit
> Darrick> application tag ... which is unfortunately small given 64-bit
> Darrick> block numbers and 32-bit inode numbers.
>
> I never understood the 32-bit ref tag. Seems silly to have a check that
> wraps at the exact boundary where problems are most likely to occur.
>
> I advocated for a DIF Type with 16-bit guard tag and 48-bit ref tag but
> that never went anywhere. Too bad - would have been easy for the storage
> vendors to implement.
>
>
> Darrick> As a side note, the crc-t10dif implementation is quite slow --
> Darrick> the hardware accelerated crc32c is 15x faster, and the sw
> Darrick> implementation is usually 3-6x faster. I suspect somebody will
> Darrick> want to fix that before DIF becomes more widespread...
>
> The CRC32C op on Nehalem and beyond is really, really fast. It's
> essentially free except for pulling the data through the cache. So it's
> not entirely fair to use that as baseline for a pure software
> implementation. What is the faster sw implementation are you referring
> to, btw.?
I have some benchmarking data for various crc algorithms here:
https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums#Benchmarking
The "faster sw implementation" that I was talking about is the slice-by-8
algorithm that I sent to the crypto list a few days ago that's based off of Bob
Pearson's slice-by-8 crc32 patch.
In the huge table, "crc32c-by8-le" is crc32c slice-by-8.
> lib/crc-t10dif is a regular 256-entry table-based CRC implementation. It
> is done pretty much like all our other software CRCs. I seem to recall
> attempting a bigger table but that yielded worse real life results due
> to cache pollution.
Yes, the only downside to the slice-by-8 method is that it eats 8K of data
cache for the table. Not a huge issue on recent Intel and POWER where the L1D
is 32K, but I imagine it could be painful elsewhere.
Do you know of any faster crc16 algorithms? I guess it wouldn't be hard to
make a family of crcs, each with different cache/speed characteristics.
> On Westmere and beyond it is possible to accelerate generic CRC
> calculation using the PCLMULQDQ operation. There are many of our CRC
> functions that could benefit from this. However, so far intel have not
> been willing to contribute the relevant code to Linux.
>
>
> Darrick> The good news is that if you're really worried about integrity,
> Darrick> metadata_csum and DIF/DIX aren't mutually exclusive features.
> Darrick> Rejecting corrupted write commands at write time seems like a
> Darrick> useful feature. :)
>
> Yup!
>
> --
> Martin K. Petersen Oracle Linux Engineering
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Sep 04, 2011 at 06:19:16PM -0400, Martin K. Petersen wrote:
> >>>>> "Andi" == Andi Kleen <[email protected]> writes:
>
> Andi> Doesn't have any performance numbers.
>
> It's been a while since I read them. I thought they had some compelling
> numbers. Anyway, made a big difference in real life testing here. For
> sustained I/O we're talking an order of magnitude.
>
>
> Andi> You need to keep in mind that PCLMULQDQ uses FPU state, so any
> Andi> speedup for the kernel must be large enough to amortize the cost
> Andi> of saving the FPU state.
>
> Yeah, my test cases were for bulk database I/O, not for writing a
> handful of fs metadata blocks. Plus for the DB tests the CRC was
> generated in userland.
>
> I seem to recall Joel picking something other than the hw-accelerated
> CRC32C for ocfs2 metadata and that didn't cause any problems.
Yes, he picked regular CRC32, which has a reasonably fast slice-by-4
software implementation. For ext4, my original choices were hw acceleration or
the slower single-byte lookup table. With hw acceleration the overhead of
adding the checksums is about ~10% (for just the metadata operations); with the
single-byte table it was about 50%; and with the proposed slice-by-8 patch it's
about 20%. Hopefully I can optimize this even more in the future.
> That said, I do see a difference between IP checksum and CRC on normal
> FS workloads with DIX enabled here.
I would hope so, since the IP checksum is much simpler than any CRC...
--D
> Andi> Typically that only works out for quite large buffers, but kernel
> Andi> buffers are relatively small.
>
> *nod*
>
> --
> Martin K. Petersen Oracle Linux Engineering
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 2011-09-05 at 11:22 -0700, Darrick J. Wong wrote:
> On Fri, Sep 02, 2011 at 03:27:21PM -0600, Andreas Dilger wrote:
> > On 2011-09-02, at 1:18 PM, Darrick J. Wong wrote:
> > > On Wed, Aug 31, 2011 at 10:49:05PM -0600, Andreas Dilger wrote:
> > >> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> > >>> Compute and verify the checksum of the inode bitmap; the checkum is stored in
> > >>> the block group descriptor.
> > >>
> > >> I would prefer if there was a 16-bit checksum for the (most common)
> > >> 32-byte group descriptors, and this was extended to a 32-bit checksum
> > >> for the (much less common) 64-byte+ group descriptors. For filesystems
> > >> that are newly formatted with the 64bit feature it makes no difference,
> > >> but virtually all ext3/4 filesystems have only the smaller group descriptors.
> > >>
> > >> Regardless of whether using half of the crc32c is better or worse than
> > >> using crc16 for the bitmap blocks, storing _any_ checksum is better than
> > >> storing nothing at all. I would propose the following:
> > >
> > > That's an interesting reframing of the argument that I hadn't considered.
> > > I'd fallen into the idea of needing crc32c because of its bit error
> > > guarantees (all corruptions of odd numbers of bits and all corruptions of
> > > fewer than ...4? bits) that I hadn't quite realized that even if crc16
> > > can't guarantee to find any corruption at all, it still _might_, and that's
> > > better than nothing.
> > >
> > > Ok, let's split the 32-bit fields and use crc16 for the case of 32-byte block
> > > group descriptors.
> >
> > I noticed the crc16 calculation is actually _slower_ than crc32c,
> > probably because the CPU cannot use 32-bit values when computing the
> > result, so it has to do a lot of word masking, per your table at
> > https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums.
> > Also, there is the question of whether computing two different
> > checksums is needlessly complicating the code, or if it is easier
> > to just compute crc32c all the time and only make the storing of
> > the high 16 bits conditional.
> >
> > What I'm suggesting is always computing the crc32c, but for filesystems
> > that are not formatted with the 64bit option just store the low 16 bits
> > of the crc32c value into bg_{block,inode}_bitmap_csum_lo. This is much
> > better than not computing a checksum here at all. The only open question
> > is whether 1/2 of crc32c is substantially worse at detecting errors than
> > crc16 or not?
>
> All the literature I've read has suggested that crc16 can't guarantee any error
> detection capability at all with data buffers longer than 256 bytes.
Um, so in a hashing algorithm that maps f:Z_m -> Z_n you can never
guarantee error detection if m>n because of hash collisions. All you
can guarantee is that if f(a) != f(b) then a != b, so crc16 wouldn't be
able to *guarantee* error detection in anything over 2 bytes.
All of the rest of the magic in hashing functions goes into making sure
that the collision sets don't include common errors (like bit flipping).
In theory, for the correct polynomial, CRC-16 should be able to detect
single, double and triple bit flip errors in blocks of up to 8191
bytes ... of course, if those aren't your common errors, then this
analysis is useless ...
James
On Mon, Sep 05, 2011 at 02:45:28PM -0500, James Bottomley wrote:
> On Mon, 2011-09-05 at 11:22 -0700, Darrick J. Wong wrote:
> > On Fri, Sep 02, 2011 at 03:27:21PM -0600, Andreas Dilger wrote:
> > > On 2011-09-02, at 1:18 PM, Darrick J. Wong wrote:
> > > > On Wed, Aug 31, 2011 at 10:49:05PM -0600, Andreas Dilger wrote:
> > > >> On 2011-08-31, at 6:31 PM, Darrick J. Wong wrote:
> > > >>> Compute and verify the checksum of the inode bitmap; the checkum is stored in
> > > >>> the block group descriptor.
> > > >>
> > > >> I would prefer if there was a 16-bit checksum for the (most common)
> > > >> 32-byte group descriptors, and this was extended to a 32-bit checksum
> > > >> for the (much less common) 64-byte+ group descriptors. For filesystems
> > > >> that are newly formatted with the 64bit feature it makes no difference,
> > > >> but virtually all ext3/4 filesystems have only the smaller group descriptors.
> > > >>
> > > >> Regardless of whether using half of the crc32c is better or worse than
> > > >> using crc16 for the bitmap blocks, storing _any_ checksum is better than
> > > >> storing nothing at all. I would propose the following:
> > > >
> > > > That's an interesting reframing of the argument that I hadn't considered.
> > > > I'd fallen into the idea of needing crc32c because of its bit error
> > > > guarantees (all corruptions of odd numbers of bits and all corruptions of
> > > > fewer than ...4? bits) that I hadn't quite realized that even if crc16
> > > > can't guarantee to find any corruption at all, it still _might_, and that's
> > > > better than nothing.
> > > >
> > > > Ok, let's split the 32-bit fields and use crc16 for the case of 32-byte block
> > > > group descriptors.
> > >
> > > I noticed the crc16 calculation is actually _slower_ than crc32c,
> > > probably because the CPU cannot use 32-bit values when computing the
> > > result, so it has to do a lot of word masking, per your table at
> > > https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums.
> > > Also, there is the question of whether computing two different
> > > checksums is needlessly complicating the code, or if it is easier
> > > to just compute crc32c all the time and only make the storing of
> > > the high 16 bits conditional.
> > >
> > > What I'm suggesting is always computing the crc32c, but for filesystems
> > > that are not formatted with the 64bit option just store the low 16 bits
> > > of the crc32c value into bg_{block,inode}_bitmap_csum_lo. This is much
> > > better than not computing a checksum here at all. The only open question
> > > is whether 1/2 of crc32c is substantially worse at detecting errors than
> > > crc16 or not?
> >
> > All the literature I've read has suggested that crc16 can't guarantee any error
> > detection capability at all with data buffers longer than 256 bytes.
>
> Um, so in a hashing algorithm that maps f:Z_m -> Z_n you can never
> guarantee error detection if m>n because of hash collisions. All you
> can guarantee is that if f(a) != f(b) then a != b, so crc16 wouldn't be
> able to *guarantee* error detection in anything over 2 bytes.
>
> All of the rest of the magic in hashing functions goes into making sure
> that the collision sets don't include common errors (like bit flipping).
> In theory, for the correct polynomial, CRC-16 should be able to detect
> single, double and triple bit flip errors in blocks of up to 8191
> bytes ... of course, if those aren't your common errors, then this
> analysis is useless ...
Sorry, I grossly misspoke. Of course crc16 can't guarantee the ability to
detect all possible errors in any data block larger than 16 bits. What I meant
to say is that I wasn't sure what is the maximum number of bit errors that
crc16 polynomials can detect given a message length of 32768+32+128 bits.
In particular, I remember reading on the wikipedia page[1] that for polynomials
with odd numbers of terms (such as ansi crc16), the period for 2-bit errors is
65535 bits as you say; but as I recall, those polynomials also can't detect all
errors involving odd numbers of bit flips. For polynomials with even numbers
of terms (such as the t10dif one) the period in which it can detect 2-bit
errors is 32767 bits, but on the other hand they can detect odd numbers of
errors.
The people who study error rates on disk hardware at IBM tell me that bit flips
are more common than you'd like, though I was also looking for something that
can tell me if blocks are being written to the wrong places.
[1] http://en.wikipedia.org/wiki/Mathematics_of_CRC#Bitfilters
--D
>
> James
>
>
>>>>> "Darrick" == Darrick J Wong <[email protected]> writes:
Darrick> I have some benchmarking data for various crc algorithms here:
Darrick> https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums#Benchmarking
I've been meaning to update my own benchmark results from a few years
ago but your table is much more comprehensive. Nice work!
Darrick> Yes, the only downside to the slice-by-8 method is that it eats
Darrick> 8K of data cache for the table. Not a huge issue on recent
Darrick> Intel and POWER where the L1D is 32K, but I imagine it could be
Darrick> painful elsewhere.
I'll see if I can come up with something better for the DIF CRC. It's
always calculated over either 512 or 4096-byte buffers.
--
Martin K. Petersen Oracle Linux Engineering