2011-08-17 03:25:19

by djwong

[permalink] [raw]
Subject: [RFC] ext4 metadata checksumming design

Hi all,

I've created a page on the ext4 wiki outlining the patchset that I'm working on
to add metadata checksumming to ext4. The page can be found at this address:
https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums

For the most part, the metadata objects in ext4 actually have enough space to
squeeze in a 32-bit checksum; it was trivially easy to find a spot in the
superblock, the extent tree, extended attribute blocks, and the inode. Those
pieces are already done and in my tree, but the patchset as a whole is being
held up by the second class of metadata objects.

That second class of objects are the ones that required a bit of work:

- Directory blocks have an "unused" 12-byte directory entry at the very end of
the block; 8 bytes of header are followed by a 32-bit checksum. This can be
taken care of as part of directory rebuilding in e2fsck/rehash.c.

- HTree blocks had to have the dx_entry limit reduced by 1 to accomodate a
checksum. This is also taken care of during e2fsck directory rebuild.

- Extended attribute blocks that are stored in the inode table -- the h_magic
field is written by the kernel, but neither the kernel nor e2fsprogs ever
actually read this field. The field could be reused to checksum the extra
space since (as far as I can tell) EAs are the only user of that empty space.

Other miscellany:

- e2fsprogs had to be converted to always work with ext2_inode_large.

- Various bugs in the htree code....

I hope to have a first draft of the kernel/e2fsprogs patches out on the mailing
list in a week or two, or at least before LPC next month. Still on my todo
list is superblocks, EAs, changing the jbd2 checksum, and rigorous testing on
powerpc.

Please have a look at the design document and please feel free to suggest any
changes.

--D


2011-08-17 13:57:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On Tue, Aug 16, 2011 at 08:25:19PM -0700, Darrick J. Wong wrote:
> Hi all,
>
> I've created a page on the ext4 wiki outlining the patchset that I'm working on
> to add metadata checksumming to ext4. The page can be found at this address:
> https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums

Can you summarize the differences to my earlier patchkit?

-Andi

2011-08-17 17:09:16

by djwong

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On Wed, Aug 17, 2011 at 03:57:19PM +0200, Andi Kleen wrote:
> On Tue, Aug 16, 2011 at 08:25:19PM -0700, Darrick J. Wong wrote:
> > Hi all,
> >
> > I've created a page on the ext4 wiki outlining the patchset that I'm working on
> > to add metadata checksumming to ext4. The page can be found at this address:
> > https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
>
> Can you summarize the differences to my earlier patchkit?

My new patchset intends to expand on your earlier inode/superblock patches by
adding checksums to nearly all metadata objects (directory blocks, htree,
extent tree, block/inode bitmaps, extended attributes). It also adds code to
e2fsprogs to verify the checksums, and make the necessary adjustments to the
on-disk format to add space for checksums.

--D
>
> -Andi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-08-18 06:15:25

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On 2011-08-16, at 9:25 PM, "Darrick J. Wong" <[email protected]> wrote:
> - Extended attribute blocks that are stored in the inode table -- the h_magic
> field is written by the kernel, but neither the kernel nor e2fsprogs ever
> actually read this field. The field could be reused to checksum the extra
> space since (as far as I can tell) EAs are the only user of that empty space.

I haven't had a chance to read the document you wrote, but wanted to comment on xattrs. There is a hash field for each xattr (including internal xattrs), and one for the external xattr blocks that can be used to validate the xattr value.

In addition to the hash for the in-inode xattrs, the inode hash itself would serve to validate the xattr values.

I have a patch for e2fsprogs that checks the xattr hash for in-inode xattrs (currently it is always 0).

> Please have a look at the design document and please feel free to suggest any
> changes.

Hopefully soon.

Cheers, Andreas

2011-08-18 18:14:19

by djwong

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On Thu, Aug 18, 2011 at 12:16:00AM -0600, Andreas Dilger wrote:
> On 2011-08-16, at 9:25 PM, "Darrick J. Wong" <[email protected]> wrote:
> > - Extended attribute blocks that are stored in the inode table -- the h_magic
> > field is written by the kernel, but neither the kernel nor e2fsprogs ever
> > actually read this field. The field could be reused to checksum the extra
> > space since (as far as I can tell) EAs are the only user of that empty space.
>
> I haven't had a chance to read the document you wrote, but wanted to comment
> on xattrs. There is a hash field for each xattr (including internal xattrs),
> and one for the external xattr blocks that can be used to validate the xattr
> value.
>
> In addition to the hash for the in-inode xattrs, the inode hash itself would
> serve to validate the xattr values.
>
> I have a patch for e2fsprogs that checks the xattr hash for in-inode xattrs
> (currently it is always 0).

I surveyed the h_hash/e_hash calculation code; it only covers the name and
value fields. Do we care about checksum protection for the extra fields in
struct ext4_xattr_header and struct ext4_xattr_entry? I think it would be
useful to be able to check the sanity of h_refcount and h_blocks. Possibly
that extends to e_value_* as well, though the hash probably covers it. Also,
there's no hardware acceleration available for the xattr hash, though I doubt
xattrs are especially performance sensitive.

--D
>
> > Please have a look at the design document and please feel free to suggest any
> > changes.
>
> Hopefully soon.
>
> Cheers, Andreas--
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2011-08-18 21:53:16

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On 2011-08-18, at 12:14 PM, Darrick J. Wong wrote:
> On Thu, Aug 18, 2011 at 12:16:00AM -0600, Andreas Dilger wrote:
>> On 2011-08-16, at 9:25 PM, "Darrick J. Wong" <[email protected]> wrote:
>>> - Extended attribute blocks that are stored in the inode table -- the h_magic
>>> field is written by the kernel, but neither the kernel nor e2fsprogs ever
>>> actually read this field. The field could be reused to checksum the extra
>>> space since (as far as I can tell) EAs are the only user of that empty space.
>>
>> I haven't had a chance to read the document you wrote, but wanted to comment
>> on xattrs. There is a hash field for each xattr (including internal xattrs),
>> and one for the external xattr blocks that can be used to validate the xattr
>> value.
>>
>> In addition to the hash for the in-inode xattrs, the inode hash itself would
>> serve to validate the xattr values.
>>
>> I have a patch for e2fsprogs that checks the xattr hash for in-inode xattrs
>> (currently it is always 0).
>
> I surveyed the h_hash/e_hash calculation code; it only covers the name and
> value fields. Do we care about checksum protection for the extra fields in
> struct ext4_xattr_header and struct ext4_xattr_entry?

For entries in the inode, the ext4_xattr_entry should itself already be
covered by the inode hash.

For entries in an external EA block, the ext4_xattr_header h_hash only
contains the hash of individual e_hash values, themselves covering the
e_name and e_value parts of the structure.

This is by design since the h_hash is used to determine whether an xattr
block can be shared among inodes (mbcache, which I'm not a fan of, but
that is besides the point) so it should depend only on the content and
not the actual layout (e.g. cannot contain e_value_offs, e_value_block,
h_refcount, h_blocks).


I'm totally against using the h_magic field for a checksum, since it would
break our ability to change the structure of xattrs in the future. Instead,
it would be possible to use a new h_magic (e.g. EA03) that redefines the hash
function to be what we want it to be (e.g. e_hash is CRC32c of everything
except e_value_offs and e_hash, and h_hash is CRC32c of all e_hash fields),
without making the code significantly more complex.

That would firstly give us a better hash function for the entries, and
secondly add in the e_name_len, e_name_index, and e_value_size values into
e_hash (and subsequently into h_hash).

We could also/instead use h_reserved[0] for a hash of ext4_xattr_header
and the missing e_* fields to have a stronger validation of the entire block,
without breaking the ability to use h_hash as a hash of the content, and
without doubling the CPU cost of doing the checksum.

Alternately, if we want to spend the extra CPU cycles, it would be possible
to just use h_reserved[0] for a CRC32c (or other) checksum of the whole
block.

Using only h_reserved[0] instead of changing h_magic would not require
changing the h_magic at all (which would break read-only compatibility).

> I think it would be
> useful to be able to check the sanity of h_refcount and h_blocks. Possibly
> that extends to e_value_* as well, though the hash probably covers it. Also,
> there's no hardware acceleration available for the xattr hash, though I doubt
> xattrs are especially performance sensitive.
>
> --D
>>
>>> Please have a look at the design document and please feel free to suggest any
>>> changes.
>>
>> Hopefully soon.
>>
>> Cheers, Andreas--
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






2011-08-18 23:00:43

by djwong

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On Thu, Aug 18, 2011 at 03:53:11PM -0600, Andreas Dilger wrote:
> On 2011-08-18, at 12:14 PM, Darrick J. Wong wrote:
> > On Thu, Aug 18, 2011 at 12:16:00AM -0600, Andreas Dilger wrote:
> >> On 2011-08-16, at 9:25 PM, "Darrick J. Wong" <[email protected]> wrote:
> >>> - Extended attribute blocks that are stored in the inode table -- the h_magic
> >>> field is written by the kernel, but neither the kernel nor e2fsprogs ever
> >>> actually read this field. The field could be reused to checksum the extra
> >>> space since (as far as I can tell) EAs are the only user of that empty space.
> >>
> >> I haven't had a chance to read the document you wrote, but wanted to comment
> >> on xattrs. There is a hash field for each xattr (including internal xattrs),
> >> and one for the external xattr blocks that can be used to validate the xattr
> >> value.
> >>
> >> In addition to the hash for the in-inode xattrs, the inode hash itself would
> >> serve to validate the xattr values.
> >>
> >> I have a patch for e2fsprogs that checks the xattr hash for in-inode xattrs
> >> (currently it is always 0).
> >
> > I surveyed the h_hash/e_hash calculation code; it only covers the name and
> > value fields. Do we care about checksum protection for the extra fields in
> > struct ext4_xattr_header and struct ext4_xattr_entry?
>
> For entries in the inode, the ext4_xattr_entry should itself already be
> covered by the inode hash.

Based on your comments and other discoveries I've made today, I'm leaning
towards extending the inode crc to cover the extra space instead of dealing
with a new EA magic. I think the kernel is trivially ok with that; e2fsprogs,
however, seems to contain some assumptions about memory allocation and IO sizes
(it thinks it can get away with having on-stack ext2_inode variables) but since
I'm already tearing up a lot of that code, I can make it deal with a few more
bytes, say up to s_inode_size.

> For entries in an external EA block, the ext4_xattr_header h_hash only
> contains the hash of individual e_hash values, themselves covering the
> e_name and e_value parts of the structure.
>
> This is by design since the h_hash is used to determine whether an xattr
> block can be shared among inodes (mbcache, which I'm not a fan of, but
> that is besides the point) so it should depend only on the content and
> not the actual layout (e.g. cannot contain e_value_offs, e_value_block,
> h_refcount, h_blocks).
>
>
> I'm totally against using the h_magic field for a checksum, since it would
> break our ability to change the structure of xattrs in the future. Instead,
> it would be possible to use a new h_magic (e.g. EA03) that redefines the hash
> function to be what we want it to be (e.g. e_hash is CRC32c of everything
> except e_value_offs and e_hash, and h_hash is CRC32c of all e_hash fields),
> without making the code significantly more complex.

Sorry, I misread the code and thought the inode EA h_magic wasn't ever being
checked. I looked again today, and it does, so I'm against (ab)using h_magic
also. I'll just figure out how to make the inode checksum cover the extra
space. For the separate EA blocks, I'll just use h_reserved[0]. It's probably
easier just to crc32c the whole block, though if it's a performance problem I
can always change it before it goes upstream. The extra code complexity of
finding the not-covered-by-hash fields probably isn't worthwhile on an
attribute block.

> That would firstly give us a better hash function for the entries, and
> secondly add in the e_name_len, e_name_index, and e_value_size values into
> e_hash (and subsequently into h_hash).
>
> We could also/instead use h_reserved[0] for a hash of ext4_xattr_header
> and the missing e_* fields to have a stronger validation of the entire block,
> without breaking the ability to use h_hash as a hash of the content, and
> without doubling the CPU cost of doing the checksum.
>
> Alternately, if we want to spend the extra CPU cycles, it would be possible
> to just use h_reserved[0] for a CRC32c (or other) checksum of the whole
> block.
>
> Using only h_reserved[0] instead of changing h_magic would not require
> changing the h_magic at all (which would break read-only compatibility).

Thank you for the comments!

--D

2011-08-19 17:46:56

by Coly Li

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On 2011年08月17日 11:25, Darrick J. Wong Wrote:
> Hi all,
>
> I've created a page on the ext4 wiki outlining the patchset that I'm working on
> to add metadata checksumming to ext4. The page can be found at this address:
> https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
>

Hi Darrick,

I just go through the proposal, have on objection for most of the text. Only some things want to confirm,
1) If a metadata_csum enabled file system is metadata_csum disabled, it should be better to mark the block group or
inode whether the existing (disabled) checksum is valid or not. So if people re-enable metadata_csum, we can save quite
a lot of time to re-build check sums for all metadata objects.
2) In no-journal mode, every time when we modify the metadata objects, we may have to hold a lock, calculate the check
sum, and release the lock, which may introduce performance regression. I hope this is only my unnecessary over worry.

BTW, an engineer in Taobao kernel team, is trying to count different meta data objects I/O in run time now. One of the
first efforts, is trying to unify a set of routines to read or dirty meta data object blocks. A.k.a something (might)
like ext4_read_ext_block(), ext4_read_idx_block(), ... etc. Then the counting routines can be added inside the meta data
object blocks I/O routines. So far, it seems the modification is not trivial, needs more study on the code. Anyway,
since you mentioned on the wiki page, just let you know what we are doing now :-)

P.S. The idea of meta data I/O counting is to help us understanding the I/O characteristic of our online servers running
Ext4 file systems, which is the basic material for further I/O performance optimization.

Thanks.
--
Coly Li

2011-08-22 18:11:27

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On 2011-08-16, at 9:25 PM, Darrick J. Wong wrote:
> I've created a page on the ext4 wiki outlining the patchset that I'm working on
> to add metadata checksumming to ext4. The page can be found at this address:
> https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums

Darrick,
I just had a look though this document, and it looks pretty good. It does
need to be updated to reflect that the inode checksum now covers the full
inode size, which is already mentioned in the "Extended Attributes" section.

> For the most part, the metadata objects in ext4 actually have enough space to
> squeeze in a 32-bit checksum; it was trivially easy to find a spot in the
> superblock, the extent tree, extended attribute blocks, and the inode. Those
> pieces are already done and in my tree, but the patchset as a whole is being
> held up by the second class of metadata objects.

For the group descriptor checksum and inode/block bitmap checksums with
32-byte group descriptors it makes sense to truncate the CRC32c checksum
and store the low bits of the checksum in the existing 16-bit fields, and
the high bits in extended 16-bit fields.

As a follow on, it probably also makes sense to test with a < 2^32 block
filesystem with a 64-byte group descriptor. That would give enough room
for 32-bit checksums even on smaller filesystems, and would also help
facilitate resizing filesystems from < 2^32 blocks to > 2^32 blocks in
the future. That _may_ just be as easy as formatting with "-O 64bit"
on a < 2^32 block filesystem, but I don't know how much that has been
tested.

> That second class of objects are the ones that required a bit of work:
>
> - Directory blocks have an "unused" 12-byte directory entry at the very end of
> the block; 8 bytes of header are followed by a 32-bit checksum. This can be
> taken care of as part of directory rebuilding in e2fsck/rehash.c.
>
> - HTree blocks had to have the dx_entry limit reduced by 1 to accomodate a
> checksum. This is also taken care of during e2fsck directory rebuild.
>
> - Extended attribute blocks that are stored in the inode table -- the h_magic
> field is written by the kernel, but neither the kernel nor e2fsprogs ever
> actually read this field. The field could be reused to checksum the extra
> space since (as far as I can tell) EAs are the only user of that empty space.
>
> Other miscellany:
>
> - e2fsprogs had to be converted to always work with ext2_inode_large.
>
> - Various bugs in the htree code....
>
> I hope to have a first draft of the kernel/e2fsprogs patches out on the mailing
> list in a week or two, or at least before LPC next month. Still on my todo
> list is superblocks, EAs, changing the jbd2 checksum, and rigorous testing on
> powerpc.
>
> Please have a look at the design document and please feel free to suggest any
> changes.
>
> --D


2011-08-23 02:35:21

by djwong

[permalink] [raw]
Subject: Re: [RFC] ext4 metadata checksumming design

On Mon, Aug 22, 2011 at 12:11:25PM -0600, Andreas Dilger wrote:
> On 2011-08-16, at 9:25 PM, Darrick J. Wong wrote:
> > I've created a page on the ext4 wiki outlining the patchset that I'm working on
> > to add metadata checksumming to ext4. The page can be found at this address:
> > https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
>
> Darrick,
> I just had a look though this document, and it looks pretty good. It does
> need to be updated to reflect that the inode checksum now covers the full
> inode size, which is already mentioned in the "Extended Attributes" section.

Updated; thank you.

> > For the most part, the metadata objects in ext4 actually have enough space to
> > squeeze in a 32-bit checksum; it was trivially easy to find a spot in the
> > superblock, the extent tree, extended attribute blocks, and the inode. Those
> > pieces are already done and in my tree, but the patchset as a whole is being
> > held up by the second class of metadata objects.
>
> For the group descriptor checksum and inode/block bitmap checksums with
> 32-byte group descriptors it makes sense to truncate the CRC32c checksum
> and store the low bits of the checksum in the existing 16-bit fields, and
> the high bits in extended 16-bit fields.

One thing I haven't had the time to do yet is run that monte carlo simulation
that Ted suggested to find out how painful it is to cut off half of a crc32.
Do you know of anyone who has? (Or for that matter knows anything about my
half-baked idea to crc16(crc32(bitmap))?)

> As a follow on, it probably also makes sense to test with a < 2^32 block
> filesystem with a 64-byte group descriptor. That would give enough room
> for 32-bit checksums even on smaller filesystems, and would also help
> facilitate resizing filesystems from < 2^32 blocks to > 2^32 blocks in
> the future. That _may_ just be as easy as formatting with "-O 64bit"
> on a < 2^32 block filesystem, but I don't know how much that has been
> tested.

I've been testing it. I haven't seen any problems _so_ far.... :)

Thank you for the review!

--D
>
> > That second class of objects are the ones that required a bit of work:
> >
> > - Directory blocks have an "unused" 12-byte directory entry at the very end of
> > the block; 8 bytes of header are followed by a 32-bit checksum. This can be
> > taken care of as part of directory rebuilding in e2fsck/rehash.c.
> >
> > - HTree blocks had to have the dx_entry limit reduced by 1 to accomodate a
> > checksum. This is also taken care of during e2fsck directory rebuild.
> >
> > - Extended attribute blocks that are stored in the inode table -- the h_magic
> > field is written by the kernel, but neither the kernel nor e2fsprogs ever
> > actually read this field. The field could be reused to checksum the extra
> > space since (as far as I can tell) EAs are the only user of that empty space.
> >
> > Other miscellany:
> >
> > - e2fsprogs had to be converted to always work with ext2_inode_large.
> >
> > - Various bugs in the htree code....
> >
> > I hope to have a first draft of the kernel/e2fsprogs patches out on the mailing
> > list in a week or two, or at least before LPC next month. Still on my todo
> > list is superblocks, EAs, changing the jbd2 checksum, and rigorous testing on
> > powerpc.
> >
> > Please have a look at the design document and please feel free to suggest any
> > changes.
> >
> > --D
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html