From: "Darrick J. Wong" Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*) Date: Wed, 13 Aug 2014 18:56:38 -0700 Message-ID: <20140814015638.GL2808@birch.djwong.org> References: <20140811071025.GE19223@birch.djwong.org> <20140811192529.GG2808@birch.djwong.org> <20140812170826.GH2808@birch.djwong.org> <5DA9CA9E-9DAE-4279-A706-9D895FA0CF70@dilger.ca> <20140813225358.GB11105@birch.djwong.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: TR Reardon , "linux-ext4@vger.kernel.org" To: Andreas Dilger Return-path: Received: from aserp1040.oracle.com ([141.146.126.69]:22697 "EHLO aserp1040.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753082AbaHNB4o (ORCPT ); Wed, 13 Aug 2014 21:56:44 -0400 Content-Disposition: inline In-Reply-To: <20140813225358.GB11105@birch.djwong.org> Sender: linux-ext4-owner@vger.kernel.org List-ID: On Wed, Aug 13, 2014 at 03:53:58PM -0700, Darrick J. Wong wrote: > On Wed, Aug 13, 2014 at 11:35:50PM +0200, Andreas Dilger wrote: > > Doesn't a larger journal_tag_bytes() size mean more overhead for > > the journal? > > Yes. If we move to a 16-byte block tag: > > typedef struct journal_block_tag3_s > { > __u32 t_blocknr; /* The on-disk block number */ > __u32 t_flags; /* See below */ > __u32 t_blocknr_high; /* most-significant high 32bits. */ > __u32 t_checksum; /* crc32c(uuid+seq+block) */ > } journal_block_tag3_t; > > Notice that this fixes the alignment problem due to the v2 bug -- tag 2 starts > with a 4-byte field at offset 14. > > For a 32-bit FS with 4K blocks and a 32768 block journal, I found through > experimentation that if I fill the journal with nothing but journal blocks > (i.e. no revoke blocks), overhead increases by 264 blocks, or 0.8%. For a > 64-bit FS with the same parameters, the increase is 136 blocks, or 0.4%. > > I'm hoping that a <1% increase won't discourage people. :) Nuts, those figures are for 1k block filesystems. For a FS with 4k blocks, storing 16384 blocks in the journal required this many blocks: 16450 for 64bit journal_csum_v3 16450 for 32bit journal_csum_v3 (assuming I reuse the tag3 structure as-is) 16442 for 64bit journal_csum_v2 16426 for 32bit journal_csum_v2 16434 for 64bit journal_csum_v1 16418 for 32bit journal_csum_v1 16434 for 64bit none 16418 for 32bit none So for 32-bit FSes, the extra overhead to go from no journal checksumming at all to v3 checksums is 0.2%. For 64-bit FSes, the extra overhead is 0.1%. The overhead to move a 32-bit v2 FS to a v3 FS is 0.1%, and for a 64-bit FS it's 0.05%. Also, for those watching at home, I fixed the infinite loop on journal block checksum failure that someone complained about months ago. --D > > --D > > > > Cheers, Andreas > > > > > On Aug 12, 2014, at 19:08, "Darrick J. Wong" wrote: > > > > > >> On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote: > > >>> On 8/11/14, Darrick J. Wong wrote: > > >>> Hi all, > > >>> > > >>> Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if the > > >>> the > > >>> "journal csum v2" feature flag is turned on, block tag records are being > > >>> written with two extra bytes of space because we don't need to execute > > >>> "x += sizeof(tag.t_checksum);". In 32-bit mode, other parts of the journal > > >>> then perform incorrect size comparisons, leading to BUG() being called. In > > >>> 64-bit mode, there's enough padding that bad things won't happen. > > >>> > > >>> This is a remnant of the days when I tried to enlarge journal_block_tag_t > > >>> to > > >>> hold the full 32-bit checksum for a data block that's stored in the > > >>> journal. > > >>> Back in 2011, we decided (though sadly I can't find the link; I think we > > >>> might > > >>> have discussed this in the concall) that it was better not to change the > > >>> size > > >>> of journal_block_tag_t than it was to make it bigger so that it could hold > > >>> the > > >>> full checksum. > > >>> > > >>> A simple fix for the problem has been proposed by Mr. Reardon which fixes > > >>> journal_tag_bytes() and leaves everything else unchanged. However, that is > > >>> technically a disk format change since the size of journal_block_tag_t on > > >>> disk > > >>> changes, albeit only for people running experimental metadata_csum > > >>> filesystems. > > >>> Since we've been allocating disk space for the enlarged checksum this whole > > >>> time, if we apply that patch, anyone with an unclean 64bit FS will not be > > >>> able > > >>> to recover the journal. (Anyone with an unclean 32-bit FS has been broken > > >>> the > > >>> whole time, and still will be.) > > >>> > > >>> The other thing we could do is actually use the two bytes to store the high > > >>> 16-bits of the checksum, fix the jbd2 helper functions to reflect that > > >>> reality > > >>> (so that they don't BUG()), and change the checksum verify routine to pass > > >>> the > > >>> block if either the full checksum matches, or if the lower 16 bits match > > >>> and > > >>> the upper 16 bits are zero. With this route, anybody with an uncleanly > > >>> unmounted FS could still recover the journal, since we're not changing the > > >>> size > > >>> of anything. > > >>> > > >>> Fortunately, journal tag blocks are fairly ephemeral and nobody ought to be > > >>> using metadata_csum on a production filesystem, so at this point we can > > >>> probably change the disk format without too many repercussions. If you > > >>> make > > >>> sure to cleanly unmount the filesystem when changing kernel/e2fsprogs > > >>> versions, > > >>> there will be no problems. > > >>> > > >>> So, uh... comments? How should we proceed? > > >>> > > >>> --D > > >> > > >> My only concern is that legacy applies to in-the-wild kernels, not > > >> just journals. Any post 3.4 kernel has this problem, which will be > > >> exposed as soon as e2fsprogs 1.43 is released. A common (enough) use > > >> case might be, say, a 2TB USB drive, being unplugged and replugged > > >> across machines without proper shutdown. Old machines will think > > >> they recognize the journal but don't actually, and replay something > > >> destructive. > > >> > > >> In other words, this is a future-retro problem. The problem is not so > > >> much with existing fs experimenting with metadata_csum, but a future > > >> properly-journaled drive being plugged into an legacy faulty machine. > > >> Those incompat flags are supposed to protect against just this > > >> scenario, but won't. So perhaps you'll need make a new > > >> "INCOMPAT_CSUM_V3" flag to protect? Actually, dwelling on this a bit > > >> today, I think whether or not you retain those 2 extra checksum bytes > > >> in final fix, you ought use a new flag for the format. > > > > > > Definitely we need a new bit to keep the old kernels out; I wanted others to > > > weigh in on whether we ought to shrink the struct or put the two bytes to use. > > > > > > That said... journal recovery is not so well tested, so I'm currently busy > > > building out debugfs to create journal transactions so we can test recovery > > > both with and without *_csum. > > > > > > --D > > >> > > >> +Reardon > > >> -- > > >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > > >> the body of a message to majordomo@vger.kernel.org > > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > > > the body of a message to majordomo@vger.kernel.org > > > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html