From: "Darrick J. Wong" <darrick.wong@oracle.com>
Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal
 recovery problems with metadata_csum, *non-64bit*)
Date: Wed, 13 Aug 2014 18:56:38 -0700
Message-ID: <20140814015638.GL2808@birch.djwong.org>
References: <BLU437-SMTP3407533C3883653626BD3DFDEC0@phx.gbl>
 <20140811071025.GE19223@birch.djwong.org>
 <20140811192529.GG2808@birch.djwong.org>
 <BLU436-SMTP123C75E96D9A42B3CA66A12FDEA0@phx.gbl>
 <20140812170826.GH2808@birch.djwong.org>
 <5DA9CA9E-9DAE-4279-A706-9D895FA0CF70@dilger.ca>
 <20140813225358.GB11105@birch.djwong.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: TR Reardon <thomas_reardon@hotmail.com>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>
To: Andreas Dilger <adilger@dilger.ca>
Content-Disposition: inline
In-Reply-To: <20140813225358.GB11105@birch.djwong.org>
Sender: linux-ext4-owner@vger.kernel.org

On Wed, Aug 13, 2014 at 03:53:58PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 13, 2014 at 11:35:50PM +0200, Andreas Dilger wrote:
> > Doesn't a larger journal_tag_bytes() size mean more overhead for
> > the journal?
> 
> Yes.  If we move to a 16-byte block tag:
> 
> typedef struct journal_block_tag3_s
> {
> 	__u32		t_blocknr;	/* The on-disk block number */
> 	__u32		t_flags;	/* See below */
> 	__u32		t_blocknr_high; /* most-significant high 32bits. */
> 	__u32		t_checksum;	/* crc32c(uuid+seq+block) */
> } journal_block_tag3_t;
> 
> Notice that this fixes the alignment problem due to the v2 bug -- tag 2 starts
> with a 4-byte field at offset 14.
> 
> For a 32-bit FS with 4K blocks and a 32768 block journal, I found through
> experimentation that if I fill the journal with nothing but journal blocks
> (i.e. no revoke blocks), overhead increases by 264 blocks, or 0.8%.  For a
> 64-bit FS with the same parameters, the increase is 136 blocks, or 0.4%.
> 
> I'm hoping that a <1% increase won't discourage people. :)

Nuts, those figures are for 1k block filesystems.  For a FS with 4k blocks,
storing 16384 blocks in the journal required this many blocks:

16450 for 64bit journal_csum_v3
16450 for 32bit journal_csum_v3 (assuming I reuse the tag3 structure as-is)

16442 for 64bit journal_csum_v2
16426 for 32bit journal_csum_v2

16434 for 64bit journal_csum_v1
16418 for 32bit journal_csum_v1

16434 for 64bit none
16418 for 32bit none

So for 32-bit FSes, the extra overhead to go from no journal checksumming at
all to v3 checksums is 0.2%.  For 64-bit FSes, the extra overhead is 0.1%.

The overhead to move a 32-bit v2 FS to a v3 FS is 0.1%, and for a 64-bit FS
it's 0.05%.

Also, for those watching at home, I fixed the infinite loop on journal block
checksum failure that someone complained about months ago.

--D

> 
> --D
> > 
> > Cheers, Andreas
> > 
> > > On Aug 12, 2014, at 19:08, "Darrick J. Wong" <darrick.wong@oracle.com> wrote:
> > > 
> > >> On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
> > >>> On 8/11/14, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > >>> Hi all,
> > >>> 
> > >>> Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if the
> > >>> the
> > >>> "journal csum v2" feature flag is turned on, block tag records are being
> > >>> written with two extra bytes of space because we don't need to execute
> > >>> "x += sizeof(tag.t_checksum);".  In 32-bit mode, other parts of the journal
> > >>> then perform incorrect size comparisons, leading to BUG() being called.  In
> > >>> 64-bit mode, there's enough padding that bad things won't happen.
> > >>> 
> > >>> This is a remnant of the days when I tried to enlarge journal_block_tag_t
> > >>> to
> > >>> hold the full 32-bit checksum for a data block that's stored in the
> > >>> journal.
> > >>> Back in 2011, we decided (though sadly I can't find the link; I think we
> > >>> might
> > >>> have discussed this in the concall) that it was better not to change the
> > >>> size
> > >>> of journal_block_tag_t than it was to make it bigger so that it could hold
> > >>> the
> > >>> full checksum.
> > >>> 
> > >>> A simple fix for the problem has been proposed by Mr. Reardon which fixes
> > >>> journal_tag_bytes() and leaves everything else unchanged.  However, that is
> > >>> technically a disk format change since the size of journal_block_tag_t on
> > >>> disk
> > >>> changes, albeit only for people running experimental metadata_csum
> > >>> filesystems.
> > >>> Since we've been allocating disk space for the enlarged checksum this whole
> > >>> time, if we apply that patch, anyone with an unclean 64bit FS will not be
> > >>> able
> > >>> to recover the journal.  (Anyone with an unclean 32-bit FS has been broken
> > >>> the
> > >>> whole time, and still will be.)
> > >>> 
> > >>> The other thing we could do is actually use the two bytes to store the high
> > >>> 16-bits of the checksum, fix the jbd2 helper functions to reflect that
> > >>> reality
> > >>> (so that they don't BUG()), and change the checksum verify routine to pass
> > >>> the
> > >>> block if either the full checksum matches, or if the lower 16 bits match
> > >>> and
> > >>> the upper 16 bits are zero.  With this route, anybody with an uncleanly
> > >>> unmounted FS could still recover the journal, since we're not changing the
> > >>> size
> > >>> of anything.
> > >>> 
> > >>> Fortunately, journal tag blocks are fairly ephemeral and nobody ought to be
> > >>> using metadata_csum on a production filesystem, so at this point we can
> > >>> probably change the disk format without too many repercussions.  If you
> > >>> make
> > >>> sure to cleanly unmount the filesystem when changing kernel/e2fsprogs
> > >>> versions,
> > >>> there will be no problems.
> > >>> 
> > >>> So, uh... comments?  How should we proceed?
> > >>> 
> > >>> --D
> > >> 
> > >> My only concern is that legacy applies to in-the-wild kernels, not
> > >> just journals.  Any post 3.4 kernel has this problem, which will be
> > >> exposed as soon as e2fsprogs 1.43 is released.  A common (enough) use
> > >> case might be, say, a 2TB USB drive, being unplugged and replugged
> > >> across machines  without proper shutdown.  Old machines will think
> > >> they recognize the journal but don't actually, and replay something
> > >> destructive.
> > >> 
> > >> In other words, this is a future-retro problem.  The problem is not so
> > >> much with existing fs experimenting with metadata_csum, but a future
> > >> properly-journaled drive being plugged into an legacy faulty machine.
> > >> Those incompat flags are supposed to protect against just this
> > >> scenario, but won't.  So perhaps you'll need make a new
> > >> "INCOMPAT_CSUM_V3" flag to protect?  Actually, dwelling on this a bit
> > >> today, I think whether or not you retain those 2 extra checksum bytes
> > >> in final fix, you ought use a new flag for the format.
> > > 
> > > Definitely we need a new bit to keep the old kernels out; I wanted others to
> > > weigh in on whether we ought to shrink the struct or put the two bytes to use.
> > > 
> > > That said... journal recovery is not so well tested, so I'm currently busy
> > > building out debugfs to create journal transactions so we can test recovery
> > > both with and without *_csum.
> > > 
> > > --D
> > >> 
> > >> +Reardon
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > >> the body of a message to majordomo@vger.kernel.org
> > >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html