2014-08-08 19:37:03

by TR Reardon

Subject: journal recovery problems with metadata_csum, *non-64bit*

Kernel 3.16, e2fsprogs 1.43-WIP (1.42.11 compiled with metadata_csum
support), filesystems mounted with journal_async_commit.

This may be an fsck problem, hard to tell. I consistently have
problems with journal recovery after hard reboot when filesystem has
metadata_csum. Interestingly, no problems with 64-bit filesystems.

During journal replay, bogus block numbers are reported. In this
case, a 128MB file was deleted on two filesystems where the only
difference is 64bit vs non-64bit. This also happens with or without
bigalloc, though I only document the bigalloc case here.

please see attached superblocks, dumped just prior to hard reboot.
disk8=sdm1, disk9=sdn1

dmesg:

[Fri Aug 8 15:16:28 2014] JBD2: Out of memory during recovery.
[Fri Aug 8 15:16:28 2014] JBD2: recovery failed
[Fri Aug 8 15:16:28 2014] EXT4-fs (sdm1): error loading journal
[Fri Aug 8 15:16:27 2014] EXT4-fs (sdn1): recovery complete
[Fri Aug 8 15:16:27 2014] EXT4-fs (sdn1): mounted filesystem with
ordered data mode. Opts: journal_async_commit


fsck:

#e2fsck -f -v /dev/sdm1
e2fsck 1.43-WIP (09-Jul-2014)
disk8: recovering journal
Error writing block 549755813991 (Invalid argument). Ignore error<y>? yes
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (11332304, counted=11365072).
Fix<y>? yes
Free inodes count wrong (177287, counted=177288).
Fix<y>? yes

disk8: ***** FILE SYSTEM WAS MODIFIED *****

1656 inodes used (0.93%, out of 178944)
260 non-contiguous files (15.7%)
0 non-contiguous directories (0.0%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 1280/363/5
721201312 blocks used (98.45%, out of 732566384)
0 bad blocks
358 large files

1267 regular files
380 directories
0 character device files
0 block device files
0 fifos
0 links
0 symbolic links (0 fast symbolic links)
0 sockets
------------
1647 files


Attachments:
disk8-dumpe2fs-postrm (2.02 kB)
disk9-dumpe2fs-postrm (2.03 kB)

2014-08-08 21:47:34

by Theodore Ts'o

Subject: Re: journal recovery problems with metadata_csum, *non-64bit*

On Fri, Aug 08, 2014 at 03:31:12PM -0400, TR Reardon wrote:
> Kernel 3.16, e2fsprogs 1.43-WIP (1.42.11 compiled with metadata_csum
> support), filesystems mounted with journal_async_commit.
>
> This may be an fsck problem, hard to tell. I consistently have
> problems with journal recovery after hard reboot when filesystem has
> metadata_csum. Interestingly, no problems with 64-bit filesystems.

These are test file systems, I hope. metadata_csum is something where
we're still working out some of the bugs. (Obviously.)

So metadata_csum works fine with 64-bit file systems, but it is
failing with journal recovery if the 64-bit file system feature is not
enabled? Is that a correct summary of what you are seeing?

Also, with e2fsprogs 1.43-WIP releases, it's important that you
specify the git commit id you are using, since the next branch in
particular is quite fast moving.

- Ted

2014-08-08 22:15:53

by Darrick J. Wong

Subject: Re: journal recovery problems with metadata_csum, *non-64bit*

On Fri, Aug 08, 2014 at 03:31:12PM -0400, TR Reardon wrote:
> Kernel 3.16, e2fsprogs 1.43-WIP (1.42.11 compiled with metadata_csum
> support), filesystems mounted with journal_async_commit.
>
> This may be an fsck problem, hard to tell. I consistently have
> problems with journal recovery after hard reboot when filesystem has
> metadata_csum. Interestingly, no problems with 64-bit filesystems.
>
> During journal replay, bogus block numbers are reported. In this
> case, a 128MB file was deleted on two filesystems where the only
> difference is 64bit vs non-64bit. This also happens with or without
> bigalloc, though I only document the bigalloc case here.

Could you post full logs, including the bogus block numbers, please? I've a
suspicion that we might be mishandling the 32/64 bit switch in either the
descriptor block or the revocation block, but it's hard to tell from the five
lines of log.

--D
>
> please see attached superblocks, dumped just prior to hard reboot.
> disk8=sdm1, disk9=sdn1
>
> dmesg:
>
> [Fri Aug 8 15:16:28 2014] JBD2: Out of memory during recovery.
> [Fri Aug 8 15:16:28 2014] JBD2: recovery failed
> [Fri Aug 8 15:16:28 2014] EXT4-fs (sdm1): error loading journal
> [Fri Aug 8 15:16:27 2014] EXT4-fs (sdn1): recovery complete
> [Fri Aug 8 15:16:27 2014] EXT4-fs (sdn1): mounted filesystem with
> ordered data mode. Opts: journal_async_commit
>
>
> fsck:
>
> #e2fsck -f -v /dev/sdm1
> e2fsck 1.43-WIP (09-Jul-2014)
> disk8: recovering journal
> Error writing block 549755813991 (Invalid argument). Ignore error<y>? yes
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong (11332304, counted=11365072).
> Fix<y>? yes
> Free inodes count wrong (177287, counted=177288).
> Fix<y>? yes
>
> disk8: ***** FILE SYSTEM WAS MODIFIED *****
>
> 1656 inodes used (0.93%, out of 178944)
> 260 non-contiguous files (15.7%)
> 0 non-contiguous directories (0.0%)
> # of inodes with ind/dind/tind blocks: 0/0/0
> Extent depth histogram: 1280/363/5
> 721201312 blocks used (98.45%, out of 732566384)
> 0 bad blocks
> 358 large files
>
> 1267 regular files
> 380 directories
> 0 character device files
> 0 block device files
> 0 fifos
> 0 links
> 0 symbolic links (0 fast symbolic links)
> 0 sockets
> ------------
> 1647 files




2014-08-08 22:30:24

by TR Reardon

Subject: Re: journal recovery problems with metadata_csum, *non-64bit*

On Fri, Aug 8, 2014 at 5:47 PM, Theodore Ts'o <[email protected]> wrote:
> So metadata_csum works fine with 64-bit file systems, but it is
> failing with journal recovery if the 64-bit file system feature is not
> enabled? Is that a correct summary of what you are seeing?

Correct.

> Also, with e2fsprogs 1.43-WIP releases, it's important that you
> specify the git commit id you are using, since the next branch in
> particular is quite fast moving.
>

Using master @ de25d9c8c48c7474828e9452184e204b18b8e090, which was the
1.42.11 release. The current master and next have the same problem.

I'd rather use the Debian versions, but metadata_csum is disabled
there. Debian does not sync with git master, so what exactly does it
sync with? I.e., how are the metadata_csum patches dropped, since
they exist in master?

2014-08-09 00:21:36

by Theodore Ts'o

Subject: Re: journal recovery problems with metadata_csum, *non-64bit*

On Fri, Aug 08, 2014 at 06:29:40PM -0400, TR Reardon wrote:
>
> Using master @ de25d9c8c48c7474828e9452184e204b18b8e090, which was the
> 1.42.11 release. The current master and next have same problem.

What's going on is that the "master" and "next" branches on the
e2fsprogs git tree are development branches which will eventually be
the 1.43.0 release. The "maint" branch is where the 1.42.x releases
are cut.

The maint branch does not have metadata_csum and inline_data support
at all. It's not a matter of it being disabled, that code is simply
not there.

So I'm not sure what you mean by "master @
de25d9c8c48c7474828e9452184e204b18b8e090". Your local git tree may
have assigned master to that git commit id, but then it wouldn't have
any metadata_csum support at all.


> I'd rather use debian versions, but metadata_csum is disabled there.
> debian does not sync with git master, so what exactly does it sync
> with? ie, how are the metadata_csum patches dropped, since they exist
> in master?

See above. The reason why debian doesn't support metadata_csum is
because it's not ready yet. If you'd like to help us debug it, we
will be very grateful, but please understand it's still a work in
progress.

There's a big difference between "it mostly works" and "it's ready
to be used in production". When I have time I might drop a
1.43-WIP ("work in progress") into Debian experimental, but Debian
Jessie is going to be targeting the 3.16 kernel, and we will need to
make sure that all of the necessary kernel bits for metadata_csum and
inline_data are working flawlessly before I'll want to let a 1.43
release into Debian testing --- and that's assuming we can get the
1.43 userspace bits fully stabilized as well before the Jessie freeze
date of November 5, 2014.

Cheers,

- Ted

2014-08-09 04:07:39

by Theodore Ts'o

Subject: Re: journal recovery problems with metadata_csum, *non-64bit*

On Fri, Aug 08, 2014 at 11:29:11PM -0400, TR Reardon wrote:
>
> I meant only: that commit is in master _and_ maint branches, so to get a
> version roughly = 1.42.11 + metadata_csum, I pulled up to that commit.

Right but you gave me the git commit id for 1.42.11, which is not
particularly useful; I know the commit id for v1.42.11 already. :-)

What I needed to know was the git commit of "master" that you had
grabbed, since the "master" and "next" branches are fairly active
these days. (The master branch lags behind the next branch, so
depending on how exciting a life you want to live, you can try the
next branch instead of the master branch.)

> Regardless, the problem exists in kernel too, judging by the JBD2 errors.

Yes, we'll need to look into that. Definitely worth knowing; thanks
for that.

Cheers,

- Ted

2014-08-10 22:36:17

by TR Reardon

Subject: Re: journal recovery problems with metadata_csum, *non-64bit*

Ok, I found the problem in jbd2, and have a solution, though it's
debatable what the ideal solution is. For now, the simplest patch is
below, though a similar patch in lib/ext2fs/kernel-jbd.h is required
to get e2fsck back in sync.

The original c3900875 commit adding metadata_csum (i.e.,
journal_checksum_v2) to jbd2 added 2 extra bytes for the block
checksums, in addition to re-allocating 2 bytes from the 4 bytes of
flags. However, a decision was made to retain only the lower 16 bits
of the crc32c, so those extra 2 bytes were unneeded. But those
2 extra bytes were never "deallocated" from journal_tag_bytes().

Unfortunately, some code relies on the JBD_TAG_SIZE32/64 constants
directly rather than on the journal_tag_bytes() utility function, in
particular the recovery code that is common to e2fsck and jbd2. This
led different tools to think they were looking at a 64-bit journal
when it was actually 32-bit. Code that relied on journal_tag_bytes()
remained safe, so the block iterators were fine, but any direct use of
those constants [including the hideous greater-than comparison in
read_tag_bytes()] went awry, and journal replay fails.

As far as I can tell, metadata_csum + journal checksum has never
worked for 32-bit filesystems. By a little bit of padding luck,
64-bit worked fine.

Now, as to the solution: it depends on whether one feels that existing
in-the-wild journals matter. The original commit was May 2012; are we
past the early adopters now? If this patch is taken, you shrink the
journal block tags to the intended size, but in-the-wild journals will
be broken. But they already are, so...? This opens up the
possibility of now using those extra 2 bytes and retaining the full
32-bit crc32c for the block tags. If going that route, debugfs/logdump
needs a fix in addition to the changes to jbd2.

FWIW, the "JBD2: Out of memory during recovery." error in
fs/jbd2/recovery.c was opaque at best and should be changed to always
include the block# that caused the problem.

+Reardon

---
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 67b8e30..dc27d09 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -2166,15 +2166,11 @@ int jbd2_journal_blocks_per_page(struct inode *inode)
size_t journal_tag_bytes(journal_t *journal)
{
journal_block_tag_t tag;
- size_t x = 0;

2014-08-11 07:10:33

by Darrick J. Wong

Subject: Re: journal recovery problems with metadata_csum, *non-64bit*

On Sun, Aug 10, 2014 at 06:35:33PM -0400, TR Reardon wrote:
> Ok, I found the problem in jbd2, and have a solution, though it's
> debatable what the ideal solution is. For now, the simplest patch is
> below, though a similar patch in lib/ext2fs/kernel-jbd.h is required
> to get e2fsck back in sync.
>
> The original c3900875 commit adding metadata_csum (ie
> journal_checksum_v2) to jbd2 added 2 extra bytes for the block
> checksums, in addition to re-allocating 2 bytes from the 4 bytes of
> flags. However, a decision was made to only retain the lower 16-bits
> of the crc32c, and thus those extra 2 bytes were unneeded. But those
> 2 extra bytes were never "deallocated" from journal_tag_bytes().

Hrmm... yes, I remember trying to push for full 32-bit checksums on journal
blocks, and our subsequent decision not to put in the two bytes. Oops.

(This looks more like a coding error on my part.)

I suppose it would help to be able to use debugfs or something to create
journal transactions just to see if they'll replay correctly in the
kernel/e2fsck. I've wondered for a while if the e2fsck jbd code ought to be
pushed into libext2fs (or libjbd2) to make this easier. A long ago fuse2fs
patchbomb actually did this so that fuse2fs could at least replay the journal.

> Unfortunately, different code relies on JBD_TAG_SIZE32/64 constants
> directly rather than the journal_tag_bytes() utility function, in
> particular the recovery code which is common to e2fsck and jbd2. This
> led different tools to think they were looking at a 64bit journal when
> actually it was 32bit. Code that relied on journal_tag_bytes()
> remained safe, so the block iterators were fine, but any direct use of
> those constants [including the hideous greater-than comparison in
> read_tag_bytes()] went awry, and journal replay will fail.

Hmm... I thought recovery.c sets tag_bytes to journal_tag_bytes()?

(It's late, I'll have another look in the morning.)

> As far as I can tell, metadata_csum + journal checksum has never
> worked for 32bit filesystems. By a little bit of padding luck, 64bit
> worked fine.
D'oh. :(

FWIW, 64bit is recommended for metadata_csum since it enables full 32-bit
bitmap checksums.

> Now, as to the solution: depends on whether one feels that existing
> in-the-wild journals matter. The original commit was May 2012, are we
> past early-adopters now? If this patch is taken, you shrink the

Definitely not past the early adopter stage. The e2fsprogs code will be in
1.43, which means that most people can't use metadata_csum yet.

So, thank you very much for helping us to smoke test. :)

> journal block tags to the intended size but in-the-wild journals will
> be broken. But they already are, so...? This opens up the
> possibility of now using those extra 2 bytes and retaining full 32-bit
> crc32c for the block tags. If going that route, debugfs/logdump needs
> a fix in addition to changes to jbd2.

In theory the only wild journals should be on test FSes anyway, so breaking
them isn't the end of the world.

But as you point out, the space is already getting used because journal_csum_v2
is the gate for the extra 2 bytes to be turned on, so I guess we could just use
the extra 2 bytes and store the full checksum.

> FWIW, the "JBD2: Out of memory during recovery." error in
> fs/jbd2/recovery.c was opaque at best and should be changed to always
> include the block# that caused the problem.

recovery.c line 611, correct?

--D
>
> +Reardon
>
> ---
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 67b8e30..dc27d09 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -2166,15 +2166,11 @@ int jbd2_journal_blocks_per_page(struct inode *inode)
> size_t journal_tag_bytes(journal_t *journal)
> {
> journal_block_tag_t tag;
> - size_t x = 0;
> -
> - if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_CSUM_V2))
> - x += sizeof(tag.t_checksum);
>
> if (JBD2_HAS_INCOMPAT_FEATURE(journal, JBD2_FEATURE_INCOMPAT_64BIT))
> - return x + JBD2_TAG_SIZE64;
> + return JBD2_TAG_SIZE64;
> else
> - return x + JBD2_TAG_SIZE32;
> + return JBD2_TAG_SIZE32;
> }
>
> /*

2014-08-11 19:25:39

by Darrick J. Wong

Subject: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)

Hi all,

Mr. Reardon has discovered that, due to a bug in journal_tag_bytes(), if the
"journal csum v2" feature flag is turned on, block tag records are being
written with two extra bytes of space, because the function still executes
"x += sizeof(tag.t_checksum);" even though those bytes are not needed. In
32-bit mode, other parts of the journal code then perform incorrect size
comparisons, leading to BUG() being called. In 64-bit mode, there's enough
padding that bad things won't happen.

This is a remnant of the days when I tried to enlarge journal_block_tag_t to
hold the full 32-bit checksum for a data block that's stored in the journal.
Back in 2011, we decided (though sadly I can't find the link; I think we might
have discussed this on the concall) that it was better to keep
journal_block_tag_t at its existing size than to enlarge it so that it could
hold the full checksum.

A simple fix for the problem has been proposed by Mr. Reardon which fixes
journal_tag_bytes() and leaves everything else unchanged. However, that is
technically a disk format change since the size of journal_block_tag_t on disk
changes, albeit only for people running experimental metadata_csum filesystems.
Since we've been allocating disk space for the enlarged checksum this whole
time, if we apply that patch, anyone with an unclean 64bit FS will not be able
to recover the journal. (Anyone with an unclean 32-bit FS has been broken the
whole time, and still will be.)

The other thing we could do is actually use the two bytes to store the high
16-bits of the checksum, fix the jbd2 helper functions to reflect that reality
(so that they don't BUG()), and change the checksum verify routine to pass the
block if either the full checksum matches, or if the lower 16 bits match and
the upper 16 bits are zero. With this route, anybody with an uncleanly
unmounted FS could still recover the journal, since we're not changing the size
of anything.

Fortunately, journal tag blocks are fairly ephemeral and nobody ought to be
using metadata_csum on a production filesystem, so at this point we can
probably change the disk format without too many repercussions. If you make
sure to cleanly unmount the filesystem when changing kernel/e2fsprogs versions,
there will be no problems.

So, uh... comments? How should we proceed?

--D


2014-08-12 05:39:39

by TR Reardon

[permalink] [raw]
Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)

On 8/11/14, Darrick J. Wong <[email protected]> wrote:
> [snip]
>
> So, uh... comments? How should we proceed?
>
> --D

My only concern is that legacy applies to in-the-wild kernels, not
just journals. Any post-3.4 kernel has this problem, which will be
exposed as soon as e2fsprogs 1.43 is released. A common (enough) use
case might be, say, a 2TB USB drive being unplugged and replugged
across machines without a proper shutdown. Old machines will think
they recognize the journal, but they don't, and will replay something
destructive.

In other words, this is a future-retro problem. The problem is not so
much an existing fs experimenting with metadata_csum as a future
properly-journaled drive being plugged into a legacy, faulty machine.
The incompat flags are supposed to protect against exactly this
scenario, but won't. So perhaps you'll need to make a new
"INCOMPAT_CSUM_V3" flag to protect against it? Actually, dwelling on
this a bit today, I think that whether or not you retain those 2 extra
checksum bytes in the final fix, you ought to use a new flag for the format.

+Reardon

2014-08-12 17:08:32

by Darrick J. Wong

Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)

On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
> On 8/11/14, Darrick J. Wong <[email protected]> wrote:
> > [snip]
> >
> > So, uh... comments? How should we proceed?
> >
> > --D
>
> My only concern is that legacy applies to in-the-wild kernels, not
> just journals. Any post-3.4 kernel has this problem, which will be
> exposed as soon as e2fsprogs 1.43 is released. A common (enough) use
> case might be, say, a 2TB USB drive being unplugged and replugged
> across machines without a proper shutdown. Old machines will think
> they recognize the journal, but they don't, and will replay something
> destructive.
>
> In other words, this is a future-retro problem. The problem is not so
> much an existing fs experimenting with metadata_csum as a future
> properly-journaled drive being plugged into a legacy, faulty machine.
> The incompat flags are supposed to protect against exactly this
> scenario, but won't. So perhaps you'll need to make a new
> "INCOMPAT_CSUM_V3" flag to protect against it? Actually, dwelling on
> this a bit today, I think that whether or not you retain those 2 extra
> checksum bytes in the final fix, you ought to use a new flag for the format.

Definitely we need a new bit to keep the old kernels out; I wanted others to
weigh in on whether we ought to shrink the struct or put the two bytes to use.

That said... journal recovery is not so well tested, so I'm currently busy
building out debugfs to create journal transactions so we can test recovery
both with and without *_csum.

--D
>
> +Reardon

2014-08-13 21:35:54

by Andreas Dilger

Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)

Doesn't a larger journal_tag_bytes() size mean more overhead for
the journal?

Cheers, Andreas

> On Aug 12, 2014, at 19:08, "Darrick J. Wong" <[email protected]> wrote:
>
>> On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
>>> On 8/11/14, Darrick J. Wong <[email protected]> wrote:
>>> Hi all,
>>>
>>> Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if
>>> the "journal csum v2" feature flag is turned on, block tag records are
>>> being written with two extra bytes of space because we needlessly
>>> execute "x += sizeof(tag.t_checksum);". In 32-bit mode, other parts of
>>> the journal code then perform incorrect size comparisons, leading to
>>> BUG() being called. In 64-bit mode, there's enough padding that bad
>>> things won't happen.
>>>
>>> This is a remnant of the days when I tried to enlarge journal_block_tag_t
>>> to hold the full 32-bit checksum for a data block that's stored in the
>>> journal. Back in 2011, we decided (though sadly I can't find the link; I
>>> think we might have discussed this in the concall) that it was better
>>> not to change the size of journal_block_tag_t than to make it bigger so
>>> that it could hold the full checksum.
>>>
>>> A simple fix for the problem has been proposed by Mr. Reardon which fixes
>>> journal_tag_bytes() and leaves everything else unchanged. However, that
>>> is technically a disk format change, since the size of
>>> journal_block_tag_t on disk changes, albeit only for people running
>>> experimental metadata_csum filesystems. Since we've been allocating disk
>>> space for the enlarged checksum this whole time, if we apply that patch,
>>> anyone with an unclean 64-bit FS will not be able to recover the
>>> journal. (Anyone with an unclean 32-bit FS has been broken the whole
>>> time, and still will be.)
>>>
>>> The other thing we could do is actually use the two bytes to store the
>>> high 16 bits of the checksum, fix the jbd2 helper functions to reflect
>>> that reality (so that they don't BUG()), and change the checksum verify
>>> routine to pass the block if either the full checksum matches, or if the
>>> lower 16 bits match and the upper 16 bits are zero. With this route,
>>> anybody with an uncleanly unmounted FS could still recover the journal,
>>> since we're not changing the size of anything.
>>>
>>> Fortunately, journal tag blocks are fairly ephemeral and nobody ought to
>>> be using metadata_csum on a production filesystem, so at this point we
>>> can probably change the disk format without too many repercussions. If
>>> you make sure to cleanly unmount the filesystem when changing
>>> kernel/e2fsprogs versions, there will be no problems.
>>>
>>> So, uh... comments? How should we proceed?
>>>
>>> --D
>>
>> My only concern is that legacy applies to in-the-wild kernels, not
>> just journals. Any post 3.4 kernel has this problem, which will be
>> exposed as soon as e2fsprogs 1.43 is released. A common (enough) use
>> case might be, say, a 2TB USB drive, being unplugged and replugged
>> across machines without proper shutdown. Old machines will think
>> they recognize the journal but don't actually, and replay something
>> destructive.
>>
>> In other words, this is a future-retro problem. The problem is not so
>> much with an existing fs experimenting with metadata_csum, but with a
>> future properly-journaled drive being plugged into a faulty legacy
>> machine. Those incompat flags are supposed to protect against just this
>> scenario, but won't. So perhaps you'll need to make a new
>> "INCOMPAT_CSUM_V3" flag to protect? Actually, dwelling on this a bit
>> today, I think that whether or not you retain those 2 extra checksum
>> bytes in the final fix, you ought to use a new flag for the format.
>
> Definitely we need a new bit to keep the old kernels out; I wanted others to
> weigh in on whether we ought to shrink the struct or put the two bytes to use.
>
> That said... journal recovery is not so well tested, so I'm currently busy
> building out debugfs to create journal transactions so we can test recovery
> both with and without *_csum.
>
> --D
>>
>> +Reardon

2014-08-13 22:54:04

by Darrick J. Wong

Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)

On Wed, Aug 13, 2014 at 11:35:50PM +0200, Andreas Dilger wrote:
> Doesn't a larger journal_tag_bytes() size mean more overhead for
> the journal?

Yes. If we move to a 16-byte block tag:

typedef struct journal_block_tag3_s
{
	__u32		t_blocknr;	/* The on-disk block number */
	__u32		t_flags;	/* See below */
	__u32		t_blocknr_high;	/* most-significant high 32bits. */
	__u32		t_checksum;	/* crc32c(uuid+seq+block) */
} journal_block_tag3_t;

Notice that this fixes the alignment problem caused by the v2 bug, where the
second tag in a descriptor block starts with a 4-byte field at offset 14.

For a 32-bit FS with 4K blocks and a 32768 block journal, I found through
experimentation that if I fill the journal with nothing but journal blocks
(i.e. no revoke blocks), overhead increases by 264 blocks, or 0.8%. For a
64-bit FS with the same parameters, the increase is 136 blocks, or 0.4%.

I'm hoping that a <1% increase won't discourage people. :)

--D
>
> Cheers, Andreas
>
> > On Aug 12, 2014, at 19:08, "Darrick J. Wong" <[email protected]> wrote:
> >
> >> On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
> >>> On 8/11/14, Darrick J. Wong <[email protected]> wrote:
> >>> Hi all,
> >>>
> >>> Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if
> >>> the "journal csum v2" feature flag is turned on, block tag records are
> >>> being written with two extra bytes of space because we needlessly
> >>> execute "x += sizeof(tag.t_checksum);". In 32-bit mode, other parts of
> >>> the journal code then perform incorrect size comparisons, leading to
> >>> BUG() being called. In 64-bit mode, there's enough padding that bad
> >>> things won't happen.
> >>>
> >>> This is a remnant of the days when I tried to enlarge journal_block_tag_t
> >>> to hold the full 32-bit checksum for a data block that's stored in the
> >>> journal. Back in 2011, we decided (though sadly I can't find the link; I
> >>> think we might have discussed this in the concall) that it was better
> >>> not to change the size of journal_block_tag_t than to make it bigger so
> >>> that it could hold the full checksum.
> >>>
> >>> A simple fix for the problem has been proposed by Mr. Reardon which fixes
> >>> journal_tag_bytes() and leaves everything else unchanged. However, that
> >>> is technically a disk format change, since the size of
> >>> journal_block_tag_t on disk changes, albeit only for people running
> >>> experimental metadata_csum filesystems. Since we've been allocating disk
> >>> space for the enlarged checksum this whole time, if we apply that patch,
> >>> anyone with an unclean 64-bit FS will not be able to recover the
> >>> journal. (Anyone with an unclean 32-bit FS has been broken the whole
> >>> time, and still will be.)
> >>>
> >>> The other thing we could do is actually use the two bytes to store the
> >>> high 16 bits of the checksum, fix the jbd2 helper functions to reflect
> >>> that reality (so that they don't BUG()), and change the checksum verify
> >>> routine to pass the block if either the full checksum matches, or if the
> >>> lower 16 bits match and the upper 16 bits are zero. With this route,
> >>> anybody with an uncleanly unmounted FS could still recover the journal,
> >>> since we're not changing the size of anything.
> >>>
> >>> Fortunately, journal tag blocks are fairly ephemeral and nobody ought to
> >>> be using metadata_csum on a production filesystem, so at this point we
> >>> can probably change the disk format without too many repercussions. If
> >>> you make sure to cleanly unmount the filesystem when changing
> >>> kernel/e2fsprogs versions, there will be no problems.
> >>>
> >>> So, uh... comments? How should we proceed?
> >>>
> >>> --D
> >>
> >> My only concern is that legacy applies to in-the-wild kernels, not
> >> just journals. Any post 3.4 kernel has this problem, which will be
> >> exposed as soon as e2fsprogs 1.43 is released. A common (enough) use
> >> case might be, say, a 2TB USB drive, being unplugged and replugged
> >> across machines without proper shutdown. Old machines will think
> >> they recognize the journal but don't actually, and replay something
> >> destructive.
> >>
> >> In other words, this is a future-retro problem. The problem is not so
> >> much with an existing fs experimenting with metadata_csum, but with a
> >> future properly-journaled drive being plugged into a faulty legacy
> >> machine. Those incompat flags are supposed to protect against just this
> >> scenario, but won't. So perhaps you'll need to make a new
> >> "INCOMPAT_CSUM_V3" flag to protect? Actually, dwelling on this a bit
> >> today, I think that whether or not you retain those 2 extra checksum
> >> bytes in the final fix, you ought to use a new flag for the format.
> >
> > Definitely we need a new bit to keep the old kernels out; I wanted others to
> > weigh in on whether we ought to shrink the struct or put the two bytes to use.
> >
> > That said... journal recovery is not so well tested, so I'm currently busy
> > building out debugfs to create journal transactions so we can test recovery
> > both with and without *_csum.
> >
> > --D
> >>
> >> +Reardon

2014-08-14 01:56:44

by Darrick J. Wong

Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)

On Wed, Aug 13, 2014 at 03:53:58PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 13, 2014 at 11:35:50PM +0200, Andreas Dilger wrote:
> > Doesn't a larger journal_tag_bytes() size mean more overhead for
> > the journal?
>
> Yes. If we move to a 16-byte block tag:
>
> typedef struct journal_block_tag3_s
> {
> 	__u32		t_blocknr;	/* The on-disk block number */
> 	__u32		t_flags;	/* See below */
> 	__u32		t_blocknr_high;	/* most-significant high 32bits. */
> 	__u32		t_checksum;	/* crc32c(uuid+seq+block) */
> } journal_block_tag3_t;
>
> Notice that this fixes the alignment problem caused by the v2 bug, where the
> second tag in a descriptor block starts with a 4-byte field at offset 14.
>
> For a 32-bit FS with 4K blocks and a 32768 block journal, I found through
> experimentation that if I fill the journal with nothing but journal blocks
> (i.e. no revoke blocks), overhead increases by 264 blocks, or 0.8%. For a
> 64-bit FS with the same parameters, the increase is 136 blocks, or 0.4%.
>
> I'm hoping that a <1% increase won't discourage people. :)

Nuts, those figures are for 1k block filesystems. For a FS with 4k blocks,
storing 16384 blocks in the journal required this many blocks:

16450 for 64bit journal_csum_v3
16450 for 32bit journal_csum_v3 (assuming I reuse the tag3 structure as-is)

16442 for 64bit journal_csum_v2
16426 for 32bit journal_csum_v2

16434 for 64bit journal_csum_v1
16418 for 32bit journal_csum_v1

16434 for 64bit none
16418 for 32bit none

So for 32-bit FSes, the extra overhead to go from no journal checksumming at
all to v3 checksums is 0.2%. For 64-bit FSes, the extra overhead is 0.1%.

The overhead to move a 32-bit v2 FS to a v3 FS is 0.1%, and for a 64-bit FS
it's 0.05%.

Also, for those watching at home, I fixed the infinite loop on journal block
checksum failure that someone complained about months ago.

--D

>
> --D
> >
> > Cheers, Andreas
> >
> > > On Aug 12, 2014, at 19:08, "Darrick J. Wong" <[email protected]> wrote:
> > >
> > >> On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
> > >>> On 8/11/14, Darrick J. Wong <[email protected]> wrote:
> > >>> Hi all,
> > >>>
> > >>> Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if
> > >>> the "journal csum v2" feature flag is turned on, block tag records are
> > >>> being written with two extra bytes of space because we needlessly
> > >>> execute "x += sizeof(tag.t_checksum);". In 32-bit mode, other parts of
> > >>> the journal code then perform incorrect size comparisons, leading to
> > >>> BUG() being called. In 64-bit mode, there's enough padding that bad
> > >>> things won't happen.
> > >>>
> > >>> This is a remnant of the days when I tried to enlarge journal_block_tag_t
> > >>> to hold the full 32-bit checksum for a data block that's stored in the
> > >>> journal. Back in 2011, we decided (though sadly I can't find the link; I
> > >>> think we might have discussed this in the concall) that it was better
> > >>> not to change the size of journal_block_tag_t than to make it bigger so
> > >>> that it could hold the full checksum.
> > >>>
> > >>> A simple fix for the problem has been proposed by Mr. Reardon which fixes
> > >>> journal_tag_bytes() and leaves everything else unchanged. However, that
> > >>> is technically a disk format change, since the size of
> > >>> journal_block_tag_t on disk changes, albeit only for people running
> > >>> experimental metadata_csum filesystems. Since we've been allocating disk
> > >>> space for the enlarged checksum this whole time, if we apply that patch,
> > >>> anyone with an unclean 64-bit FS will not be able to recover the
> > >>> journal. (Anyone with an unclean 32-bit FS has been broken the whole
> > >>> time, and still will be.)
> > >>>
> > >>> The other thing we could do is actually use the two bytes to store the
> > >>> high 16 bits of the checksum, fix the jbd2 helper functions to reflect
> > >>> that reality (so that they don't BUG()), and change the checksum verify
> > >>> routine to pass the block if either the full checksum matches, or if the
> > >>> lower 16 bits match and the upper 16 bits are zero. With this route,
> > >>> anybody with an uncleanly unmounted FS could still recover the journal,
> > >>> since we're not changing the size of anything.
> > >>>
> > >>> Fortunately, journal tag blocks are fairly ephemeral and nobody ought to
> > >>> be using metadata_csum on a production filesystem, so at this point we
> > >>> can probably change the disk format without too many repercussions. If
> > >>> you make sure to cleanly unmount the filesystem when changing
> > >>> kernel/e2fsprogs versions, there will be no problems.
> > >>>
> > >>> So, uh... comments? How should we proceed?
> > >>>
> > >>> --D
> > >>
> > >> My only concern is that legacy applies to in-the-wild kernels, not
> > >> just journals. Any post 3.4 kernel has this problem, which will be
> > >> exposed as soon as e2fsprogs 1.43 is released. A common (enough) use
> > >> case might be, say, a 2TB USB drive, being unplugged and replugged
> > >> across machines without proper shutdown. Old machines will think
> > >> they recognize the journal but don't actually, and replay something
> > >> destructive.
> > >>
> > >> In other words, this is a future-retro problem. The problem is not so
> > >> much with an existing fs experimenting with metadata_csum, but with a
> > >> future properly-journaled drive being plugged into a faulty legacy
> > >> machine. Those incompat flags are supposed to protect against just this
> > >> scenario, but won't. So perhaps you'll need to make a new
> > >> "INCOMPAT_CSUM_V3" flag to protect? Actually, dwelling on this a bit
> > >> today, I think that whether or not you retain those 2 extra checksum
> > >> bytes in the final fix, you ought to use a new flag for the format.
> > >
> > > Definitely we need a new bit to keep the old kernels out; I wanted others to
> > > weigh in on whether we ought to shrink the struct or put the two bytes to use.
> > >
> > > That said... journal recovery is not so well tested, so I'm currently busy
> > > building out debugfs to create journal transactions so we can test recovery
> > > both with and without *_csum.
> > >
> > > --D
> > >>
> > >> +Reardon