2009-05-11 15:03:14

by Alexander Shishkin

[permalink] [raw]
Subject: [Q] ext3 mkfs: zeroing journal blocks

Hi,

Here's a question regarding ext3, jbd and mkfs. I'm not 100% confident
this is the right
list, got it from MAINTAINERS for ext3 and jbd. Please correct me if
this is wrong.

As far as I could tell from brief looking at jbd code, it seemed to me
that the only thing
that has to be reset during the filesystem creation time is journal
superblock (talking
about the default case when journal resides within an ext3 partition).
However, currently
mke2fs -j would zero every journal block no matter what. So, the
question is: can this
zeroing really be avoided in mkfs?
I tried commenting-out ext2fs_zero_block() in mkjournal_proc() and it
seems to speed
up mkfs a great deal while the kernel is still able to mount the
partition afterwards. Also,
for the sake of experiment, I filled the partition with urandom's
contents before doing
the modified mkfs and it still works. My next step in this direction
would be to go
through jbd code, but before doing that, I thought, I'd ask here.

Please CC me in replies as I'm not (yet) subscribed.

Regards,
--
Alex


2009-05-11 17:59:00

by Eric Sandeen

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

Alexander Shishkin wrote:
> Hi,
>
> Here's a question regarding ext3, jbd and mkfs. I'm not 100%
> confident this is the right list, got it from MAINTAINERS for ext3
> and jbd. Please correct me if this is wrong.
>
> As far as I could tell from brief looking at jbd code, it seemed to
> me that the only thing that has to be reset during the filesystem
> creation time is journal superblock (talking about the default case
> when journal resides within an ext3 partition). However, currently
> mke2fs -j would zero every journal block no matter what. So, the
> question is: can this zeroing really be avoided in mkfs? I tried
> commenting-out ext2fs_zero_block() in mkjournal_proc() and it seems
> to speed up mkfs a great deal while the kernel is still able to mount
> the partition afterwards. Also, for the sake of experiment, I filled
> the partition with urandom's contents before doing the modified mkfs
> and it still works. My next step in this direction would be to go
> through jbd code, but before doing that, I thought, I'd ask here.
>
> Please CC me in replies as I'm not (yet) subscribed.
>
> Regards,

Hi Alexander -

Yes, this is a fine list for these questions.

Looks like commit 16ed5b3af43c72f60991222b9d7ab65cf53f203d added the
block zeroing at the same time as external journal support went in way
back in 2001 ... IOW, it wasn't added later to fix anything in
particular. Also even at that time, internal journals were not zeroed,
so it's not like that was removed in the meantime. Seems extraneous to
me, but ... maybe Ted knows more ...

-Eric

2009-05-11 18:21:26

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

On May 11, 2009 12:58 -0500, Eric Sandeen wrote:
> Alexander Shishkin wrote:
> > As far as I could tell from brief looking at jbd code, it seemed to
> > me that the only thing that has to be reset during the filesystem
> > creation time is journal superblock (talking about the default case
> > when journal resides within an ext3 partition). However, currently
> > mke2fs -j would zero every journal block no matter what. So, the
> > question is: can this zeroing really be avoided in mkfs? I tried
> > commenting-out ext2fs_zero_block() in mkjournal_proc() and it seems
> > to speed up mkfs a great deal while the kernel is still able to mount
> > the partition afterwards. Also, for the sake of experiment, I filled
> > the partition with urandom's contents before doing the modified mkfs
> > and it still works. My next step in this direction would be to go
> > through jbd code, but before doing that, I thought, I'd ask here.
>
> Looks like commit 16ed5b3af43c72f60991222b9d7ab65cf53f203d added the
> block zeroing at the same time as external journal support went in way
> back in 2001 ... IOW, it wasn't added later to fix anything in
> particular. Also even at that time, internal journals were not zeroed,
> so it's not like that was removed in the meantime. Seems extraneous to
> me, but ... maybe Ted knows more ...

The reason that the journal is zeroed is because there is some chance
that old (valid at the time) transaction headers and commit blocks might
be in the journal and could accidentally be "recovered" and cause bad
corruption of the filesystem.

That said, the chance of this is relatively low, so if you are feeling
lucky the zeroing of the journal could be skipped. This accidental
journal recovery could only happen if a valid transaction completed at
block X, then there was a stale transaction from the filesystem's
previous life starting at block X+1 with the next consecutive transaction
number. It is pretty unlikely I think.

We could avoid this problem entirely if the journal checksum was computed
to include the JBD UUID or something in the checksum value, since even
old transactions with the correct location and transaction ID would fail
the checksum because the new JBD UUID would be different. This could
be implemented as part of the "V2 per-block journal checksum", if anyone
had time to work on that.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-05-11 18:44:20

by Eric Sandeen

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

Andreas Dilger wrote:

> The reason that the journal is zeroed is because there is some chance
> that old (valid at the time) transaction headers and commit blocks might
> be in the journal and could accidentally be "recovered" and cause bad
> corruption of the filesystem.

But I guess the question is, why isn't a normal internal log zeroed?

If I'm reading it right only external logs get this treatment, and I
think that's what generated the original question from Alexander.

-Eric

2009-05-11 19:35:17

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

On Mon, May 11, 2009 at 01:44:18PM -0500, Eric Sandeen wrote:
> Andreas Dilger wrote:
>
> > The reason that the journal is zeroed is because there is some chance
> > that old (valid at the time) transaction headers and commit blocks might
> > be in the journal and could accidentally be "recovered" and cause bad
> > corruption of the filesystem.
>
> But I guess the question is, why isn't a normal internal log zeroed?
>
> If I'm reading it right only external logs get this treatment, and I
> think that's what generated the original question from Alexander.

Internal journals are indeed cleared. Check out write_journal_inode()
in lib/ext2fs/mkjournal.c, which calls ext2fs_block_iterate() passing
in the callback function mkjournal_proc(), which calls
ext2fs_zero_blocks().

- Ted

2009-05-11 19:36:00

by Andreas Dilger

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

On May 11, 2009 13:44 -0500, Eric Sandeen wrote:
> Andreas Dilger wrote:
> > The reason that the journal is zeroed is because there is some chance
> > that old (valid at the time) transaction headers and commit blocks might
> > be in the journal and could accidentally be "recovered" and cause bad
> > corruption of the filesystem.
>
> But I guess the question is, why isn't a normal internal log zeroed?
>
> If I'm reading it right only external logs get this treatment, and I
> think that's what generated the original question from Alexander.

Hmm, possibly because when ext3 was first allocated the internal journal
created was "dd if=/dev/zero of=/mnt/fs/.journal bs=1M count={jnl_size}"
on the filesystem mounted as ext2, so normal filesystem IO would handle
the zeroing of the blocks. Even today if tune2fs adds a journal to a
filesystem it does the zero filling of the journal.

Looking at the mke2fs code it also appears to be doing zeroing of the
journal inode in:

mke2fs
->ext2fs_add_journal_inode
->write_journal_inode
->ext2fs_block_iterate
->mkjournal_proc (increment zero_count)
->ext2fs_zero_blocks

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-05-12 11:55:11

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

On 11 May 2009 21:44, Eric Sandeen <[email protected]> wrote:
> Andreas Dilger wrote:
>
>> The reason that the journal is zeroed is because there is some chance
>> that old (valid at the time) transaction headers and commit blocks might
>> be in the journal and could accidentally be "recovered" and cause bad
>> corruption of the filesystem.
>
> But I guess the question is, why isn't a normal internal log zeroed?
>
> If I'm reading it right only external logs get this treatment, and I
> think that's what generated the original question from Alexander.

My concern was basically if it is safe to skip zeroing for internal journal.

Regards,
--
Alex

2009-05-12 12:13:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

On Tue, May 12, 2009 at 02:55:10PM +0300, Alexander Shishkin wrote:
> On 11 May 2009 21:44, Eric Sandeen <[email protected]> wrote:
> > Andreas Dilger wrote:
> >
> >> The reason that the journal is zeroed is because there is some chance
> >> that old (valid at the time) transaction headers and commit blocks might
> >> be in the journal and could accidentally be "recovered" and cause bad
> >> corruption of the filesystem.
> >
> > But I guess the question is, why isn't a normal internal log zeroed?
> >
> > If I'm reading it right only external logs get this treatment, and I
> > think that's what generated the original question from Alexander.
>
> My concern was basically if it is safe to skip zeroing for internal journal.

Strictly speaking, no. Most of the time you'll get lucky. The place
where you will get into trouble will be is if there is leftover
uninitialized garbage (particularly if you are reformatting an
existing ext3/4 filesystem) that looks like a journal log block, with
the correct journal transaction number, *and* the system crashes
before the journal has been completely written through at least once.

What precisely is your concern? Normally the journal isn't that big,
and it's a contiguous write --- so it doesn't take that long. Are you
worried about the time it takes, or trying to avoid some writes to an
SSD, or some other concern? If we know it's an SSD, where reads are
fast, and writes are slow, I suppose we could scan the disk looking
for potentially dangerous blocks and zero them manually. It's really
not clear it's worth the effort though.

- Ted

2009-05-12 12:49:31

by Alexander Shishkin

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

On 12 May 2009 15:13, Theodore Tso <[email protected]> wrote:
>> My concern was basically if it is safe to skip zeroing for internal journal.
>
> Strictly speaking, no.  Most of the time you'll get lucky.  The place
> where you will get into trouble will be is if there is leftover
> uninitialized garbage (particularly if you are reformatting an
> existing ext3/4 filesystem) that looks like a journal log block, with
> the correct journal transaction number, *and* the system crashes
> before the journal has been completely written through at least once.

So, what Andreas explained yesterday also applies to the internal log
case. I see. Would you say it's possible to prevent this, for instance
somehow say, by means checksums as Andreas suggested?

> What precisely is your concern?  Normally the journal isn't that big,
> and it's a contiguous write --- so it doesn't take that long.  Are you
> worried about the time it takes, or trying to avoid some writes to an
> SSD, or some other concern?  If we know it's an SSD, where reads are

It's an mmc and it (mkfs) runs almost two times faster without zeroing
the journal. The only thing I'm worried about is the time that it
takes for mke2fs -j to complete. I've done some caching trickery to
unix_io.c which I'm going to post here separately, but most of the
time seems to be taken by the journal.

> fast, and writes are slow, I suppose we could scan the disk looking
> for potentially dangerous blocks and zero them manually.  It's really
> not clear it's worth the effort though.

Hm. This might work, I'll look into it. Thanks for the idea!

Regards,
--
Alex

2009-05-12 21:04:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Q] ext3 mkfs: zeroing journal blocks

On Tue, May 12, 2009 at 03:49:30PM +0300, Alexander Shishkin wrote:
>
> So, what Andreas explained yesterday also applies to the internal log
> case. I see. Would you say it's possible to prevent this, for instance
> somehow say, by means checksums as Andreas suggested?
>

It's *possible*, but it's not a trivial amount of work; it requires
both kernel and userspace changes, though.

> It's an mmc and it (mkfs) runs almost two times faster without zeroing
> the journal. The only thing I'm worried about is the time that it
> takes for mke2fs -j to complete. I've done some caching trickery to
> unix_io.c which I'm going to post here separately, but most of the
> time seems to be taken by the journal.

But why do you care about the time it akes for mke2fs -j to complete?
How much time is it taking? Normally mke2fs isn't one of those
programs which gets run all the time....

- Ted

2009-07-29 15:59:20

by Alexander Shishkin

[permalink] [raw]
Subject: [PATCH] [RFC] mkjournal: zero journal blocks only when necessary

Will something like this do?

The only blocks that might theoretically (although very unlikely) be
dangerous for newly-created filesystem's integrity are those that
still contain valid signatures, others can be safely skipped.

Since reads are generally faster (or at least, not slower), this
gives some performance increase during mkfs run.

Signed-off-by: Alexander Shishkin <[email protected]>
---
lib/ext2fs/mkjournal.c | 49 ++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
index f5a9dba..bb6e748 100644
--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -144,6 +144,44 @@ errout:
* attempt to free the static zeroizing buffer. (This is to keep
* programs that check for memory leaks happy.)
*/
+static errcode_t jzero_blocks(io_channel channel, unsigned long long block,
+ int count)
+{
+ errcode_t retval;
+ unsigned long long n;
+ static void *__buf = NULL;
+ struct unix_private_data *data;
+
+ data = channel->private_data;
+ if (!__buf)
+ __buf = malloc(channel->block_size);
+ if (!__buf)
+ return -1;
+
+ for (n = block; n < block + count; n++) {
+ journal_header_t *jr;
+
+ retval = io_channel_read_blk(channel, n, 1, __buf);
+ if (retval)
+ goto out_err;
+ continue;
+
+ jr = __buf;
+ if (jr->h_magic == htonl(JFS_MAGIC_NUMBER) &&
+ jr->h_blocktype == htonl(JFS_REVOKE_BLOCK)) {
+ memset(__buf, 0, channel->block_size);
+ retval = io_channel_write_blk(channel, n, 1, __buf);
+ }
+
+ if (retval)
+ goto out_err;
+ }
+
+ return 0;
+out_err:
+ return -1;
+}
+
#define STRIDE_LENGTH 8
errcode_t ext2fs_zero_blocks(ext2_filsys fs, blk_t blk, int num,
blk_t *ret_blk, int *ret_count)
@@ -238,10 +276,9 @@ static int mkjournal_proc(ext2_filsys fs,
(es->zero_count < 1024))
es->zero_count++;
else {
- retval = ext2fs_zero_blocks(fs,
- es->blk_to_zero,
- es->zero_count,
- 0, 0);
+ retval = jzero_blocks(fs->io,
+ es->blk_to_zero,
+ es->zero_count);
es->zero_count = 0;
}
}
@@ -339,8 +376,8 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
goto errout;
}
if (es.zero_count) {
- retval = ext2fs_zero_blocks(fs, es.blk_to_zero,
- es.zero_count, 0, 0);
+ retval = jzero_blocks(fs->io, es.blk_to_zero,
+ es.zero_count);
if (retval)
goto errout;
}
--
1.6.3.3


2009-07-29 17:17:04

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] [RFC] mkjournal: zero journal blocks only when necessary

On Wed, Jul 29, 2009 at 06:58:16PM +0300, Alexander Shishkin wrote:
> Will something like this do?
>
> The only blocks that might theoretically (although very unlikely) be
> dangerous for newly-created filesystem's integrity are those that
> still contain valid signatures, others can be safely skipped.
>
> Since reads are generally faster (or at least, not slower), this
> gives some performance increase during mkfs run.

Did you bother benchmarking what this would do on normal disk drives?
Previously we were writing out the blocks to be zeroed in large chunks
at a time for speed reasons. This patch reduces it to reading the
journal one block at a time, and if it contains a valid signature it
writes a zero block.

The patch also doesn't check for commit blocks, which are just as much
a problem (if not more so) than revoke blocks.

- Ted