2010-02-10 11:44:48

by Andreas Dilger

[permalink] [raw]
Subject: [PATCH] mk2fs lazy_journal_init option

Attached is a patch to skip zeroing of the journal if the "-E
lazy_journal_init" option is given to mke2fs (named to complement the
"-E lazy_itable_init" option). This can speed up format time
significantly for large journal devices. There is only a short-term
risk of problems with the uninitialized journal, until such a time
that the journal has been overwritten once.

To have any kind of problem with the uninitialized journal, the
filesystem would have to have been newly reformatted twice nearly in a
row, and then the node crashes before filling the journal even once,
and the previously-written content would have to line up precisely on
disk and also have the same transaction ID.

Patch has been lightly tested, showing mke2fs times steady at 14s for
a 40GB filesystem, regardless of journal size, while previously it
took up to 45s for an internal 2GB journal.

Signed-off-by: Andreas Dilger <[email protected]>

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


Attachments:
e2fsprogs-lazy_journal_init.patch (5.94 kB)

2010-02-16 22:15:26

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] mk2fs lazy_journal_init option

On 2010-02-10, at 04:44, Andreas Dilger wrote:
> Attached is a patch to skip zeroing of the journal if the
> "-E lazy_journal_init" option is given to mke2fs (named to
> complement the "-E lazy_itable_init" option). This can
> speed up format time significantly for large journal devices.
> There's only a short-term risk of problems with uninitialized
> journal, until the journal has been overwritten once.
>
> Patch has been lightly tested, showing mke2fs times steady
> at 14s for a 40GB filesystem, regardless of journal size,
> while previously it took up to 45s for an internal 2GB journal.

While testing this patch more thoroughly, we uncovered a bug
in the mke2fs/libext2fs code. It seems that when running:

mke2fs -J size=X -O extents /dev/XXX

for any size > 512 the journal creation time is growing
exponentially:

no journal-> 12s
size=128 -> 14s
size=256 -> 16s
size=512 -> 21s
size=768 -> 143s
size=1024-> 298s
size=1280-> 663s

We wanted originally to use size=4000, but this took so
long we thought it was hung, and started investigating.

This happens even without the "-E lazy_itable_init" option.
Running ltrace on mke2fs shows lots of zero writes (to be
expected for journal zeroing) followed by a single read
(completes quickly) and many thousands of memcpy() calls.
The mke2fs program is completely CPU bound (99.9% user).

Running with the "-E lazy_itable_init" the writes/reads go
away, and all that is left is an endless stream of memcpy().

It seems to loop in ext2fs_block_iterate2->mkjournal_proc()
forever:

426 for (blockcnt = extent.e_lblk, j = 0;
427 j < extent.e_len;
428 blk++, blockcnt++, j++) {
429 new_blk = blk;
430 r = (*ctx.func)(fs, &new_blk, blockcnt,
431 0, 0, priv_data);


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2010-02-16 23:11:25

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] mk2fs lazy_journal_init option

Andreas Dilger wrote:
> On 2010-02-10, at 04:44, Andreas Dilger wrote:
>> Attached is a patch to skip zeroing of the journal if the
>> "-E lazy_journal_init" option is given to mke2fs (named to
>> complement the "-E lazy_itable_init" option). This can
>> speed up format time significantly for large journal devices.
>> There's only a short-term risk of problems with uninitialized
>> journal, until the journal has been overwritten once.
>>
>> Patch has been lightly tested, showing mke2fs times steady
>> at 14s for a 40GB filesystem, regardless of journal size,
>> while previously it took up to 45s for an internal 2GB journal.
>
> While testing this patch more thoroughly, we uncovered a bug
> in the mke2fs/libext2fs code. It seems that when running:
>
> mke2fs -J size=X -O extents /dev/XXX
>
> for any size > 512 the journal creation time is growing
> exponentially:
>
> no journal-> 12s
> size=128 -> 14s
> size=256 -> 16s
> size=512 -> 21s
> size=768 -> 143s
> size=1024-> 298s
> size=1280-> 663s
>
> We wanted originally to use size=4000, but this took so
> long we thought it was hung, and started investigating.
>
> This happens even without the "-E lazy_itable_init" option.
> Running ltrace on mke2fs shows lots of zero writes (to be
> expected for journal zeroing) followed by a single read
> (completes quickly) and many thousands of memcpy() calls.
> The mke2fs program is completely CPU bound (99.9% user).
>
> Running with the "-E lazy_itable_init" the writes/reads go
> away, and all that is left is an endless stream of memcpy().
>
> It seems to loop in ext2fs_block_iterate2->mkjournal_proc()
> forever:
>
> 426 for (blockcnt = extent.e_lblk, j = 0;
> 427 j < extent.e_len;
> 428 blk++, blockcnt++, j++) {
> 429 new_blk = blk;
> 430 r = (*ctx.func)(fs, &new_blk, blockcnt,
> 431 0, 0, priv_data);
>

Haven't dug all the way through it but I think this is another
in the saga of blk_t vs. blk64_t. This seems to fix (?) it:

--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -218,7 +218,7 @@ struct mkjournal_struct {
};

static int mkjournal_proc(ext2_filsys fs,
- blk_t *blocknr,
+ blk64_t *blocknr,
e2_blkcnt_t blockcnt,
blk_t ref_block EXT2FS_ATTR((unused)),
int ref_offset EXT2FS_ATTR((unused)),

though I doubt that is correct/complete.

-Eric

2010-02-16 23:21:37

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] mk2fs lazy_journal_init option

Eric Sandeen wrote:

> Haven't dug all the way through it but I think this is another
> in the saga of blk_t vs. blk64_t. This seems to fix (?) it:
>
> --- a/lib/ext2fs/mkjournal.c
> +++ b/lib/ext2fs/mkjournal.c
> @@ -218,7 +218,7 @@ struct mkjournal_struct {
> };
>
> static int mkjournal_proc(ext2_filsys fs,
> - blk_t *blocknr,
> + blk64_t *blocknr,
> e2_blkcnt_t blockcnt,
> blk_t ref_block EXT2FS_ATTR((unused)),
> int ref_offset EXT2FS_ATTR((unused)),
>
> though I doubt that is correct/complete.

Humm my git tree is not what I thought it was, the above might be a
wild goose chase, sorry.

-Eric

2010-02-19 19:53:54

by Nic Case

[permalink] [raw]
Subject: Re: [PATCH] mk2fs lazy_journal_init option

I hope you decide to continue unconditionally zeroing the journal at mkfs time.

In this case, block_iterate2 was being used to create a file, but when the file became big enough to have an extent leaf, ext2fs_get_extent was returning an extent when it shouldn't have been. That caused ext2fs_block_iterate2 to go wrong, eventually calling mkjournal_proc millions of times more than it needed to (during which it would immediately return). A patch which resolves this is below. It shows normal mkfs times for me with a 512 MB journal and passes make check.

Signed-off-by Nic Case <[email protected]>

--- lib/ext2fs/extent-orig.c 2010-02-05 08:58:41.000000000 -0600
+++ lib/ext2fs/extent.c 2010-02-19 13:37:32.000000000 -0600
@@ -307,16 +307,12 @@
op = EXT2_EXTENT_DOWN;
} else if (path->left > 0)
op = EXT2_EXTENT_NEXT_SIB;
- else if (handle->level > 0)
- op = EXT2_EXTENT_UP;
else
return EXT2_ET_EXTENT_NO_NEXT;
} else {
/* leaf node */
if (path->left > 0)
op = EXT2_EXTENT_NEXT_SIB;
- else if (handle->level > 0)
- op = EXT2_EXTENT_UP;
else
return EXT2_ET_EXTENT_NO_NEXT;
}







2010-02-24 05:30:41

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] mk2fs lazy_journal_init option

On 2010-02-19, at 12:53, number9652 wrote:
> I hope you decide to continue unconditionally zeroing the journal at
> mkfs time.

That will continue to be the default for the time being, unless we can
be sure that there is no possibility of error. That would be possible
in conjunction with the journal checksum patch, if the checksum
function is changed slightly to include the journal UUID.

> In this case, block_iterate2 was being used to create a file, but
> when the file became big enough to have an extent leaf,
> ext2fs_get_extent was returning an extent when it shouldn't have
> been. That caused ext2fs_block_iterate2 to go wrong, eventually
> calling mkjournal_proc millions of times more than it needed to
> (during which it would immediately return). A patch which resolves
> this is below. It shows normal mkfs times for me with a 512 MB
> journal and passes make check.
>
> Signed-off-by Nic Case <[email protected]>

I haven't looked at this from a logic POV, but I tested this with a
1GB journal and it ran the same time with/without specifying "-O
extents" (10s vs. 650s without this patch). I also ran our additional
e2fsck extents tests without errors, and on a couple of my extent-
based filesystems.

Tested-by: Andreas Dilger <[email protected]>

> --- lib/ext2fs/extent-orig.c 2010-02-05 08:58:41.000000000 -0600
> +++ lib/ext2fs/extent.c 2010-02-19 13:37:32.000000000 -0600
> @@ -307,16 +307,12 @@
> op = EXT2_EXTENT_DOWN;
> } else if (path->left > 0)
> op = EXT2_EXTENT_NEXT_SIB;
> - else if (handle->level > 0)
> - op = EXT2_EXTENT_UP;
> else
> return EXT2_ET_EXTENT_NO_NEXT;
> } else {
> /* leaf node */
> if (path->left > 0)
> op = EXT2_EXTENT_NEXT_SIB;
> - else if (handle->level > 0)
> - op = EXT2_EXTENT_UP;
> else
> return EXT2_ET_EXTENT_NO_NEXT;
> }
>
>
>
>
>
>


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2010-03-02 19:36:22

by Nic Case

[permalink] [raw]
Subject: Re: [PATCH] mk2fs lazy_journal_init option

--- On Tue, 2/23/10, Andreas Dilger wrote:
> I haven't looked at this from a logic POV, but I tested
> this with a 1GB journal and it ran the same time
> with/without specifying "-O extents" (10s vs. 650s without
> this patch). I also ran our additional e2fsck extents
> tests without errors, and on a couple of my extent-based
> filesystems.

Thanks for testing this patch, but after looking at it again,
it is clear it is not the right thing to do. If there are at
least two levels of the extent tree with more than one entry,
iterating through with EXTENT_NEXT will return NO_EXTENT before
reaching the end of the file. I sent another patch to the ext4
mailing list (with ext2fs_extent_get in the title) that I believe
is the right way to fix the problem.

Nic