While playing with filesystems using flex_bg, I noticed that the journal
file may be fragmented when there is a lot of metadata in the first
flex group.
For example, with this command: mkfs.ext4 -t ext4dev -G512 /dev/sdb1
The journal file is reported by "stat <8>" in debugfs to be like this:
Inode: 8 Type: regular Mode: 0600 Flags: 0x0
Generation: 0 Version: 0x00000000
User: 0 Group: 0 Size: 134217728
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 262416
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x48b4a426 -- Wed Aug 27 02:47:34 2008
atime: 0x00000000 -- Thu Jan 1 01:00:00 1970
mtime: 0x48b4a426 -- Wed Aug 27 02:47:34 2008
Size of extra inode fields: 0
BLOCKS:
(0-11):28679-28690, (IND):28691, (12-1035):28692-29715, (DIND):29716,
(IND):29717, (1036-2059):29718-30741, (IND):30742,
(2060-3083):30743-31766, (IND):31767, (3084-4083):31768-32767,
(4084-4107):94209-94232, (IND):94233, (4108-5131):94234-95257,
(IND):95258, (5132-6155):95259-96282, (IND):96283,
(6156-7179):96284-97307, (IND):97308, (7180-8174):97309-98303,
(8175-8203):159745-159773, (IND):159774, (8204-9227):159775-160798,
(IND):160799, (9228-10251):160800-161823, (IND):161824,
(10252-11275):161825-162848, (IND):162849, (11276-12265):162850-163839,
(12266-12299):225281-225314, (IND):225315, (12300-13323):225316-226339,
(IND):226340, (13324-14347):226341-227364, (IND):227365,
(14348-15371):227366-228389, (IND):228390, (15372-16356):228391-229375,
(16357-16395):284673-284711, (IND):284712, (16396-17419):284713-285736,
(IND):285737, (17420-18443):285738-286761, (IND):286762,
(18444-19467):286763-287786, (IND):287787, (19468-20491):287788-288811,
(IND):288812, (20492-21515):288813-289836, (IND):289837,
(21516-22539):289838-290861, (IND):290862, (22540-23563):290863-291886,
(IND):291887, (23564-24587):291888-292911, (IND):292912,
(24588-25611):292913-293936, (IND):293937, (25612-26585):293938-294911,
(26586-26635):295937-295986, (IND):295987, (26636-27659):295988-297011,
(IND):297012, (27660-28683):297013-298036, (IND):298037,
(28684-29707):298038-299061, (IND):299062, (29708-30731):299063-300086,
(IND):300087, (30732-31755):300088-301111, (IND):301112,
(31756-32768):301113-302125
TOTAL: 32802
This journal file is split into 5 parts: blocks at 28679-32767,
then 94209-98303, then 159745-163839, then 225281-229375 and finally
284673-302125.
Of course "-G512" on the mkfs command line is an extreme case, but it
clearly shows the fragmentation.
I tried to find out whether this fragmentation has any performance
impact, so I quickly wrote the following patch for the mkfs program:
Index: e2fsprogs/lib/ext2fs/mkjournal.c
===================================================================
--- e2fsprogs.orig/lib/ext2fs/mkjournal.c 2008-08-27 02:37:59.000000000 +0200
+++ e2fsprogs/lib/ext2fs/mkjournal.c 2008-08-27 14:51:02.000000000 +0200
@@ -220,7 +220,11 @@ static int mkjournal_proc(ext2_filsys fs
last_blk = *blocknr;
return 0;
}
- retval = ext2fs_new_block(fs, last_blk, 0, &new_blk);
+ retval = ext2fs_get_free_blocks(fs, ref_block,
+ fs->super->s_blocks_count,
+ es->num_blocks, fs->block_map,
+ &new_blk);
+
if (retval) {
es->err = retval;
return BLOCK_ABORT;
This makes the mkfs time a bit longer but ends up with an unfragmented
journal file: debugfs "stat <8>" reports that the journal file uses
contiguous blocks from 295937 to 328738.
Then I launched bonnie++ to test the performance impact. This is my
test script:
mkfs.ext4 -t ext4dev -G512 /dev/sdb1
mount -t ext4dev -o data=journal /dev/sdb1 /mnt/test
bonnie++ -u root -s 0 -n 4000 -d /mnt/test/
And the results:
Without patch :
Version 1.03d ------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
4000 3978 7 602 0 518 1 3962 8 520 0 326 1
With patch :
Version 1.03d ------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
4000 4180 8 736 1 543 1 4029 8 556 0 335 1
Difference:
+5.0% +22% +4.8% +1.6% +6.9% +2.7%
Conclusion:
First, the largest performance improvements are on read operations,
which, if I am not wrong, have nothing to do with the journal file.
This is surprising and may indicate that those results are wrong, but
I can't see why right now.
Second, there is a slight improvement on write operations, so the
journal file defragmentation seems to have a positive impact in this
test.
I'm still bothered by the performance increase on reads, so I will
launch some more tests and see whether it is consistent.
Please, feel free to give me any comments you may have on this subject.
Thanks.
Frederic
On Wed, 27 Aug 2008 19:36:07 +0200
Frédéric Bohé <[email protected]> wrote:
> While playing with filesystems using flex_bg, I noticed that the journal
> file may be fragmented when there is a lot of metadata in the first
> flex group.
> For example, with this command: mkfs.ext4 -t ext4dev -G512 /dev/sdb1
> The journal file is reported by "stat <8>" in debugfs to be like this:
>
> Inode: 8 Type: regular Mode: 0600 Flags: 0x0
> Generation: 0 Version: 0x00000000
> User: 0 Group: 0 Size: 134217728
> File ACL: 0 Directory ACL: 0
> Links: 1 Blockcount: 262416
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x48b4a426 -- Wed Aug 27 02:47:34 2008
> atime: 0x00000000 -- Thu Jan 1 01:00:00 1970
> mtime: 0x48b4a426 -- Wed Aug 27 02:47:34 2008
> Size of extra inode fields: 0
> BLOCKS:
> (0-11):28679-28690, (IND):28691, (12-1035):28692-29715, (DIND):29716,
> (IND):29717, (1036-2059):29718-30741, (IND):30742,
> (2060-3083):30743-31766, (IND):31767, (3084-4083):31768-32767,
> (4084-4107):94209-94232, (IND):94233, (4108-5131):94234-95257,
> (IND):95258, (5132-6155):95259-96282, (IND):96283,
> (6156-7179):96284-97307, (IND):97308, (7180-8174):97309-98303,
> (8175-8203):159745-159773, (IND):159774, (8204-9227):159775-160798,
> (IND):160799, (9228-10251):160800-161823, (IND):161824,
> (10252-11275):161825-162848, (IND):162849, (11276-12265):162850-163839,
> (12266-12299):225281-225314, (IND):225315, (12300-13323):225316-226339,
> (IND):226340, (13324-14347):226341-227364, (IND):227365,
> (14348-15371):227366-228389, (IND):228390, (15372-16356):228391-229375,
> (16357-16395):284673-284711, (IND):284712, (16396-17419):284713-285736,
> (IND):285737, (17420-18443):285738-286761, (IND):286762,
> (18444-19467):286763-287786, (IND):287787, (19468-20491):287788-288811,
> (IND):288812, (20492-21515):288813-289836, (IND):289837,
> (21516-22539):289838-290861, (IND):290862, (22540-23563):290863-291886,
> (IND):291887, (23564-24587):291888-292911, (IND):292912,
> (24588-25611):292913-293936, (IND):293937, (25612-26585):293938-294911,
> (26586-26635):295937-295986, (IND):295987, (26636-27659):295988-297011,
> (IND):297012, (27660-28683):297013-298036, (IND):298037,
> (28684-29707):298038-299061, (IND):299062, (29708-30731):299063-300086,
> (IND):300087, (30732-31755):300088-301111, (IND):301112,
> (31756-32768):301113-302125
> TOTAL: 32802
>
> This journal file is split into 5 parts: blocks at 28679-32767,
> then 94209-98303, then 159745-163839, then 225281-229375 and finally
> 284673-302125.
>
> Of course "-G512" on the mkfs command line is an extreme case, but it
> clearly shows the fragmentation.
>
> I tried to find out whether this fragmentation has any performance
> impact, so I quickly wrote the following patch for the mkfs program:
>
> Index: e2fsprogs/lib/ext2fs/mkjournal.c
> ===================================================================
> --- e2fsprogs.orig/lib/ext2fs/mkjournal.c 2008-08-27 02:37:59.000000000 +0200
> +++ e2fsprogs/lib/ext2fs/mkjournal.c 2008-08-27 14:51:02.000000000 +0200
> @@ -220,7 +220,11 @@ static int mkjournal_proc(ext2_filsys fs
> last_blk = *blocknr;
> return 0;
> }
> - retval = ext2fs_new_block(fs, last_blk, 0, &new_blk);
> + retval = ext2fs_get_free_blocks(fs, ref_block,
> + fs->super->s_blocks_count,
> + es->num_blocks, fs->block_map,
> + &new_blk);
> +
> if (retval) {
> es->err = retval;
> return BLOCK_ABORT;
>
> This makes the mkfs time a bit longer but ends up with an unfragmented
> journal file: debugfs "stat <8>" reports that the journal file uses
> contiguous blocks from 295937 to 328738.
The problem with this approach is that mkfs will take longer still as
you make -G xxx larger, since ext2fs_get_free_blocks() is not very
smart at finding a large number of contiguous blocks. If I understand
this correctly, the main problem we have here is that we start the new
block search from block 0. A better approach would be to start
ext2fs_new_block() from the last block of the last inode table in a
flex_bg. This way we avoid the fragmentation issues we see when the
inode tables for a flex_bg are larger than the capacity of a single
block group.
> Then I launched bonnie++ to test the performance impact. This is my
> test script:
>
> mkfs.ext4 -t ext4dev -G512 /dev/sdb1
> mount -t ext4dev -o data=journal /dev/sdb1 /mnt/test
> bonnie++ -u root -s 0 -n 4000 -d /mnt/test/
>
> And the results:
>
> Without patch :
>
> Version 1.03d ------Sequential Create------ --------Random Create--------
> -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 4000 3978 7 602 0 518 1 3962 8 520 0 326 1
>
> With patch :
>
> Version 1.03d ------Sequential Create------ --------Random Create--------
> -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
> files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
> 4000 4180 8 736 1 543 1 4029 8 556 0 335 1
>
> Difference :
>
> +5.0% +22% +4.8% +1.6% +6.9% +2.7%
>
> Conclusion :
>
> First, the largest performance improvements are on read operations,
> which, if I am not wrong, have nothing to do with the journal file.
> This is surprising and may indicate that those results are wrong, but
> I can't see why right now.
> Second, there is a slight improvement on write operations, so the
> journal file defragmentation seems to have a positive impact in this
> test.
>
> I'm still bothered by the performance increase on reads, so I will
> launch some more tests and see whether it is consistent.
>
> Please, feel free to give me any comments you may have on this subject.
>
> Thanks.
>
> Frederic
-JRS
On Wed, Aug 27, 2008 at 07:36:07PM +0200, Frédéric Bohé wrote:
> While playing with filesystems using flex_bg, I noticed that the journal
> file may be fragmented when there is a lot of metadata in the first
> flex group.
> For example, with this command: mkfs.ext4 -t ext4dev -G512 /dev/sdb1
> The journal file is reported by "stat <8>" in debugfs to be like this:
Yeah, we really want to put the journal in the middle of the
filesystem; that not only avoids the metadata at the very beginning of
the filesystem (especially when flex_bg is enabled), but also
eliminates the worst-case seek times when the file data is at the end
of the disk, the journal is at the beginning of the disk, and we are
using a very fsync-intensive workload.
With the following patches the journal inode now looks like this:
Inode: 8 Type: regular Mode: 0600 Flags: 0x80000
Generation: 0 Version: 0x00000000
User: 0 Group: 0 Size: 134217728
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 262144
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x48b5b982 -- Wed Aug 27 16:30:58 2008
atime: 0x00000000 -- Wed Dec 31 19:00:00 1969
mtime: 0x48b5b982 -- Wed Aug 27 16:30:58 2008
Size of extra inode fields: 0
BLOCKS:
(0-32767):2588672-2621439
TOTAL: 32768
This also creates the journal using extents, which eliminates the
indirect block overhead, and means that the 128MB journal conveniently
takes up a single block group:
Group 79: (Blocks 2588672-2621439) [INODE_UNINIT, ITABLE_ZEROED]
Checksum 0x441d, unused inodes 8192
Block bitmap at 2097167, Inode bitmap at 2097183
Inode table at 2104864-2105375
0 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Free blocks:
Free inodes:
- Ted
This is needed so that extent-based inodes (including a journal inode)
can be created via block_iterate.
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
lib/ext2fs/block.c | 28 ++++++++++++++++++++++++----
1 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/lib/ext2fs/block.c b/lib/ext2fs/block.c
index 14be1ba..2fc3c4a 100644
--- a/lib/ext2fs/block.c
+++ b/lib/ext2fs/block.c
@@ -361,7 +361,7 @@ errcode_t ext2fs_block_iterate2(ext2_filsys fs,
if (inode.i_flags & EXT4_EXTENTS_FL) {
ext2_extent_handle_t handle;
struct ext2fs_extent extent;
- e2_blkcnt_t blockcnt;
+ e2_blkcnt_t blockcnt = 0;
blk_t blk, new_blk;
int op = EXT2_EXTENT_ROOT;
unsigned int j;
@@ -373,9 +373,29 @@ errcode_t ext2fs_block_iterate2(ext2_filsys fs,
while (1) {
ctx.errcode = ext2fs_extent_get(handle, op, &extent);
if (ctx.errcode) {
- if (ctx.errcode == EXT2_ET_EXTENT_NO_NEXT)
- ctx.errcode = 0;
- break;
+ if (ctx.errcode != EXT2_ET_EXTENT_NO_NEXT)
+ break;
+ ctx.errcode = 0;
+ if (!(flags & BLOCK_FLAG_APPEND))
+ break;
+ blk = 0;
+ r = (*ctx.func)(fs, &blk, blockcnt,
+ 0, 0, priv_data);
+ ret |= r;
+ check_for_ro_violation_goto(&ctx, ret,
+ extent_errout);
+ if (r & BLOCK_CHANGED) {
+ ctx.errcode =
+ ext2fs_extent_set_bmap(handle,
+ (blk64_t) blockcnt++,
+ (blk64_t) blk, 0);
+ if (ctx.errcode)
+ goto errout;
+ continue;
+ } else {
+ ext2fs_extent_free(handle);
+ goto errout;
+ }
}
op = EXT2_EXTENT_NEXT;
--
1.5.6.1.205.ge2c7.dirty
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
lib/ext2fs/mkjournal.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
index cd3df07..c112be9 100644
--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -298,6 +298,12 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
es.err = 0;
es.zero_count = 0;
+ if (fs->super->s_feature_incompat & EXT3_FEATURE_INCOMPAT_EXTENTS) {
+ inode.i_flags |= EXT4_EXTENTS_FL;
+ if ((retval = ext2fs_write_inode(fs, journal_ino, &inode)))
+ return retval;
+ }
+
/*
* Set the initial goal block to be roughly at the middle of
* the filesystem. Pick a group that has the largest number
--
1.5.6.1.205.ge2c7.dirty
This speeds up access to the journal by eliminating worst-case seeks
from one end of the disk to the other, which can be quite common in
very fsync-intensive workloads if the file is located near the end of
the disk and the journal is located at the beginning of the disk.
In addition, this can help eliminate journal fragmentation when
flex_bg is enabled, since the first block group has a large amount of
metadata.
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
lib/ext2fs/mkjournal.c | 28 +++++++++++++++++++++++-----
1 files changed, 23 insertions(+), 5 deletions(-)
diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
index 3cae874..cd3df07 100644
--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -198,6 +198,7 @@ errcode_t ext2fs_zero_blocks(ext2_filsys fs, blk_t blk, int num,
struct mkjournal_struct {
int num_blocks;
int newblocks;
+ blk_t goal;
blk_t blk_to_zero;
int zero_count;
char *buf;
@@ -213,14 +214,13 @@ static int mkjournal_proc(ext2_filsys fs,
{
struct mkjournal_struct *es = (struct mkjournal_struct *) priv_data;
blk_t new_blk;
- static blk_t last_blk = 0;
errcode_t retval;
if (*blocknr) {
- last_blk = *blocknr;
+ es->goal = *blocknr;
return 0;
}
- retval = ext2fs_new_block(fs, last_blk, 0, &new_blk);
+ retval = ext2fs_new_block(fs, es->goal, 0, &new_blk);
if (retval) {
es->err = retval;
return BLOCK_ABORT;
@@ -258,8 +258,7 @@ static int mkjournal_proc(ext2_filsys fs,
es->err = retval;
return BLOCK_ABORT;
}
- *blocknr = new_blk;
- last_blk = new_blk;
+ *blocknr = es->goal = new_blk;
ext2fs_block_alloc_stats(fs, new_blk, +1);
if (es->num_blocks == 0)
@@ -276,6 +275,7 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
blk_t size, int flags)
{
char *buf;
+ dgrp_t group, start, end, i;
errcode_t retval;
struct ext2_inode inode;
struct mkjournal_struct es;
@@ -298,6 +298,24 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
es.err = 0;
es.zero_count = 0;
+ /*
+ * Set the initial goal block to be roughly at the middle of
+ * the filesystem. Pick a group that has the largest number
+ * of free blocks.
+ */
+ group = ext2fs_group_of_blk(fs, (fs->super->s_blocks_count -
+ fs->super->s_first_data_block) / 2);
+ start = (group > 0) ? group-1 : group;
+ end = ((group+1) < fs->group_desc_count) ? group+1 : group;
+ group = start;
+ for (i=start+1; i <= end; i++)
+ if (fs->group_desc[i].bg_free_blocks_count >
+ fs->group_desc[group].bg_free_blocks_count)
+ group = i;
+
+ es.goal = (fs->super->s_blocks_per_group * group) +
+ fs->super->s_first_data_block;
+
retval = ext2fs_block_iterate2(fs, journal_ino, BLOCK_FLAG_APPEND,
0, mkjournal_proc, &es);
if (es.err) {
--
1.5.6.1.205.ge2c7.dirty
Addresses-Sourceforge-Bug: 1483791
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
lib/ext2fs/mkjournal.c | 2 +-
tests/f_badjour_indblks/expect.1 | 2 +-
tests/f_badjour_indblks/expect.2 | 2 +-
tests/f_badjourblks/expect.1 | 2 +-
tests/f_badjourblks/expect.2 | 2 +-
tests/f_miss_journal/expect.1 | 2 +-
tests/f_miss_journal/expect.2 | 2 +-
7 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
index e55dcbd..3cae874 100644
--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -225,7 +225,7 @@ static int mkjournal_proc(ext2_filsys fs,
es->err = retval;
return BLOCK_ABORT;
}
- if (blockcnt > 0)
+ if (blockcnt >= 0)
es->num_blocks--;
es->newblocks++;
diff --git a/tests/f_badjour_indblks/expect.1 b/tests/f_badjour_indblks/expect.1
index 0190bf2..d80da89 100644
--- a/tests/f_badjour_indblks/expect.1
+++ b/tests/f_badjour_indblks/expect.1
@@ -29,5 +29,5 @@ Creating journal (1024 blocks): Done.
*** journal has been re-created - filesystem is now ext3 again ***
test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/256 files (0.0% non-contiguous), 1112/8192 blocks
+test_filesys: 11/256 files (0.0% non-contiguous), 1111/8192 blocks
Exit status is 1
diff --git a/tests/f_badjour_indblks/expect.2 b/tests/f_badjour_indblks/expect.2
index 35365fa..3fbb8b3 100644
--- a/tests/f_badjour_indblks/expect.2
+++ b/tests/f_badjour_indblks/expect.2
@@ -3,5 +3,5 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/256 files (0.0% non-contiguous), 1112/8192 blocks
+test_filesys: 11/256 files (0.0% non-contiguous), 1111/8192 blocks
Exit status is 0
diff --git a/tests/f_badjourblks/expect.1 b/tests/f_badjourblks/expect.1
index 5a0bfef..cd86fc4 100644
--- a/tests/f_badjourblks/expect.1
+++ b/tests/f_badjourblks/expect.1
@@ -27,5 +27,5 @@ Creating journal (1024 blocks): Done.
*** journal has been re-created - filesystem is now ext3 again ***
test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/256 files (0.0% non-contiguous), 1080/8192 blocks
+test_filesys: 11/256 files (0.0% non-contiguous), 1079/8192 blocks
Exit status is 1
diff --git a/tests/f_badjourblks/expect.2 b/tests/f_badjourblks/expect.2
index 632dc71..7c50703 100644
--- a/tests/f_badjourblks/expect.2
+++ b/tests/f_badjourblks/expect.2
@@ -3,5 +3,5 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/256 files (0.0% non-contiguous), 1080/8192 blocks
+test_filesys: 11/256 files (0.0% non-contiguous), 1079/8192 blocks
Exit status is 0
diff --git a/tests/f_miss_journal/expect.1 b/tests/f_miss_journal/expect.1
index cad69f6..7d696f8 100644
--- a/tests/f_miss_journal/expect.1
+++ b/tests/f_miss_journal/expect.1
@@ -25,5 +25,5 @@ Creating journal (1024 blocks): Done.
*** journal has been re-created - filesystem is now ext3 again ***
test_filesys: ***** FILE SYSTEM WAS MODIFIED *****
-test_filesys: 11/256 files (0.0% non-contiguous), 1080/2048 blocks
+test_filesys: 11/256 files (0.0% non-contiguous), 1079/2048 blocks
Exit status is 1
diff --git a/tests/f_miss_journal/expect.2 b/tests/f_miss_journal/expect.2
index 1e8c47f..ad32763 100644
--- a/tests/f_miss_journal/expect.2
+++ b/tests/f_miss_journal/expect.2
@@ -3,5 +3,5 @@ Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
-test_filesys: 11/256 files (0.0% non-contiguous), 1080/2048 blocks
+test_filesys: 11/256 files (0.0% non-contiguous), 1079/2048 blocks
Exit status is 0
--
1.5.6.1.205.ge2c7.dirty
On Thu, Aug 28, 2008 at 11:55:21AM +0200, Frédéric Bohé wrote:
> With 512 groups per flex group, the metadata for a single flex group
> spans 8 groups! If we are unlucky and there is a bunch of groups
> occupied by metadata in the middle of the filesystem, we should
> slightly increase the number of groups scanned to find a completely
> free group.
I'm not sure it ever makes sense to use such a huge -G setting, but
yes, you're right. It actually wasn't a major tragedy, since this
just specifies the goal block, and so the block allocator would just
search forward to find the first free block. But it is better to move
forward to the next free block group, so we leave space for interior
nodes of the extent tree to be allocated.
The following patch takes the flex_bg size into account and will
stash the journal in the first free block group after the metadata;
we do this by starting at a flex_bg boundary and then searching
forward until bg_free_blocks_count is non-zero. However, if the
number of block groups is less than half of the flex_bg size, we'll
just give up and throw it at the mid-point of the filesystem, since
that (plus using extents instead of indirect blocks) is really the
major optimization here.
One or two discontinuities in the journal file really aren't a big
deal, since we're normally seeking back and forth between the rest of
the filesystem's data blocks and the journal anyway. The best
benchmark to see a problem isn't going to be bonnie, but something
that is extremely fsync-intensive.
- Ted
diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
index 96b574e..f5a9dba 100644
--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c
@@ -275,7 +275,7 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
blk_t size, int flags)
{
char *buf;
- dgrp_t group, start, end, i;
+ dgrp_t group, start, end, i, log_flex;
errcode_t retval;
struct ext2_inode inode;
struct mkjournal_struct es;
@@ -311,7 +311,17 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
*/
group = ext2fs_group_of_blk(fs, (fs->super->s_blocks_count -
fs->super->s_first_data_block) / 2);
- start = (group > 0) ? group-1 : group;
+ log_flex = 1 << fs->super->s_log_groups_per_flex;
+ if (fs->super->s_log_groups_per_flex && (group > log_flex)) {
+ group = group & ~(log_flex - 1);
+ while ((group < fs->group_desc_count) &&
+ fs->group_desc[group].bg_free_blocks_count == 0)
+ group++;
+ if (group == fs->group_desc_count)
+ group = 0;
+ start = group;
+ } else
+ start = (group > 0) ? group-1 : group;
end = ((group+1) < fs->group_desc_count) ? group+1 : group;
group = start;
for (i=start+1; i <= end; i++)
Theodore Tso wrote:
> On Thu, Aug 28, 2008 at 11:55:21AM +0200, Frédéric Bohé wrote:
>
>> With 512 groups per flex group, the metadata for a single flex group
>> spans 8 groups! If we are unlucky and there is a bunch of groups
>> occupied by metadata in the middle of the filesystem, we should
>> slightly increase the number of groups scanned to find a completely
>> free group.
>>
>
> I'm not sure it ever makes sense to use such a huge -G setting, but
> yes, you're right. It actually wasn't a major tragedy, since this
> just specifies the goal block, and so the block allocator would just
> search forward to find the first free block. But it is better to move
> forward to the next free block group, so we leave space for interior
> nodes of the extent tree to be allocated.
>
> The following patch takes the flex_bg size into account and will
> stash the journal in the first free block group after the metadata;
> we do this by starting at a flex_bg boundary and then searching
> forward until bg_free_blocks_count is non-zero. However, if the
> number of block groups is less than half of the flex_bg size, we'll
> just give up and throw it at the mid-point of the filesystem, since
> that (plus using extents instead of indirect blocks) is really the
> major optimization here.
>
> One or two discontinuities in the journal file really aren't a big
> deal, since we're normally seeking back and forth between the rest of
> the filesystem's data blocks and the journal anyway. The best
> benchmark to see a problem isn't going to be bonnie, but something
> that is extremely fsync-intensive.
>
> - Ted
>
I can try and test this with my fsync() heavy fs_mark run...
Ric
> diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
> index 96b574e..f5a9dba 100644
> --- a/lib/ext2fs/mkjournal.c
> +++ b/lib/ext2fs/mkjournal.c
> @@ -275,7 +275,7 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
> blk_t size, int flags)
> {
> char *buf;
> - dgrp_t group, start, end, i;
> + dgrp_t group, start, end, i, log_flex;
> errcode_t retval;
> struct ext2_inode inode;
> struct mkjournal_struct es;
> @@ -311,7 +311,17 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
> */
> group = ext2fs_group_of_blk(fs, (fs->super->s_blocks_count -
> fs->super->s_first_data_block) / 2);
> - start = (group > 0) ? group-1 : group;
> + log_flex = 1 << fs->super->s_log_groups_per_flex;
> + if (fs->super->s_log_groups_per_flex && (group > log_flex)) {
> + group = group & ~(log_flex - 1);
> + while ((group < fs->group_desc_count) &&
> + fs->group_desc[group].bg_free_blocks_count == 0)
> + group++;
> + if (group == fs->group_desc_count)
> + group = 0;
> + start = group;
> + } else
> + start = (group > 0) ? group-1 : group;
> end = ((group+1) < fs->group_desc_count) ? group+1 : group;
> group = start;
> for (i=start+1; i <= end; i++)
>
On Thursday, August 28, 2008 at 09:34 -0400, Theodore Tso wrote:
> On Thu, Aug 28, 2008 at 11:55:21AM +0200, Frédéric Bohé wrote:
> > With 512 groups per flex group, the metadata for a single flex group
> > spans 8 groups! If we are unlucky and there is a bunch of groups
> > occupied by metadata in the middle of the filesystem, we should
> > slightly increase the number of groups scanned to find a completely
> > free group.
>
> I'm not sure it ever makes sense to use such a huge -G setting, but
> yes, you're right. It actually wasn't a major tragedy, since this
> just specifies the goal block, and so the block allocator would just
> search forward to find the first free block. But it is better to move
> forward to the next free block group, so we leave space for interior
> nodes of the extent tree to be allocated.
>
> The following patch takes the flex_bg size into account and will
> stash the journal in the first free block group after the metadata;
> we do this by starting at a flex_bg boundary and then searching
> forward until bg_free_blocks_count is non-zero. However, if the
> number of block groups is less than half of the flex_bg size, we'll
> just give up and throw it at the mid-point of the filesystem, since
> that (plus using extents instead of indirect blocks) is really the
> major optimization here.
>
> One or two discontinuities in the journal file really aren't a big
> deal, since we're normally seeking back and forth between the rest of
> the filesystem's data blocks and the journal anyway. The best
> benchmark to see a problem isn't going to be bonnie, but something
> that is extremely fsync-intensive.
>
> - Ted
>
> diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
> index 96b574e..f5a9dba 100644
> --- a/lib/ext2fs/mkjournal.c
> +++ b/lib/ext2fs/mkjournal.c
> @@ -275,7 +275,7 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
> blk_t size, int flags)
> {
> char *buf;
> - dgrp_t group, start, end, i;
> + dgrp_t group, start, end, i, log_flex;
> errcode_t retval;
> struct ext2_inode inode;
> struct mkjournal_struct es;
> @@ -311,7 +311,17 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
> */
> group = ext2fs_group_of_blk(fs, (fs->super->s_blocks_count -
> fs->super->s_first_data_block) / 2);
> - start = (group > 0) ? group-1 : group;
> + log_flex = 1 << fs->super->s_log_groups_per_flex;
> + if (fs->super->s_log_groups_per_flex && (group > log_flex)) {
> + group = group & ~(log_flex - 1);
> + while ((group < fs->group_desc_count) &&
> + fs->group_desc[group].bg_free_blocks_count == 0)
I would have preferred this to test whether the group is completely
free:
fs->group_desc[group].bg_free_blocks_count !=
fs->super->s_blocks_per_group)
That's because there can be "holes" with free blocks in the flex_bg
metadata when it crosses backup superblocks and GDTs. A very rare case
and not a big issue, I admit.
> + group++;
> + if (group == fs->group_desc_count)
> + group = 0;
> + start = group;
> + } else
> + start = (group > 0) ? group-1 : group;
> end = ((group+1) < fs->group_desc_count) ? group+1 : group;
> group = start;
> for (i=start+1; i <= end; i++)
>
>
On Thu, Aug 28, 2008 at 09:40:30AM -0400, Ric Wheeler wrote:
>
> I can try and test this with my fsync() heavy fs_mark run...
>
Given [1] I have no doubt that you should see a difference between
mke2fs from e2fsprogs 1.41.0 (which will allocate the journal at the
beginning of the filesystem) and the latest tip of e2fsprogs.
It would be interesting to see how measurable the difference is,
though. I'd recommend doing "mke2fs -t ext4dev" using both versions
of mke2fs, and seeing how much difference it makes.
[1] http://www.usenix.org/events/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html#fig-journal-location-withinfs-fix
- Ted
Theodore Tso wrote:
> On Thu, Aug 28, 2008 at 09:40:30AM -0400, Ric Wheeler wrote:
>
>> I can try and test this with my fsync() heavy fs_mark run...
>>
>>
>
> Given [1] I have no doubt that you should see a difference between
> mke2fs from e2fsprogs 1.41.0 (which will allocate the journal at the
> beginning of the filesystem) and the latest tip of e2fsprogs.
>
> It would be interesting to see how measurable the difference is,
> though. I'd recommend doing "mke2fs -t ext4dev" using both versions
> of mke2fs, and seeing how much difference it makes.
>
> [1] http://www.usenix.org/events/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html#fig-journal-location-withinfs-fix
>
> - Ted
>
>
I was thinking of trying this on a much larger disk (1 TB); the
article reports on a 4 GB partition, which is pretty tiny.
Ric
On Wednesday, August 27, 2008 at 17:14 -0400, Theodore Ts'o wrote:
> This speeds up access to the journal by eliminating worst-case seeks
> from one end of the disk to the other, which can be quite common in
> very fsync-intensive workloads if the file is located near the end of
> the disk and the journal is located at the beginning of the disk.
>
> In addition, this can help eliminate journal fragmentation when
> flex_bg is enabled, since the first block group has a large amount of
> metadata.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
> ---
> lib/ext2fs/mkjournal.c | 28 +++++++++++++++++++++++-----
> 1 files changed, 23 insertions(+), 5 deletions(-)
>
> diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
> index 3cae874..cd3df07 100644
> --- a/lib/ext2fs/mkjournal.c
> +++ b/lib/ext2fs/mkjournal.c
> @@ -198,6 +198,7 @@ errcode_t ext2fs_zero_blocks(ext2_filsys fs, blk_t blk, int num,
> struct mkjournal_struct {
> int num_blocks;
> int newblocks;
> + blk_t goal;
> blk_t blk_to_zero;
> int zero_count;
> char *buf;
> @@ -213,14 +214,13 @@ static int mkjournal_proc(ext2_filsys fs,
> {
> struct mkjournal_struct *es = (struct mkjournal_struct *) priv_data;
> blk_t new_blk;
> - static blk_t last_blk = 0;
> errcode_t retval;
>
> if (*blocknr) {
> - last_blk = *blocknr;
> + es->goal = *blocknr;
> return 0;
> }
> - retval = ext2fs_new_block(fs, last_blk, 0, &new_blk);
> + retval = ext2fs_new_block(fs, es->goal, 0, &new_blk);
> if (retval) {
> es->err = retval;
> return BLOCK_ABORT;
> @@ -258,8 +258,7 @@ static int mkjournal_proc(ext2_filsys fs,
> es->err = retval;
> return BLOCK_ABORT;
> }
> - *blocknr = new_blk;
> - last_blk = new_blk;
> + *blocknr = es->goal = new_blk;
> ext2fs_block_alloc_stats(fs, new_blk, +1);
>
> if (es->num_blocks == 0)
> @@ -276,6 +275,7 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
> blk_t size, int flags)
> {
> char *buf;
> + dgrp_t group, start, end, i;
> errcode_t retval;
> struct ext2_inode inode;
> struct mkjournal_struct es;
> @@ -298,6 +298,24 @@ static errcode_t write_journal_inode(ext2_filsys fs, ext2_ino_t journal_ino,
> es.err = 0;
> es.zero_count = 0;
>
> + /*
> + * Set the initial goal block to be roughly at the middle of
> + * the filesystem. Pick a group that has the largest number
> + * of free blocks.
> + */
> + group = ext2fs_group_of_blk(fs, (fs->super->s_blocks_count -
> + fs->super->s_first_data_block) / 2);
> + start = (group > 0) ? group-1 : group;
> + end = ((group+1) < fs->group_desc_count) ? group+1 : group;
With 512 groups per flex group, the metadata for a single flex group
spans 8 groups! If we are unlucky and a bunch of groups in the middle
of the filesystem are occupied by metadata, we should slightly increase
the number of groups scanned in order to find a completely free group.
> + group = start;
> + for (i=start+1; i <= end; i++)
> + if (fs->group_desc[i].bg_free_blocks_count >
> + fs->group_desc[group].bg_free_blocks_count)
> + group = i;
This is fine if the journal file has the default size, but it is not
optimal if the journal is bigger than one group.
> +
> + es.goal = (fs->super->s_blocks_per_group * group) +
> + fs->super->s_first_data_block;
> +
> retval = ext2fs_block_iterate2(fs, journal_ino, BLOCK_FLAG_APPEND,
> 0, mkjournal_proc, &es);
> if (es.err) {
Ric Wheeler wrote:
> Theodore Tso wrote:
>> On Thu, Aug 28, 2008 at 09:40:30AM -0400, Ric Wheeler wrote:
>>
>>> I can try and test this with my fsync() heavy fs_mark run...
>>>
>>>
>>
>> Given [1] I have no doubt that you should see a difference between
>> mke2fs from e2fsprogs 1.41.0 (which will allocate the journal at the
>> beginning of the filesystem) and the latest tip of e2fsprogs.
>>
>> It would be interesting to see how measurable the difference is,
>> though. I'd recommend doing "mke2fs -t ext4dev" using both versions
>> of mke2fs, and seeing how much difference it makes.
>>
>> [1]
>> http://www.usenix.org/events/usenix05/tech/general/full_papers/prabhakaran/prabhakaran_html/main.html#fig-journal-location-withinfs-fix
>>
>>
>> - Ted
>>
>>
>
> I was thinking of trying this on a much larger disk (1TB) - the
> article reports on a 4GB partition, which is pretty tiny.
>
> ric
Using Ted's new journal-in-the-middle mkfs, and hacking it slightly to
put the journal back at block 0, I ran some quick tests to try to
measure the impact of journal placement (multiple threads writing 4MB
files).
The clearest result is that delayed allocation is a big win. Journal
placement mostly has a positive impact on write performance, but the
two runs are quite close.
Starting with a newly created file system, each pass put down 16,000 4MB
files:
Count     Files/sec - ZERO    Files/sec - Middle
 16000          20.8                20.8
 32000          19.2                18.4
 48000          16.0                14.4
 64000          20.8                20.8
 80000          20.8                20.8
 96000          20.8                20.8
112000          20.8                20.8
128000          19.2                20.8
144000          19.2                19.2
160000          17.6                19.2
176000          16.6                17.6
192000          16.0                16.0
208000          14.4                14.4
224000          12.8                14.4
With no delayed allocation:
Count     Files/sec - ZERO    Files/sec - Middle
 16000          16.0                16.0
 32000          16.0                16.0
 48000          16.0                16.0
 64000          16.0                14.8
 80000          14.4                14.4
 96000          14.1                14.4
112000          12.8                12.8
128000          11.2                11.2
144000          11.2                12.8
160000          14.4                14.4
176000          11.3                12.8
192000          14.9                12.8
208000          16.0                16.0
224000          14.4                16.0
ric