2009-04-23 16:41:56

by Curt Wohlgemuth

Subject: Question on block group allocation

Hi:

I'm seeing a performance problem on ext4 vs ext2, and in trying to
narrow it down, I've got a question about block allocation in ext4
that I'm having trouble figuring out.

The test in question just does random reads of several rather large
files (4.5GB and 10GB) in a single thread. All files are created in
the top-level directory. Looking into the block layout for the
various files, I'm struck by the wide separation of the extents in
some of the files.

As a simple example, I formatted/mounted a new ext4 partition with
default parameters (with the exception of "-O ^has_journal", but this
shouldn't make a difference); the FS has 5585 block groups of 4K
blocks.

Using dd, I created (in this order) two 4GB files and a 10GB file in
the mount directory.

The extent blocks are reasonably close together for the two 4GB files,
but the extents for the 10GB file show a huge gap, which seems to hurt
the random read performance pretty substantially. Here's the output
from debugfs:

BLOCKS:
(IND):8396832, (0-106495):8282112-8388607,
(106496-399359):11241472-11534335, (399360-888831):20482048-20971519,
(888832-1116159):23889920-24117247, (1116160-1277951):71665664-71827455,
(1277952-1767423):78678016-79167487,
(1767424-2125823):102402048-102760447,
(2125824-2148351):102768672-102791199,
(2148352-2621439):102793216-103266303
TOTAL: 2621441

Note the gap between blocks 79167487 and 102402048. I was lucky
enough to capture the mb_history from this 10GB create:

29109 14 735/30720/32758@1114112 735/30720/2048@1114112
735/30720/2048@1114112 1 0 0 1568 M 0 0
29109 14 736/0/32758@1116160 736/0/2048@1116160
2187/2048/2048@1116160 1 1 0 1568 0 0
29109 14 2187/4096/32758@1118208 2187/4096/2048@1118208
2187/4096/2048@1118208 1 0 0 1568 M 2048 4096

I've been staring at ext4_mb_regular_allocator() trying to understand
why an allocation with a goal block of 736 ends up with a best found
extent group of 2187, and I'm stuck -- at least without a lot of
printk messages. It seems to me that we just cycle through the block
groups starting with the goal group until we find a group that fits.
Again, according to dumpe2fs, block groups 737, 738, 739, ... all have
32768 free blocks. So why we end up with a best fit group of 2187 is
a mystery to me.

Can anybody give me an insight into what's happening here?

Thanks,
Curt


2009-04-23 19:08:38

by Andreas Dilger

Subject: Re: Question on block group allocation

On Apr 23, 2009 09:41 -0700, Curt Wohlgemuth wrote:
> I'm seeing a performance problem on ext4 vs ext2, and in trying to
> narrow it down, I've got a question about block allocation in ext4
> that I'm having trouble figuring out.
>
> Using dd, I created (in this order) two 4GB files and a 10GB file in
> the mount directory.
>
> The extent blocks are reasonably close together for the two 4GB files,
> but the extents for the 10GB file show a huge gap, which seems to hurt
> the random read performance pretty substantially. Here's the output
> from debugfs:
>
> BLOCKS:
> (IND):8396832, (0-106495):8282112-8388607,
> (106496-399359):11241472-11534335, (399360-888831):20482048-20971519,
> (888832-1116159):23889920-24117247, (1116160-1277951):71665664-71827455,
> (1277952-1767423):78678016-79167487,
> (1767424-2125823):102402048-102760447,
> (2125824-2148351):102768672-102791199,
> (2148352-2621439):102793216-103266303
> TOTAL: 2621441
>
> Note the gap between blocks 79167487 and 102402048.

Well, there are other even larger gaps for other chunks of the file.

> I was lucky enough to capture the mb_history from this 10GB create:
>
> 29109 14 735/30720/32758@1114112 735/30720/2048@1114112
> 735/30720/2048@1114112 1 0 0 1568 M 0 0
> 29109 14 736/0/32758@1116160 736/0/2048@1116160
> 2187/2048/2048@1116160 1 1 0 1568 0 0
> 29109 14 2187/4096/32758@1118208 2187/4096/2048@1118208
> 2187/4096/2048@1118208 1 0 0 1568 M 2048 4096
>
> I've been staring at ext4_mb_regular_allocator() trying to understand
> why an allocation with a goal block of 736 ends up with a best found
> extent group of 2187, and I'm stuck -- at least without a lot of
> printk messages. It seems to me that we just cycle through the block
> groups starting with the goal group until we find a group that fits.
> Again, according to dumpe2fs, block groups 737, 738, 739, ... all have
> 32768 free blocks. So why we end up with a best fit group of 2187 is
> a mystery to me.

This is likely the "uninit_bg" feature that is causing the allocations
to skip groups which are marked BLOCK_UNINIT. In some sense the benefit
of skipping the block bitmap read during e2fsck probably does not
outweigh the cost of the extra seeking during IO. As the filesystem
gets more full, the BLOCK_UNINIT flags would be cleared anyway, so we
might as well just keep the early allocations contiguous.
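
To make the effect concrete, here is a small stand-alone model of the
group scan (plain user-space C, not the kernel code; the group count
and which groups have initialized bitmaps are invented to mirror the
numbers in this thread):

/* Toy model of the mballoc group scan: on the cheapest pass (cr == 0)
 * groups still flagged BLOCK_UNINIT are skipped, so the scan can walk
 * right past a goal group that dumpe2fs reports as completely free. */
#include <stdio.h>

#define NGROUPS      5585
#define BLOCK_UNINIT 0x01

static int group_flags(int g)
{
        /* Made-up layout: groups 736..2186 still have uninitialized
         * bitmaps, group 2187 has already been touched. */
        return (g >= 736 && g < 2187) ? BLOCK_UNINIT : 0;
}

static int good_group(int g, int cr)
{
        if (cr == 0 && (group_flags(g) & BLOCK_UNINIT))
                return 0;       /* skipped on the first pass */
        return 1;               /* otherwise assume it has room */
}

int main(void)
{
        int goal = 736;

        for (int cr = 0; cr < 4; cr++)
                for (int i = 0; i < NGROUPS; i++) {
                        int g = (goal + i) % NGROUPS;
                        if (good_group(g, cr)) {
                                printf("cr=%d: allocate from group %d\n",
                                       cr, g);
                                return 0;
                        }
                }
        return 1;
}

With the layout above it prints "cr=0: allocate from group 2187"; drop
the BLOCK_UNINIT test and it picks group 736 instead, which is roughly
what the change below does.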

A simple change to verify this would be something like the following,
but it hasn't actually been tested.

--- ./fs/ext4/mballoc.c.uninit 2009-04-08 19:13:13.000000000 -0600
+++ ./fs/ext4/mballoc.c 2009-04-23 13:02:22.000000000 -0600
@@ -1742,10 +1723,6 @@ static int ext4_mb_good_group(struct ext
        switch (cr) {
        case 0:
                BUG_ON(ac->ac_2order == 0);
-               /* If this group is uninitialized, skip it initially */
-               desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
-               if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
-                       return 0;

                bits = ac->ac_sb->s_blocksize_bits + 1;
                for (i = ac->ac_2order; i <= bits; i++)
@@ -2039,9 +2035,7 @@ repeat:
                        ac->ac_groups_scanned++;
                        desc = ext4_get_group_desc(sb, group, NULL);
-                       if (cr == 0 || (desc->bg_flags &
-                               cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
-                               ac->ac_2order != 0))
+                       if (cr == 0)
                                ext4_mb_simple_scan_group(ac, &e4b);
                        else if (cr == 1 &&
                                        ac->ac_g_ex.fe_len == sbi->s_stripe)


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-04-23 22:02:10

by Curt Wohlgemuth

Subject: Re: Question on block group allocation

Hi Andreas:

On Thu, Apr 23, 2009 at 12:08 PM, Andreas Dilger <[email protected]> wrote:
> On Apr 23, 2009 09:41 -0700, Curt Wohlgemuth wrote:
>> I'm seeing a performance problem on ext4 vs ext2, and in trying to
>> narrow it down, I've got a question about block allocation in ext4
>> that I'm having trouble figuring out.
>>
>> Using dd, I created (in this order) two 4GB files and a 10GB file in
>> the mount directory.
>>
>> The extent blocks are reasonably close together for the two 4GB files,
>> but the extents for the 10GB file show a huge gap, which seems to hurt
>> the random read performance pretty substantially. Here's the output
>> from debugfs:
>>
>> BLOCKS:
>> (IND):8396832, (0-106495):8282112-8388607,
>> (106496-399359):11241472-11534335, (399360-888831):20482048-20971519,
>> (888832-1116159):23889920-24117247, (1116160-1277951):71665664-71827455,
>> (1277952-1767423):78678016-79167487,
>> (1767424-2125823):102402048-102760447,
>> (2125824-2148351):102768672-102791199,
>> (2148352-2621439):102793216-103266303
>> TOTAL: 2621441
>>
>> Note the gap between blocks 79167487 and 102402048.
>
> Well, there are other even larger gaps for other chunks of the file.

Really? Not that it's important, but I'm not seeing them...

>> I was lucky enough to capture the mb_history from this 10GB create:
>>
>> 29109 14 735/30720/32758@1114112 735/30720/2048@1114112
>> 735/30720/2048@1114112 1 0 0 1568 M 0 0
>> 29109 14 736/0/32758@1116160 736/0/2048@1116160
>> 2187/2048/2048@1116160 1 1 0 1568 0 0
>> 29109 14 2187/4096/32758@1118208 2187/4096/2048@1118208
>> 2187/4096/2048@1118208 1 0 0 1568 M 2048 4096
>>
>> I've been staring at ext4_mb_regular_allocator() trying to understand
>> why an allocation with a goal block of 736 ends up with a best found
>> extent group of 2187, and I'm stuck -- at least without a lot of
>> printk messages. It seems to me that we just cycle through the block
>> groups starting with the goal group until we find a group that fits.
>> Again, according to dumpe2fs, block groups 737, 738, 739, ... all have
>> 32768 free blocks. So why we end up with a best fit group of 2187 is
>> a mystery to me.
>
> This is likely the "uninit_bg" feature that is causing the allocations
> to skip groups which are marked BLOCK_UNINIT. In some sense the benefit
> of skipping the block bitmap read during e2fsck probably does not
> outweigh the cost of the extra seeking during IO. As the filesystem
> gets more full, the BLOCK_UNINIT flags would be cleared anyway, so we
> might as well just keep the early allocations contiguous.

Ah, thanks! That's what I was missing. Yes, I sort of skipped over
the "is this a good group?" question.

> A simple change to verify this would be something like the following,
> but it hasn't actually been tested.

Tell you what: I'll try this out and see if it helps out my test case.

Thanks,
Curt

>
> --- ./fs/ext4/mballoc.c.uninit    2009-04-08 19:13:13.000000000 -0600
> +++ ./fs/ext4/mballoc.c 2009-04-23 13:02:22.000000000 -0600
> @@ -1742,10 +1723,6 @@ static int ext4_mb_good_group(struct ext
>         switch (cr) {
>         case 0:
>                 BUG_ON(ac->ac_2order == 0);
> -               /* If this group is uninitialized, skip it initially */
> -               desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
> -               if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
> -                       return 0;
>
>                 bits = ac->ac_sb->s_blocksize_bits + 1;
>                 for (i = ac->ac_2order; i <= bits; i++)
> @@ -2039,9 +2035,7 @@ repeat:
>                         ac->ac_groups_scanned++;
>                         desc = ext4_get_group_desc(sb, group, NULL);
> -                       if (cr == 0 || (desc->bg_flags &
> -                               cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
> -                               ac->ac_2order != 0))
> +                       if (cr == 0)
>                                 ext4_mb_simple_scan_group(ac, &e4b);
>                         else if (cr == 1 &&
>                                         ac->ac_g_ex.fe_len == sbi->s_stripe)
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

2009-04-27 05:29:44

by Curt Wohlgemuth

Subject: Re: Question on block group allocation

Hi Ted:

I don't have access to the actual data right now, because I created
the files and ran the benchmark just before leaving for a few days,
but...

On Sun, Apr 26, 2009 at 8:14 PM, Theodore Tso <[email protected]> wrote:
> On Thu, Apr 23, 2009 at 03:02:05PM -0700, Curt Wohlgemuth wrote:
>> > This is likely the "uninit_bg" feature that is causing the allocations
>> > to skip groups which are marked BLOCK_UNINIT. In some sense the benefit
>> > of skipping the block bitmap read during e2fsck probably does not
>> > outweigh the cost of the extra seeking during IO. As the filesystem
>> > gets more full, the BLOCK_UNINIT flags would be cleared anyway, so we
>> > might as well just keep the early allocations contiguous.
>
> Well, I tried out Andreas' patch, by doing an rsync copy from my SSD
> root partition to a 5400 rpm laptop drive, and then ran e2fsck and
> dumpe2fs. The results were interesting:
>
>               Before Patch                       After Patch
>              Time in seconds                   Time in seconds
>             Real / User / Sys    MB/s        Real / User / Sys    MB/s
> Pass 1      8.52 / 2.21 / 0.46  20.43        8.84 / 4.97 / 1.11  19.68
> Pass 2     21.16 / 1.02 / 1.86  11.30        6.54 / 1.77 / 1.78  36.39
> Pass 3      0.01 / 0.00 / 0.00 139.00        0.01 / 0.01 / 0.00 128.90
> Pass 4      0.16 / 0.15 / 0.00   0.00        0.17 / 0.17 / 0.00   0.00
> Pass 5      2.52 / 1.99 / 0.09   0.79        2.31 / 1.78 / 0.06   0.86
> Total      32.40 / 5.11 / 2.49  12.81       17.99 / 8.75 / 2.98  23.01
>
> The surprise is in the gross inspection of the dumpe2fs results:
>
>                             Before Patch    After Patch
> # of non-contig files           762             779
> # of non-contig directories     571             570
> # of BLOCK_UNINIT bg's          307             293
> # of INODE_UNINIT bg's          503             503
>
> So the interesting thing is that the patch only "broke open" an
> additional 14 block groups (out of the 333 block groups in use when
> the filesystem was created with the unpatched kernel). However, this
> allowed the pass 2 directory time to go *down* by over a factor of
> three (from 21.2 seconds with the unpatched ext4 code to 6.5 seconds
> with the patch).
>
> I think what the patch did was to diminish allocation pressure on the
> first block group in the flex_bg, so we weren't mixing directory and
> regular file contents. This eliminated seeks during pass 2 of e2fsck,
> which was actually a Very Good Thing.
>
>> > A simple change to verify this would be something like the following,
>> > but it hasn't actually been tested.
>>
>> Tell you what: I'll try this out and see if it helps out my test case.
>
> Let me know what this does for your test case. Hopefully the patch
> also makes things better, since this patch is looking very interesting
> right now.

The random read throughput on the 10GB file went from ~16 MB/s to ~22
MB/s after Andreas' patch; the total fragmentation of the file was
much lower than before his patch.

However, the number of extents went up by quite a bit (I don't have
the debugfs output in front of me at the moment, sorry). It seemed
that no extent crossed a block group; I didn't have time to see if
Andreas' patch disabled flex BGs or not, as to what was going on.

I'll be able to send details out on Tuesday.

Curt

>
> Andreas, can I get a Signed-off-by from you for this patch?
>
> Thanks,
>
> - Ted
>

2009-04-27 02:14:17

by Theodore Ts'o

Subject: Re: Question on block group allocation

On Thu, Apr 23, 2009 at 03:02:05PM -0700, Curt Wohlgemuth wrote:
> > This is likely the "uninit_bg" feature that is causing the allocations
> > to skip groups which are marked BLOCK_UNINIT. In some sense the benefit
> > of skipping the block bitmap read during e2fsck probably does not
> > outweigh the cost of the extra seeking during IO. As the filesystem
> > gets more full, the BLOCK_UNINIT flags would be cleared anyway, so we
> > might as well just keep the early allocations contiguous.

Well, I tried out Andreas' patch, by doing an rsync copy from my SSD
root partition to a 5400 rpm laptop drive, and then ran e2fsck and
dumpe2fs. The results were interesting:

              Before Patch                       After Patch
             Time in seconds                   Time in seconds
            Real / User / Sys    MB/s        Real / User / Sys    MB/s
Pass 1      8.52 / 2.21 / 0.46  20.43        8.84 / 4.97 / 1.11  19.68
Pass 2     21.16 / 1.02 / 1.86  11.30        6.54 / 1.77 / 1.78  36.39
Pass 3      0.01 / 0.00 / 0.00 139.00        0.01 / 0.01 / 0.00 128.90
Pass 4      0.16 / 0.15 / 0.00   0.00        0.17 / 0.17 / 0.00   0.00
Pass 5      2.52 / 1.99 / 0.09   0.79        2.31 / 1.78 / 0.06   0.86
Total      32.40 / 5.11 / 2.49  12.81       17.99 / 8.75 / 2.98  23.01

The surprise is in the gross inspection of the dumpe2fs results:

                            Before Patch    After Patch
# of non-contig files           762             779
# of non-contig directories     571             570
# of BLOCK_UNINIT bg's          307             293
# of INODE_UNINIT bg's          503             503

So the interesting thing is that the patch only "broke open" an
additional 14 block groups (out of the 333 block groups in use when
the filesystem was created with the unpatched kernel). However, this
allowed the pass 2 directory time to go *down* by over a factor of
three (from 21.2 seconds with the unpatched ext4 code to 6.5 seconds
with the patch).

I think what the patch did was to diminish allocation pressure on the
first block group in the flex_bg, so we weren't mixing directory and
regular file contents. This eliminated seeks during pass 2 of e2fsck,
which was actually a Very Good Thing.

> > A simple change to verify this would be something like the following,
> > but it hasn't actually been tested.
>
> Tell you what: I'll try this out and see if it helps out my test case.

Let me know what this does for your test case. Hopefully the patch
also makes things better, since this patch is looking very interesting
right now.

Andreas, can I get a Signed-off-by from you for this patch?

Thanks,

- Ted

2009-04-27 10:42:29

by Theodore Ts'o

Subject: Re: Question on block group allocation

On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
>
> The random read throughput on the 10GB file went from ~16 MB/s to ~22
> MB/s after Andreas' patch; the total fragmentation of the file was
> much lower than before his patch.
>
> However, the number of extents went up by quite a bit (I don't have
> the debugfs output in front of me at the moment, sorry). It seemed
> that no extent crossed a block group; I didn't have time to see if
> Andreas' patch disabled flex BGs or not, as to what was going on.

Try running e2fsck with the "-E fragcheck" option, and then capture
e2fsck's stdout. It will help with the grunt work of the analysis by
displaying the details of all of the files which are discontiguous.

- Ted

2009-04-27 22:40:59

by Theodore Ts'o

Subject: Re: Question on block group allocation

On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
> The random read throughput on the 10GB file went from ~16 MB/s to ~22
> MB/s after Andreas' patch; the total fragmentation of the file was
> much lower than before his patch.
>
> However, the number of extents went up by quite a bit (I don't have
> the debugfs output in front of me at the moment, sorry).

I'm curious what you meant by the combination of these two statements,
"the total fragmentation of the file was much lower than before his
patch", and "the number of extents went up by quite a bit". Can you
send me the debugfs output when you have a chance?

Thanks,

- Ted

2009-04-27 23:13:13

by Andreas Dilger

Subject: Re: Question on block group allocation

On Apr 23, 2009 13:08 -0600, Andreas Dilger wrote:
> This is likely the "uninit_bg" feature that is causing the allocations
> to skip groups which are marked BLOCK_UNINIT. In some sense the benefit
> of skipping the block bitmap read during e2fsck probably does not
> outweigh the cost of the extra seeking during IO. As the filesystem
> gets more full, the BLOCK_UNINIT flags would be cleared anyway, so we
> might as well just keep the early allocations contiguous.
>
> A simple change to verify this would be something like the following,
> but it hasn't actually been tested.
>
> --- ./fs/ext4/mballoc.c.uninit 2009-04-08 19:13:13.000000000 -0600
> +++ ./fs/ext4/mballoc.c 2009-04-23 13:02:22.000000000 -0600
> @@ -1742,10 +1723,6 @@ static int ext4_mb_good_group(struct ext
>         switch (cr) {
>         case 0:
>                 BUG_ON(ac->ac_2order == 0);
> -               /* If this group is uninitialized, skip it initially */
> -               desc = ext4_get_group_desc(ac->ac_sb, group, NULL);
> -               if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
> -                       return 0;
>
>                 bits = ac->ac_sb->s_blocksize_bits + 1;
>                 for (i = ac->ac_2order; i <= bits; i++)
> @@ -2039,9 +2035,7 @@ repeat:
>                         ac->ac_groups_scanned++;
>                         desc = ext4_get_group_desc(sb, group, NULL);
> -                       if (cr == 0 || (desc->bg_flags &
> -                               cpu_to_le16(EXT4_BG_BLOCK_UNINIT) &&
> -                               ac->ac_2order != 0))
> +                       if (cr == 0)
>                                 ext4_mb_simple_scan_group(ac, &e4b);
>                         else if (cr == 1 &&
>                                         ac->ac_g_ex.fe_len == sbi->s_stripe)

Because this is actually proving to be useful:

Signed-off-by: Andreas Dilger <[email protected]>

As we discussed in the call, I suspect BLOCK_UNINIT was more useful in the
past when directories were spread over all groups evenly (pre-Orlov), and
before flex_bg where seeking to read all of the bitmaps was a slow and
painful process. For flex_bg it could be WORSE to skip bitmap reads because
instead of doing contiguous 64kB reads it may now be doing a 4kB read, seek,
4kB read, seek, etc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2009-04-29 18:38:54

by Curt Wohlgemuth

Subject: Re: Question on block group allocation

Hi Ted:

On Mon, Apr 27, 2009 at 3:40 PM, Theodore Tso <[email protected]> wrote:
> On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
>> The random read throughput on the 10GB file went from ~16 MB/s to ~22
>> MB/s after Andreas' patch; the total fragmentation of the file was
>> much lower than before his patch.
>>
>> However, the number of extents went up by quite a bit (I don't have
>> the debugfs output in front of me at the moment, sorry).
>
> I'm curious what you meant by the combination of these two statements,
> "the total fragmentation of the file was much lower than before his
> patch", and "the number of extents went up by quite a bit". ?Can you
> send me the debugfs output when you have a chance?

Sorry it's taken me so long to reply.

Okay, my phrasing was not as precise as it could have been. What I
meant by "total fragmentation" was simply that the range of physical
blocks for the 10GB file was much lower with Andreas' patch:

Before patch: 8282112 - 103266303
After patch: 271360 - 5074943

The number of extents is much larger. See the attached debugfs output.

Here's the output of "e2fsck -E fragcheck" on the block devices;
remember, though, that each one has only 3 files:

-rw-rw-r-- 1 root root 10737418240 Apr 23 15:33 10g
-rw-rw-r-- 1 root root 4294967296 Apr 23 15:30 4g
-rw-rw-r-- 1 root root 4294967296 Apr 23 15:30 4g-2
drwx------ 2 root root 16384 Apr 23 15:27 lost+found/

Before patch:
e2fsck 1.41.3 (12-Oct-2008)
/dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks

After patch:
e2fsck 1.41.3 (12-Oct-2008)
/dev/hdo3: clean, 14/45760512 files, 7608258/183010471 blocks


Thanks,
Curt


Attachments:
debugfs.before-patch (888.00 B)
debugfs.after-patch (12.65 kB)

2009-04-29 19:16:54

by Theodore Ts'o

Subject: Re: Question on block group allocation

On Sun, Apr 26, 2009 at 11:29:39PM -0600, Curt Wohlgemuth wrote:
> The random read throughput on the 10GB file went from ~16 MB/s to ~22
> MB/s after Andreas' patch; the total fragmentation of the file was
> much lower than before his patch.
>
> However, the number of extents went up by quite a bit (I don't have
> the debugfs output in front of me at the moment, sorry). It seemed
> that no extent crossed a block group; I didn't have time to see if
> Andreas' patch disabled flex BGs or not, as to what was going on.
>
> I'll be able to send details out on Tuesday.

Hi Curt,

When you have a chance, can you send out the details from your test run?

Many thanks,

- Ted

2009-04-29 19:37:53

by Theodore Ts'o

Subject: Re: Question on block group allocation

On Wed, Apr 29, 2009 at 03:16:47PM -0400, Theodore Tso wrote:
>
> When you have a chance, can you send out the details from your test run?
>

Oops, sorry, our two e-mails overlapped. Sorry, I didn't see your new
e-mail when I sent my ping-o-gram.

On Wed, Apr 29, 2009 at 11:38:49AM -0700, Curt Wohlgemuth wrote:
>
> Okay, my phrasing was not as precise as it could have been. What I
> meant by "total fragmentation" was simply that the range of physical
> blocks for the 10GB file was much lower with Andreas' patch:
>
> Before patch: 8282112 - 103266303
> After patch: 271360 - 5074943
>
> The number of extents is much larger. See the attached debugfs output.

Ah, OK. You didn't attach the "e2fsck -E fragcheck" output, but I'm
going to guess that the blocks for 10g, 4g, and 4g-2 ended up getting
interleaved, possibly because they were written in parallel, and not
one after the other? Each of the extents in the "after" debugfs output
was approximately 2k blocks (8 megabytes) in length, and they are
separated by a largish number of blocks.

Now, if my theory that the files were written in an interleaved
fashion is correct, and if it is also true that they will be read in
an interleaved pattern, the layout on disk might actually be the best
one. If however they are going to be read sequentially, and you
really want them to be allocated contiguously, then if you know what
the final size of these files will be, probably the best thing to do
is to use the fallocate system call.
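
For instance, a minimal user-space sketch of preallocating the final
size up front (using the posix_fallocate() wrapper; the 10GB size and
the file name just mirror the test files above, and error handling is
trimmed):

/* Reserve the file's final size before any data is written so the
 * allocator can lay the extents out in one pass.
 * Build with -D_FILE_OFFSET_BITS=64 on 32-bit systems. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "10g";
        off_t size = 10LL * 1024 * 1024 * 1024;         /* 10GB */
        int fd, err;

        fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* posix_fallocate() returns an errno value rather than -1 */
        err = posix_fallocate(fd, 0, size);
        if (err) {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
                return 1;
        }
        close(fd);
        return 0;
}

The writes that fill the file then need to go into the preallocated
space in place (with dd that means adding conv=notrunc) so the extents
reserved here are actually reused.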

Does that make sense?

- Ted

2009-04-29 20:21:17

by Curt Wohlgemuth

Subject: Re: Question on block group allocation

Hi Ted:

On Wed, Apr 29, 2009 at 12:37 PM, Theodore Tso <[email protected]> wrote:
> On Wed, Apr 29, 2009 at 03:16:47PM -0400, Theodore Tso wrote:
>>
>> When you have a chance, can you send out the details from your test run?
>>
>
> Oops, sorry, our two e-mails overlapped. Sorry, I didn't see your new
> e-mail when I sent my ping-o-gram.
>
> On Wed, Apr 29, 2009 at 11:38:49AM -0700, Curt Wohlgemuth wrote:
>>
>> Okay, my phrasing was not as precise as it could have been. What I
>> meant by "total fragmentation" was simply that the range of physical
>> blocks for the 10GB file was much lower with Andreas' patch:
>>
>> Before patch: 8282112 - 103266303
>> After patch: 271360 - 5074943
>>
>> The number of extents is much larger. See the attached debugfs output.
>
> Ah, OK. You didn't attach the "e2fsck -E fragcheck" output, but I'm
> going to guess that the blocks for 10g, 4g, and 4g-2 ended up getting
> interleaved, possibly because they were written in parallel, and not
> one after the other? Each of the extents in the "after" debugfs output
> was approximately 2k blocks (8 megabytes) in length, and they are
> separated by a largish number of blocks.

Hmm, I thought I attached the output from "e2fsck -E fragcheck"; yes,
I did: one simple line:

/dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks

And actually, I created the files sequentially:

dd if=/dev/zero of=$MNT_PT/4g bs=1G count=4
dd if=/dev/zero of=$MNT_PT/4g-2 bs=1G count=4
dd if=/dev/zero of=$MNT_PT/10g bs=1G count=10

> Now, if my theory that the files were written in an interleaved
> fashion is correct, and if it is also true that they will be read in
> an interleaved pattern, the layout on disk might actually be the best
> one. If however they are going to be read sequentially, and you
> really want them to be allocated contiguously, then if you know what
> the final size of these files will be, probably the best thing to do
> is to use the fallocate system call.
>
> Does that make sense?

Sure, in this sense.

The test in question does something like this:

1. Create 20 or so large files, sequentially.
2. Randomly choose a file.
3. Randomly choose an offset in this file.
4. Read from that file/offset a fixed buffer size (say 256k); the file
was opened with O_DIRECT
5. Go back to #2
6. Stop after some time period

This might not be the most realistic workload we want (the test
actually can be run by doing #1 above with multiple threads), but it's
certainly interesting.
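
In rough C terms the read loop is something like the sketch below (the
256k buffer size and O_DIRECT come straight from the description above;
the file names, sizes and iteration count are just placeholders):

/* Simplified random-read loop: pick a file, pick an aligned offset,
 * read one 256k buffer with O_DIRECT, repeat.
 * Build with -D_FILE_OFFSET_BITS=64 on 32-bit systems. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE (256 * 1024)

int main(void)
{
        const char *files[] = { "4g", "4g-2", "10g" };
        const off_t sizes[] = { 4LL << 30, 4LL << 30, 10LL << 30 };
        int fds[3];
        void *buf;

        /* O_DIRECT needs an aligned buffer */
        if (posix_memalign(&buf, 4096, BUF_SIZE))
                return 1;

        for (int f = 0; f < 3; f++) {
                fds[f] = open(files[f], O_RDONLY | O_DIRECT);
                if (fds[f] < 0) {
                        perror("open");
                        return 1;
                }
        }

        for (int iter = 0; iter < 10000; iter++) {
                int f = rand() % 3;
                off_t off = (rand() % (sizes[f] / BUF_SIZE)) * BUF_SIZE;

                if (pread(fds[f], buf, BUF_SIZE, off) < 0)
                        perror("pread");
        }

        free(buf);
        return 0;
}

Whether the files end up being read back like this or sequentially is
what decides whether a given on-disk layout helps or hurts.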

The point that I'm interested in is why the physical block spread is
so different for the 10GB file between (a) the above 'dd' command
sequence; and (b) simply creating the "10g" file alone, without
creating the 4GB files first.

I just did (b) above on a kernel without Andreas' patch, on a freshly
formatted ext4 FS, and here's (most of) the debugfs output for it:

BLOCKS:
(IND):164865, (0-63487):34816-98303, (63488-126975):100352-163839,
(126976-190463):165888-229375, (190464-253951):231424-294911,
(253952-481279):296960-524287, (481280-544767):821248-884735,
(544768-706559):886784-1048575, (706560-1196031):1607680-2097151,
(1196032-1453067):2656256-2913291
TOTAL: 1453069

The total spread of the blocks is tiny compared to the total spread
from the 3 "dd" commands above.

I haven't yet really looked at the block allocation results using
Andreas' patch, except for the "10g" file after the three "dd"
commands above. So I'm not sure what the effects are with, say,
larger numbers of files. I'll be doing some more experimentation
soon.

Thanks,
Curt

2009-04-29 21:20:54

by Theodore Ts'o

Subject: Re: Question on block group allocation

On Wed, Apr 29, 2009 at 01:21:09PM -0700, Curt Wohlgemuth wrote:
>
> Hmm, I thought I attached the output from "e2fsck -E fragcheck"; yes,
> I did: one simple line:
>
> /dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks

Sorry, I should have been more explicit. You need to do

"e2fsck -f -E fragcheck", and you will get a *heck* of a lot more than
a single line. :-)

> And actually, I created the files sequentially:
>
> dd if=/dev/zero of=$MNT_PT/4g bs=1G count=4
> dd if=/dev/zero of=$MNT_PT/4g-2 bs=1G count=4
> dd if=/dev/zero of=$MNT_PT/10g bs=1G count=10

Really? Hmm, I wouldn't have expected that. So now I'd really love
to see the fragcheck results (both with and without the patch), and/or
the results of debugfs stat'ing all three files, both with and without
the patch.

Thanks,

- Ted

2009-04-29 21:50:37

by Theodore Ts'o

Subject: Re: Question on block group allocation

Oh --- one more question. You did these tests on your 2.6.26-based
kernel with ext4 backports, right? Not 2.6.30 mainline kernel? Did
you backport the changes to the block and inode allocators? i.e.,
this patch (plus 1 or 2 subsequent bug fixes)?


commit a4912123b688e057084e6557cef8924f7ae5bbde
Author: Theodore Ts'o <[email protected]>
Date: Thu Mar 12 12:18:34 2009 -0400

ext4: New inode/block allocation algorithms for flex_bg filesystems

The find_group_flex() inode allocator is now only used if the
filesystem is mounted using the "oldalloc" mount option. It is
replaced with the original Orlov allocator that has been updated for
flex_bg filesystems (it should behave the same way if flex_bg is
disabled). The inode allocator now functions by taking into account
each flex_bg group, instead of each block group, when deciding whether
or not it's time to allocate a new directory into a fresh flex_bg.

The block allocator has also been changed so that the first block
group in each flex_bg is preferred for use for storing directory
blocks. This keeps directory blocks close together, which is good for
speeding up e2fsck since large directories are more likely to look
like this:

debugfs: stat /home/tytso/Maildir/cur
Inode: 1844562 Type: directory Mode: 0700 Flags: 0x81000
Generation: 1132745781 Version: 0x00000000:0000ad71
User: 15806 Group: 15806 Size: 1060864
File ACL: 0 Directory ACL: 0
Links: 2 Blockcount: 2072
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
Size of extra inode fields: 28
BLOCKS:
(0):7348651, (1-258):7348654-7348911
TOTAL: 259

Signed-off-by: "Theodore Ts'o" <[email protected]>

- Ted

2009-04-29 22:30:01

by Curt Wohlgemuth

Subject: Re: Question on block group allocation

Hi Ted:

On Wed, Apr 29, 2009 at 2:50 PM, Theodore Tso <[email protected]> wrote:
> Oh --- one more question. You did these tests on your 2.6.26-based
> kernel with ext4 backports, right? Not 2.6.30 mainline kernel? Did
> you backport the changes to the block and inode allocators? i.e.,
> this patch (plus 1 or 2 subsequent bug fixes)?
>
>
> commit a4912123b688e057084e6557cef8924f7ae5bbde
> Author: Theodore Ts'o <[email protected]>
> Date: Thu Mar 12 12:18:34 2009 -0400
>
> ? ?ext4: New inode/block allocation algorithms for flex_bg filesystems

Yes, we have this patch. I'm not sure if we have the "1 or 2" bug
fixes you refer to above; do you have commits for these?

I'm regen'ing the e2fsck and debugfs output for the 3 "dd" sequence
above, for our stock kernel and for this + Andreas' patch.

Thanks,
Curt

>
> The find_group_flex() inode allocator is now only used if the
> filesystem is mounted using the "oldalloc" mount option. It is
> replaced with the original Orlov allocator that has been updated for
> flex_bg filesystems (it should behave the same way if flex_bg is
> disabled). The inode allocator now functions by taking into account
> each flex_bg group, instead of each block group, when deciding whether
> or not it's time to allocate a new directory into a fresh flex_bg.
>
> The block allocator has also been changed so that the first block
> group in each flex_bg is preferred for use for storing directory
> blocks. This keeps directory blocks close together, which is good for
> speeding up e2fsck since large directories are more likely to look
> like this:
>
> debugfs: stat /home/tytso/Maildir/cur
> Inode: 1844562 Type: directory Mode: 0700 Flags: 0x81000
> Generation: 1132745781 Version: 0x00000000:0000ad71
> User: 15806 Group: 15806 Size: 1060864
> File ACL: 0 Directory ACL: 0
> Links: 2 Blockcount: 2072
> Fragment: Address: 0 Number: 0 Size: 0
> ctime: 0x499c0ff4:164961f4 -- Wed Feb 18 08:41:08 2009
> atime: 0x499c0ff4:00000000 -- Wed Feb 18 08:41:08 2009
> mtime: 0x49957f51:00000000 -- Fri Feb 13 09:10:25 2009
> crtime: 0x499c0f57:00d51440 -- Wed Feb 18 08:38:31 2009
> Size of extra inode fields: 28
> BLOCKS:
> (0):7348651, (1-258):7348654-7348911
> TOTAL: 259
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> - Ted
>

2009-05-01 04:39:42

by Theodore Ts'o

Subject: Re: Question on block group allocation

On Wed, Apr 29, 2009 at 03:29:55PM -0700, Curt Wohlgemuth wrote:
> Yes, we have this patch. I'm not sure if we have the "1 or 2" bug
> fixes you refer to above; do you have commits for these?

b5451f7b ext4: Fix potential inode allocation soft lockup in Orlov allocator
6b82f3cb ext4: really print the find_group_flex fallback warning only once
7d39db14 ext4: Use struct flex_groups to calculate get_orlov_stats()
9f24e420 ext4: Use atomic_t's in struct flex_groups

- Ted

2009-05-04 15:52:11

by Curt Wohlgemuth

Subject: Re: Question on block group allocation

Hi Ted:

On Wed, Apr 29, 2009 at 2:20 PM, Theodore Tso <[email protected]> wrote:
> On Wed, Apr 29, 2009 at 01:21:09PM -0700, Curt Wohlgemuth wrote:
>>
>> Hmm, I thought I attached the output from "e2fsck -E fragcheck"; yes,
>> I did: one simple line:
>>
>> /dev/hdm3: clean, 14/45760512 files, 7608255/183010471 blocks
>
> Sorry, I should have been more explicit. You need to do
>
> "e2fsck -f -E fragcheck", and you will get a *heck* of a lot more than
> a single line. :-)
>
>> And actually, I created the files sequentially:
>>
>> dd if=/dev/zero of=$MNT_PT/4g bs=1G count=4
>> dd if=/dev/zero of=$MNT_PT/4g-2 bs=1G count=4
>> dd if=/dev/zero of=$MNT_PT/10g bs=1G count=10
>
> Really? Hmm, I wouldn't have expected that. So now I'd really love
> to see the fragcheck results (both with and without the patch), and/or
> the results of debugfs stat'ing all three files, both with and without
> the patch.

Although it might seem like I've been ignoring this request, in fact
I'm having trouble recreating the problem now. Both the "three dd"
commands above, and the performance problem I was seeing in the
original posting seem to have mysteriously disappeared. I'll keep
trying and let the list know what I find.

Thanks,
Curt

>
> Thanks,
>
> - Ted
>