2011-04-04 19:11:54

by Eric Sandeen

[permalink] [raw]
Subject: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

Block devices may set minimum or optimal IO hints equal to
blocksize; in this case there is really nothing for ext4
to do with this information (i.e. search for a block-aligned
allocation?) so don't set fs geometry with single-block
values.

Zeev also reported that with a block-sized stripe, the
ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
oddly enough.

Reported-by: Zeev Tarantov <[email protected]>
Signed-off-by: Eric Sandeen <[email protected]>
---

diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 9798b88..74b838c 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -1135,8 +1135,11 @@ static int get_device_geometry(const char *file,
if ((opt_io == 0) && (psector_size > blocksize))
opt_io = psector_size;

- fs_param->s_raid_stride = min_io / blocksize;
- fs_param->s_raid_stripe_width = opt_io / blocksize;
+ /* setting stripe/stride to blocksize is pointless */
+ if (min_io > blocksize)
+ fs_param->s_raid_stride = min_io / blocksize;
+ if (opt_io > blocksize)
+ fs_param->s_raid_stripe_width = opt_io / blocksize;

rc = blkid_topology_get_alignment_offset(tp);
out:



2011-04-05 08:10:21

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On 2011-04-04, at 9:11 AM, Eric Sandeen wrote:
> Block devices may set minimum or optimal IO hints equal to
> blocksize; in this case there is really nothing for ext4
> to do with this information (i.e. search for a block-aligned
> allocation?) so don't set fs geometry with single-block
> values.
>
> Zeev also reported that with a block-sized stripe, the
> ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
> oddly enough.
>
> Reported-by: Zeev Tarantov <[email protected]>
> Signed-off-by: Eric Sandeen <[email protected]>
> ---
>
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 9798b88..74b838c 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -1135,8 +1135,11 @@ static int get_device_geometry(const char *file,
> if ((opt_io == 0) && (psector_size > blocksize))
> opt_io = psector_size;
>
> - fs_param->s_raid_stride = min_io / blocksize;
> - fs_param->s_raid_stripe_width = opt_io / blocksize;
> + /* setting stripe/stride to blocksize is pointless */
> + if (min_io > blocksize)
> + fs_param->s_raid_stride = min_io / blocksize;
> + if (opt_io > blocksize)
> + fs_param->s_raid_stripe_width = opt_io / blocksize;

I don't think it is harmful to specify an mballoc alignment that is an even multiple of the underlying device IO size (e.g. at least 256kB or 512kB).

If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io size because that is PAGE_SIZE, but blocksize is 4kB, then we will have the same performance problem again.

Cheers, Andreas






2011-04-05 08:43:21

by Zeev Tarantov

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On Tue, Apr 5, 2011 at 11:10, Andreas Dilger <[email protected]> wrote:
> On 2011-04-04, at 9:11 AM, Eric Sandeen wrote:
>> Block devices may set minimum or optimal IO hints equal to
>> blocksize; in this case there is really nothing for ext4
>> to do with this information (i.e. search for a block-aligned
>> allocation?) so don't set fs geometry with single-block
>> values.
>>
>> Zeev also reported that with a block-sized stripe, the
>> ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
>> oddly enough.
>>
>> Reported-by: Zeev Tarantov <[email protected]>
>> Signed-off-by: Eric Sandeen <[email protected]>
>> ---
>>
>> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
>> index 9798b88..74b838c 100644
>> --- a/misc/mke2fs.c
>> +++ b/misc/mke2fs.c
>> @@ -1135,8 +1135,11 @@ static int get_device_geometry(const char *file,
>> ? ? ? if ((opt_io == 0) && (psector_size > blocksize))
>> ? ? ? ? ? ? ? opt_io = psector_size;
>>
>> - ? ? fs_param->s_raid_stride = min_io / blocksize;
>> - ? ? fs_param->s_raid_stripe_width = opt_io / blocksize;
>> + ? ? /* setting stripe/stride to blocksize is pointless */
>> + ? ? if (min_io > blocksize)
>> + ? ? ? ? ? ? fs_param->s_raid_stride = min_io / blocksize;
>> + ? ? if (opt_io > blocksize)
>> + ? ? ? ? ? ? fs_param->s_raid_stripe_width = opt_io / blocksize;
>
> I don't think it is harmful to specify an mballoc alignment that is an even multiple of the underlying device IO size (e.g. at least 256kB or 512kB).
>
> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io size because that is PAGE_SIZE, but blocksize is 4kB, then we will have the same performance problem again.

I think I'm not understanding you. Are you objecting to the patch? If so, why?
If io block is an even multiple of fs blocks, then mballoc really does
need to use the fancier aligning code (though for a ratio that is a
power of two, it shouldn't need to be slow).
In your example, If PAGE_SIZE is 16k and zram reports min and opt io
size as 16k but fs block is 4k, then Eric Sandeen's new code will set
stride and stripe to 4, as it should. Then mballoc will really need to
align blocks in groups of 4 and spending time doing that is not a
performance problem.

This patch is only meant to _not_ set stripe and stride to 1, which
cannot help block allocation and does in fact cause the kernel to
spend more cpu time because it does not detect the degenerate case.
Making the allocator faster in the case of a ratio that is a power of
two is the work of a different patch someone may want to write.
Using larger fs blocks in case the underlying block device prefers
larger io blocks is the work of the admin, I think.

> Cheers, Andreas

-Z.T.

2011-04-05 16:39:13

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On 4/5/11 1:10 AM, Andreas Dilger wrote:
> On 2011-04-04, at 9:11 AM, Eric Sandeen wrote:
>> Block devices may set minimum or optimal IO hints equal to
>> blocksize; in this case there is really nothing for ext4
>> to do with this information (i.e. search for a block-aligned
>> allocation?) so don't set fs geometry with single-block
>> values.
>>
>> Zeev also reported that with a block-sized stripe, the
>> ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
>> oddly enough.
>>
>> Reported-by: Zeev Tarantov <[email protected]>
>> Signed-off-by: Eric Sandeen <[email protected]>
>> ---
>>
>> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
>> index 9798b88..74b838c 100644
>> --- a/misc/mke2fs.c
>> +++ b/misc/mke2fs.c
>> @@ -1135,8 +1135,11 @@ static int get_device_geometry(const char *file,
>> if ((opt_io == 0) && (psector_size > blocksize))
>> opt_io = psector_size;
>>
>> - fs_param->s_raid_stride = min_io / blocksize;
>> - fs_param->s_raid_stripe_width = opt_io / blocksize;
>> + /* setting stripe/stride to blocksize is pointless */
>> + if (min_io > blocksize)
>> + fs_param->s_raid_stride = min_io / blocksize;
>> + if (opt_io > blocksize)
>> + fs_param->s_raid_stripe_width = opt_io / blocksize;
>
> I don't think it is harmful to specify an mballoc alignment that is
> an even multiple of the underlying device IO size (e.g. at least
> 256kB or 512kB).
>
> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
> have the same performance problem again.>
> Cheers, Andreas

I need to look into why ext4_mb_scan_aligned is so inefficient for a block-sized stripe.

In practice I don't think we've seen this problem with stripe size at 4 or 8 or 16 blocks; it may just be less apparent. I think the function steps through by stripe-sized units, and if that is 1 block, it's a lot of stepping.

while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
...
if (!mb_test_bit(i, bitmap)) {
...
}
i += sbi->s_stripe;
}

But in any case, setting stripe alignment to 1 block makes no sense to me, and I see no reason to do it at mkfs time...

-Eric

2011-04-05 16:56:11

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On 4/5/11 9:39 AM, Eric Sandeen wrote:
> On 4/5/11 1:10 AM, Andreas Dilger wrote:
>> On 2011-04-04, at 9:11 AM, Eric Sandeen wrote:
>>> Block devices may set minimum or optimal IO hints equal to
>>> blocksize; in this case there is really nothing for ext4
>>> to do with this information (i.e. search for a block-aligned
>>> allocation?) so don't set fs geometry with single-block
>>> values.
>>>
>>> Zeev also reported that with a block-sized stripe, the
>>> ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
>>> oddly enough.
>>>
>>> Reported-by: Zeev Tarantov <[email protected]>
>>> Signed-off-by: Eric Sandeen <[email protected]>
>>> ---
>>>
>>> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
>>> index 9798b88..74b838c 100644
>>> --- a/misc/mke2fs.c
>>> +++ b/misc/mke2fs.c
>>> @@ -1135,8 +1135,11 @@ static int get_device_geometry(const char *file,
>>> if ((opt_io == 0) && (psector_size > blocksize))
>>> opt_io = psector_size;
>>>
>>> - fs_param->s_raid_stride = min_io / blocksize;
>>> - fs_param->s_raid_stripe_width = opt_io / blocksize;
>>> + /* setting stripe/stride to blocksize is pointless */
>>> + if (min_io > blocksize)
>>> + fs_param->s_raid_stride = min_io / blocksize;
>>> + if (opt_io > blocksize)
>>> + fs_param->s_raid_stripe_width = opt_io / blocksize;
>>
>> I don't think it is harmful to specify an mballoc alignment that is
>> an even multiple of the underlying device IO size (e.g. at least
>> 256kB or 512kB).
>>
>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>> have the same performance problem again.>
>> Cheers, Andreas
>
> I need to look into why ext4_mb_scan_aligned is so inefficient for a block-sized stripe.
>
> In practice I don't think we've seen this problem with stripe size at 4 or 8 or 16 blocks; it may just be less apparent. I think the function steps through by stripe-sized units, and if that is 1 block, it's a lot of stepping.
>
> while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
> ...
> if (!mb_test_bit(i, bitmap)) {

Offhand I think maybe mb_find_next_zero_bit would be more efficient.

--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
i = (a * sbi->s_stripe) - first_group_block;

while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
- if (!mb_test_bit(i, bitmap)) {
- max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
- if (max >= sbi->s_stripe) {
- ac->ac_found++;
- ac->ac_b_ex = ex;
- ext4_mb_use_best_found(ac, e4b);
- break;
- }
+ i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
+ max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
+ if (max >= sbi->s_stripe) {
+ ac->ac_found++;
+ ac->ac_b_ex = ex;
+ ext4_mb_use_best_found(ac, e4b);
+ break;
}
- i += sbi->s_stripe;
}
}

totally untested, but I think we have better ways to step through the bitmap.

-Eric

> ...
> }
> i += sbi->s_stripe;
> }
>
> But in any case, setting stripe alignment to 1 block makes no sense to me, and I see no reason to do it at mkfs time...
>
> -Eric
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2011-04-05 22:21:22

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On 4/5/11 9:56 AM, Eric Sandeen wrote:
> On 4/5/11 9:39 AM, Eric Sandeen wrote:
>> On 4/5/11 1:10 AM, Andreas Dilger wrote:
>>> On 2011-04-04, at 9:11 AM, Eric Sandeen wrote:
>>>> Block devices may set minimum or optimal IO hints equal to
>>>> blocksize; in this case there is really nothing for ext4
>>>> to do with this information (i.e. search for a block-aligned
>>>> allocation?) so don't set fs geometry with single-block
>>>> values.
>>>>
>>>> Zeev also reported that with a block-sized stripe, the
>>>> ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
>>>> oddly enough.
>>>>
>>>> Reported-by: Zeev Tarantov <[email protected]>
>>>> Signed-off-by: Eric Sandeen <[email protected]>
>>>> ---
>>>>
>>>> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
>>>> index 9798b88..74b838c 100644
>>>> --- a/misc/mke2fs.c
>>>> +++ b/misc/mke2fs.c
>>>> @@ -1135,8 +1135,11 @@ static int get_device_geometry(const char *file,
>>>> if ((opt_io == 0) && (psector_size > blocksize))
>>>> opt_io = psector_size;
>>>>
>>>> - fs_param->s_raid_stride = min_io / blocksize;
>>>> - fs_param->s_raid_stripe_width = opt_io / blocksize;
>>>> + /* setting stripe/stride to blocksize is pointless */
>>>> + if (min_io > blocksize)
>>>> + fs_param->s_raid_stride = min_io / blocksize;
>>>> + if (opt_io > blocksize)
>>>> + fs_param->s_raid_stripe_width = opt_io / blocksize;
>>>
>>> I don't think it is harmful to specify an mballoc alignment that is
>>> an even multiple of the underlying device IO size (e.g. at least
>>> 256kB or 512kB).
>>>
>>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>>> have the same performance problem again.>
>>> Cheers, Andreas
>>
>> I need to look into why ext4_mb_scan_aligned is so inefficient for a block-sized stripe.
>>
>> In practice I don't think we've seen this problem with stripe size at 4 or 8 or 16 blocks; it may just be less apparent. I think the function steps through by stripe-sized units, and if that is 1 block, it's a lot of stepping.
>>
>> while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>> ...
>> if (!mb_test_bit(i, bitmap)) {
>
> Offhand I think maybe mb_find_next_zero_bit would be more efficient.
>
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
> i = (a * sbi->s_stripe) - first_group_block;
>
> while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
> - if (!mb_test_bit(i, bitmap)) {
> - max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> - if (max >= sbi->s_stripe) {
> - ac->ac_found++;
> - ac->ac_b_ex = ex;
> - ext4_mb_use_best_found(ac, e4b);
> - break;
> - }
> + i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
> + max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> + if (max >= sbi->s_stripe) {
> + ac->ac_found++;
> + ac->ac_b_ex = ex;
> + ext4_mb_use_best_found(ac, e4b);
> + break;
> }
> - i += sbi->s_stripe;
> }
> }
>
> totally untested, but I think we have better ways to step through the bitmap.

I tested it ;)

Seems to work fine, though I probably should see how things actually got allocated.

Creating an fs with 1-block stripes & widths as with the original report, and copying a (built) 2.6 linux kernel tree, it took about 7 minutes, and looped in the while() loop above 328215171 times.

With the patch above, it took 6m30s, and looped 25055 times.

For a filesystem with no stripe/stride set, it took 6m26s.

Hm, but subsequent tests w/ the tiny stripe set came around 6m30s as well. So there's no obvious speedup.

Still, avoiding all that looping seems beneficial, I can send a patch after I make sure allocation is still happening as expected.

Zeev, if you'd like to test that patch above with your profiling, that'd be awesome.

Thanks,
-Eric

2011-04-06 06:52:17

by Zeev Tarantov

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On Wed, Apr 6, 2011 at 01:21, Eric Sandeen <[email protected]> wrote:
> Zeev, if you'd like to test that patch above with your profiling, that'd be awesome.

My wall-clock timings are strange and I disregard them.
On a desktop with X turned off and perf recording system-wide, I trust
the profile to show everything.

In the no-stride case, I don't think you've made it any worse.
In the stripe=stride=1, it went from:

[ perf record: Woken up 13 times to write data ]
[ perf record: Captured and wrote 3.163 MB perf.data (~138211 samples) ]
# Events: 14K cycles
#
# Overhead Command Shared Object
Symbol
# ........ .............. .....................
.......................................
#
13.08% gzip gzip [.] zip
12.65% flush-253:0 [kernel.kallsyms] [k] ext4_mb_scan_aligned
12.46% gzip gzip [.] treat_file.part.4.2264
9.34% flush-253:0 [csnappy_compress] [k] snappy_compress_fragment
5.47% md5sum md5sum [.] digest_file.isra.2.2089
4.70% md5sum [csnappy_decompress] [k]
snappy_decompress_noheader
4.29% md5sum md5sum [.] 0x2fe7
2.16% gzip libc-2.13.so [.] __memcpy_ssse3
1.79% flush-253:0 [kernel.kallsyms] [k] memcpy
1.52% tar [kernel.kallsyms] [k] copy_user_generic_string
1.08% tar [kernel.kallsyms] [k] _raw_spin_lock
1.00% gzip [kernel.kallsyms] [k] copy_user_generic_string
0.87% flush-253:0 [kernel.kallsyms] [k] __memset
0.58% md5sum md5sum [.] __libc_csu_init
0.54% md5sum [zram] [k] zram_make_request
0.50% md5sum [kernel.kallsyms] [k] copy_user_generic_string
0.45% tar [kernel.kallsyms] [k] ext4_mark_iloc_dirty
0.44% tar [kernel.kallsyms] [k] __memset
0.39% gzip gzip [.] treat_stdin.2262
0.35% gzip gzip [.] treat_file.2267
0.33% swapper [kernel.kallsyms] [k] mwait_idle
0.28% dd [kernel.kallsyms] [k] system_call
0.26% md5sum [kernel.kallsyms] [k] memcpy
0.25% flush-253:0 [zram] [k] zram_make_request
0.25% flush-253:0 [kernel.kallsyms] [k] _raw_spin_lock
0.23% tar [kernel.kallsyms] [k] ext4_mb_scan_aligned
0.22% gzip gzip [.] compress_block.2644.2190
0.20% tar [kernel.kallsyms] [k] __find_get_block

to:

[ perf record: Woken up 12 times to write data ]
[ perf record: Captured and wrote 3.091 MB perf.data (~135042 samples) ]
# Events: 13K cycles
#
# Overhead Command Shared Object
Symbol
# ........ ........... .....................
.....................................
#
15.10% gzip gzip [.] zip
14.70% gzip gzip [.] treat_file.part.4.2264
11.10% flush-253:0 [csnappy_compress] [k] snappy_compress_fragment
6.12% md5sum md5sum [.] digest_file.isra.2.2089
5.44% md5sum [csnappy_decompress] [k] snappy_decompress_noheader
5.10% md5sum md5sum [.] 0x3262
2.42% gzip libc-2.13.so [.] __memcpy_ssse3
2.03% flush-253:0 [kernel.kallsyms] [k] memcpy
1.52% tar [kernel.kallsyms] [k] copy_user_generic_string
1.13% gzip [kernel.kallsyms] [k] copy_user_generic_string
0.80% flush-253:0 [kernel.kallsyms] [k] __memset
0.62% md5sum md5sum [.] __libc_csu_init
0.60% md5sum [zram] [k] zram_make_request
0.57% gzip gzip [.] treat_stdin.2262
0.56% md5sum [kernel.kallsyms] [k] copy_user_generic_string
0.55% tar [kernel.kallsyms] [k] __memset
0.49% tar [kernel.kallsyms] [k] ext4_mark_iloc_dirty
0.39% md5sum [kernel.kallsyms] [k] memcpy
0.37% gzip gzip [.] treat_file.2267
0.37% swapper [kernel.kallsyms] [k] mwait_idle
0.28% dd [kernel.kallsyms] [k] system_call
0.28% flush-253:0 [kernel.kallsyms] [k] _raw_spin_lock
0.25% gzip gzip [.] compress_block.2644.2190
0.22% flush-253:0 [kernel.kallsyms] [k] ext4_mark_iloc_dirty
0.22% tar [kernel.kallsyms] [k] __ext4_get_inode_loc
0.22% tar [kernel.kallsyms] [k] _raw_spin_lock
0.21% tar [kernel.kallsyms] [k] mark_page_accessed
0.20% tar [kernel.kallsyms] [k] __find_get_block
0.20% tar tar [.] contains_dot_dot

So that's great, but I still don't want mke2fs to set a stripe and
stride of 1, because that's silly - no code improvement in the
allocator is ever going to be able to use that information.

> Thanks,
> -Eric

Thank you!
-Z.T.

2011-04-06 17:00:46

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On 4/5/11 11:51 PM, Zeev Tarantov wrote:
> On Wed, Apr 6, 2011 at 01:21, Eric Sandeen <[email protected]> wrote:
>> Zeev, if you'd like to test that patch above with your profiling, that'd be awesome.
>
> My wall-clock timings are strange and I disregard them.
> On a desktop with X turned off and perf recording system-wide, I trust
> the profile to show everything.

Agreed, the wallclock time was just a very coarse sanity check while off at a conference :)

> In the no-stride case, I don't think you've made it any worse.

it shouldn't even go down this path so I hope not ;)

> In the stripe=stride=1, it went from:
>
> [ perf record: Woken up 13 times to write data ]
> [ perf record: Captured and wrote 3.163 MB perf.data (~138211 samples) ]
> # Events: 14K cycles
> #
> # Overhead Command Shared Object
> Symbol
> # ........ .............. .....................
> .......................................
> #
> 13.08% gzip gzip [.] zip
> 12.65% flush-253:0 [kernel.kallsyms] [k] ext4_mb_scan_aligned

...

> to:
>
> [ perf record: Woken up 12 times to write data ]
> [ perf record: Captured and wrote 3.091 MB perf.data (~135042 samples) ]
> # Events: 13K cycles
> #
> # Overhead Command Shared Object
> Symbol
> # ........ ........... .....................
> .....................................
> #
> 15.10% gzip gzip [.] zip
> 14.70% gzip gzip [.] treat_file.part.4.2264
> 11.10% flush-253:0 [csnappy_compress] [k] snappy_compress_fragment
> 6.12% md5sum md5sum [.] digest_file.isra.2.2089
...
> 0.23% tar [kernel.kallsyms] [k] ext4_mb_scan_aligned
...

> So that's great, but I still don't want mke2fs to set a stripe and
> stride of 1, because that's silly - no code improvement in the
> allocator is ever going to be able to use that information.

I totally agree, and I'm sure the mkfs patch will go in as well.

Thanks for the verification of the kernel patch, too.

It should help a bit for sane stripe-sizes, too, with less impact as the stripe size gets bigger.

Thanks,
-Eric

>> Thanks,
>> -Eric
>
> Thank you!
> -Z.T.


2011-04-08 00:13:28

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs


On 2011-04-05, at 10:56 AM, Eric Sandeen wrote:

> On 4/5/11 9:39 AM, Eric Sandeen wrote:
>> Andreas Dilger wrote:
>>> I don't think it is harmful to specify an mballoc alignment that is
>>> an even multiple of the underlying device IO size (e.g. at least
>>> 256kB or 512kB).
>>>
>>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>>> have the same performance problem again.>
>>> Cheers, Andreas
>>
>> I need to look into why ext4_mb_scan_aligned is so inefficient for a block-sized stripe.
>>
>> In practice I don't think we've seen this problem with stripe size at 4 or 8 or 16 blocks; it may just be less apparent. I think the function steps through by stripe-sized units, and if that is 1 block, it's a lot of stepping.
>>
>> while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>> ...
>> if (!mb_test_bit(i, bitmap)) {
>
> Offhand I think maybe mb_find_next_zero_bit would be more efficient.
>
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
> i = (a * sbi->s_stripe) - first_group_block;
>
> while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
> - if (!mb_test_bit(i, bitmap)) {
> - max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> - if (max >= sbi->s_stripe) {
> - ac->ac_found++;
> - ac->ac_b_ex = ex;
> - ext4_mb_use_best_found(ac, e4b);
> - break;
> - }
> + i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
> + max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
> + if (max >= sbi->s_stripe) {
> + ac->ac_found++;
> + ac->ac_b_ex = ex;
> + ext4_mb_use_best_found(ac, e4b);
> + break;
> }
> - i += sbi->s_stripe;
> }
> }
>
> totally untested, but I think we have better ways to step through the bitmap.

This changes the allocation completely, AFAICS. Instead of doing checks for chunks of free space aligned on sbi->s_stripe boundaries, it is instead finding the first free space of size s_stripe regardless of alignment. That is not good for RAID back-ends, and is the primary reason for ext4_mb_scan_aligned() to exist.

I think my original assertion holds - that regardless of what the "optimal IO" size reported by the underlying device, doing larger allocations at the mballoc level that are even multiples of this size isn't harmful. That avoids not only the performance impact of 4kB-sized "optimal IO", but also the (lesser) impact of 8kB-64kB "optimal IO" allocations as well.

Cheers, Andreas






2011-04-08 00:24:45

by Eric Sandeen

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On 4/7/11 5:13 PM, Andreas Dilger wrote:
>
> On 2011-04-05, at 10:56 AM, Eric Sandeen wrote:
>
>> On 4/5/11 9:39 AM, Eric Sandeen wrote:
>>> Andreas Dilger wrote:
>>>> I don't think it is harmful to specify an mballoc alignment that is
>>>> an even multiple of the underlying device IO size (e.g. at least
>>>> 256kB or 512kB).
>>>>
>>>> If the underlying device (e.g. zram) is reporting 16kB or 64kB opt_io
>>>> size because that is PAGE_SIZE, but blocksize is 4kB, then we will
>>>> have the same performance problem again.>
>>>> Cheers, Andreas
>>>
>>> I need to look into why ext4_mb_scan_aligned is so inefficient for a block-sized stripe.
>>>
>>> In practice I don't think we've seen this problem with stripe size at 4 or 8 or 16 blocks; it may just be less apparent. I think the function steps through by stripe-sized units, and if that is 1 block, it's a lot of stepping.
>>>
>>> while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>>> ...
>>> if (!mb_test_bit(i, bitmap)) {
>>
>> Offhand I think maybe mb_find_next_zero_bit would be more efficient.
>>
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -1939,16 +1939,14 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
>> i = (a * sbi->s_stripe) - first_group_block;
>>
>> while (i < EXT4_BLOCKS_PER_GROUP(sb)) {
>> - if (!mb_test_bit(i, bitmap)) {
>> - max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
>> - if (max >= sbi->s_stripe) {
>> - ac->ac_found++;
>> - ac->ac_b_ex = ex;
>> - ext4_mb_use_best_found(ac, e4b);
>> - break;
>> - }
>> + i = mb_find_next_zero_bit(bitmap, EXT4_BLOCKS_PER_GROUP(sb), i);
>> + max = mb_find_extent(e4b, 0, i, sbi->s_stripe, &ex);
>> + if (max >= sbi->s_stripe) {
>> + ac->ac_found++;
>> + ac->ac_b_ex = ex;
>> + ext4_mb_use_best_found(ac, e4b);
>> + break;
>> }
>> - i += sbi->s_stripe;
>> }
>> }
>>
>> totally untested, but I think we have better ways to step through the bitmap.
>
> This changes the allocation completely, AFAICS. Instead of doing
> checks for chunks of free space aligned on sbi->s_stripe boundaries,
> it is instead finding the first free space of size s_stripe
> regardless of alignment. That is not good for RAID back-ends, and is
> the primary reason for ext4_mb_scan_aligned() to exist.

Oh, er, right. It's what I get for coding-at-conference, sorry.

I do wonder if test-bit/advance/test-bit/advance can be made a bit more efficient with something like find_next_bit. I just did it wrong. :(

I'll revisit it when I get back home.

> I think my original assertion holds - that regardless of what the
> "optimal IO" size reported by the underlying device, doing larger
> allocations at the mballoc level that are even multiples of this size
> isn't harmful. That avoids not only the performance impact of
> 4kB-sized "optimal IO", but also the (lesser) impact of 8kB-64kB
> "optimal IO" allocations as well.>
> Cheers, Andreas

I'll give that some thought; really, the whole align-on-a-stripe mechanism needs work, at least outside of the Lustre workload :)

Thanks,
-Eric

2011-05-18 17:20:32

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] e2fsprogs: don't set stripe/stride to 1 block in mkfs

On Mon, Apr 04, 2011 at 03:11:52PM -0400, Eric Sandeen wrote:
> Block devices may set minimum or optimal IO hints equal to
> blocksize; in this case there is really nothing for ext4
> to do with this information (i.e. search for a block-aligned
> allocation?) so don't set fs geometry with single-block
> values.
>
> Zeev also reported that with a block-sized stripe, the
> ext4 allocator spends time spinning in ext4_mb_scan_aligned(),
> oddly enough.
>
> Reported-by: Zeev Tarantov <[email protected]>
> Signed-off-by: Eric Sandeen <[email protected]>

Applied to the maint branch of e2fsprogs (which will be merged up to
the next/master branches), thanks!!

- Ted