2020-04-27 04:04:17

by Alex Zhuravlev

Subject: [PATCH 2/2] ext4: skip non-loaded groups at cr=0/1

Hi, yet another patch.

cr=0 is supposed to be an optimization to save CPU cycles, but if buddy data (in memory)
is not initialized then all this makes no sense as we have to do sync IO taking a lot of cycles.
also, at cr=0 mballoc doesn't store any avaibale chunk. cr=1 also skips groups using heuristic
based on avg. fragment size. it's more useful to skip such groups and switch to cr=2 where
groups will be scanned for available chunks.

using sparse image and dm-slow virtual device of 120TB was simulated. then the image was
formatted and filled using debugfs to mark ~85% of available space as busy. mount process w/o
the patch couldn't complete in half an hour (according to vmstat it would take ~10-11 hours).
with the patch applied mount took ~20 seconds.

Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988
Signed-off-by: Alex Zhuravlev <[email protected]>
Reviewed-by: Andreas Dilger <[email protected]>
---
fs/ext4/mballoc.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index e84c298e739b..83e3e6ab1240 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1877,6 +1877,21 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
return 0;
}

+static inline int ext4_mb_uninit_on_disk(struct super_block *sb,
+ ext4_group_t group)
+{
+ struct ext4_group_desc *desc;
+
+ if (!ext4_has_group_desc_csum(sb))
+ return 0;
+
+ desc = ext4_get_group_desc(sb, group, NULL);
+ if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
+ return 1;
+
+ return 0;
+}
+
/*
* The routine scans buddy structures (not bitmap!) from given order
* to max order and tries to find big enough chunk to satisfy the req
@@ -2060,7 +2075,15 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,

/* We only do this if the grp has never been initialized */
if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
- int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+ int ret;
+
+ /* cr=0/1 is a very optimistic search to find large
+ * good chunks almost for free. if buddy data is
+ * not ready, then this optimization makes no sense */
+
+ if (cr < 2 && !ext4_mb_uninit_on_disk(ac->ac_sb, group))
+ return 0;
+ ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
if (ret)
return ret;
}
--
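For readers skimming the thread, the skip condition the diff introduces can be modeled in isolation. This is a simplified stand-alone sketch, not the real kernel API: the function name, arguments, and the flag value are stand-ins for the in-kernel `ext4_mb_good_group()` logic and `EXT4_BG_BLOCK_UNINIT`.

```c
#include <stdbool.h>

/* Simplified model of the check added to ext4_mb_good_group().
 * BG_BLOCK_UNINIT mirrors EXT4_BG_BLOCK_UNINIT: such a group has no
 * used blocks on disk, so its buddy data can be built without I/O. */
#define BG_BLOCK_UNINIT 0x2

/* Return true when the group should be skipped at this cr pass:
 * buddy data is not in memory, the pass is optimistic (cr < 2), and
 * initializing the group would require a synchronous bitmap read. */
static bool skip_group(int cr, bool buddy_loaded, unsigned bg_flags)
{
    if (buddy_loaded)
        return false;             /* buddy data ready: scan as usual */
    if (cr < 2 && !(bg_flags & BG_BLOCK_UNINIT))
        return true;              /* defer this group to cr=2+ */
    return false;                 /* cheap to init, or thorough pass */
}
```

At cr=2 every group is considered again, so a group deferred here is still scanned later, once its buddy data has been loaded or prefetched.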




2020-05-14 10:05:02

by Ritesh Harjani

Subject: Re: [PATCH 2/2] ext4: skip non-loaded groups at cr=0/1



On 4/27/20 9:33 AM, Alex Zhuravlev wrote:
> Hi, yet another patch.
Not needed in a commit msg.


>
> cr=0 is supposed to be an optimization to save CPU cycles, but if buddy data (in memory)
> is not initialized then all this makes no sense as we have to do sync IO taking a lot of cycles.
> also, at cr=0 mballoc doesn't store any avaibale chunk. cr=1 also skips groups using heuristic
/s/avaibale/available/

> based on avg. fragment size. it's more useful to skip such groups and switch to cr=2 where
> groups will be scanned for available chunks.
>
> using sparse image and dm-slow virtual device of 120TB was simulated. then the image was
> formatted and filled using debugfs to mark ~85% of available space as busy. mount process w/o
> the patch couldn't complete in half an hour (according to vmstat it would take ~10-11 hours).
> with the patch applied mount took ~20 seconds.

I guess we should edit the commit msg to explain that it is not the
mount process but the very first write whose performance is improved by
this patch.


>
> Lustre-bug-id: https://jira.whamcloud.com/browse/LU-12988

Not sure if we need this.

> Signed-off-by: Alex Zhuravlev <[email protected]>
> Reviewed-by: Andreas Dilger <[email protected]>
> ---
> fs/ext4/mballoc.c | 25 ++++++++++++++++++++++++-
> 1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index e84c298e739b..83e3e6ab1240 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1877,6 +1877,21 @@ int ext4_mb_find_by_goal(struct ext4_allocation_context *ac,
> return 0;
> }
>
> +static inline int ext4_mb_uninit_on_disk(struct super_block *sb,
> + ext4_group_t group)
> +{
> + struct ext4_group_desc *desc;
> +
> + if (!ext4_has_group_desc_csum(sb))
> + return 0;
> +
> + desc = ext4_get_group_desc(sb, group, NULL);
> + if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))
> + return 1;
> +
> + return 0;
> +}
> +
> /*
> * The routine scans buddy structures (not bitmap!) from given order
> * to max order and tries to find big enough chunk to satisfy the req
> @@ -2060,7 +2075,15 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>
> /* We only do this if the grp has never been initialized */
> if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> - int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
> + int ret;
> +
> + /* cr=0/1 is a very optimistic search to find large
> + * good chunks almost for free. if buddy data is
> + * not ready, then this optimization makes no sense */

I guess it would also be helpful to add a comment summarizing the
discussion we had on why it should be OK to skip those groups, since
this could result in skipping the group which is closer to our inode.
I somehow couldn't recollect it completely.


> +
> + if (cr < 2 && !ext4_mb_uninit_on_disk(ac->ac_sb, group))
> + return 0;
> + ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
> if (ret)
> return ret;
> }
>

2020-05-15 08:57:29

by Alex Zhuravlev

Subject: Re: [PATCH 2/2] ext4: skip non-loaded groups at cr=0/1

Thanks for the review, answers inline..


> On 14 May 2020, at 13:04, Ritesh Harjani <[email protected]> wrote:
> Not needed in a commit msg.

OK

>
>> cr=0 is supposed to be an optimization to save CPU cycles, but if buddy data (in memory)
>> is not initialized then all this makes no sense as we have to do sync IO taking a lot of cycles.
>> also, at cr=0 mballoc doesn't store any avaibale chunk. cr=1 also skips groups using heuristic
> /s/avaibale/available/

OK

>
>> based on avg. fragment size. it's more useful to skip such groups and switch to cr=2 where
>> groups will be scanned for available chunks.
>> using sparse image and dm-slow virtual device of 120TB was simulated. then the image was
>> formatted and filled using debugfs to mark ~85% of available space as busy. mount process w/o
>> the patch couldn't complete in half an hour (according to vmstat it would take ~10-11 hours).
>> with the patch applied mount took ~20 seconds.
>
> I guess what we should edit the commit msg to explain that it is not the
> mount process but the very first write whose performance is improved via
> this patch.

Correct

>> + /* cr=0/1 is a very optimistic search to find large
>> + * good chunks almost for free. if buddy data is
>> + * not ready, then this optimization makes no sense */
>
> I guess it would also be helpful to add a comment summarizing the
> discussion we had on why it should be OK to skip those groups, since
> this could result in skipping the group which is closer to our inode.
> I somehow couldn't recollect it completely.

Please remind where the discussion took place? I must be missing that.

2020-05-17 07:55:57

by Andreas Dilger

Subject: Re: [PATCH 2/2] ext4: skip non-loaded groups at cr=0/1

On May 15, 2020, at 2:56 AM, Alex Zhuravlev <[email protected]> wrote:
> On 14 May 2020, at 13:04, Ritesh Harjani <[email protected]> wrote:
>>> + /* cr=0/1 is a very optimistic search to find large
>>> + * good chunks almost for free. if buddy data is
>>> + * not ready, then this optimization makes no sense */
>>
>> I guess it would also be helpful to add a comment summarizing the
>> discussion we had on why it should be OK to skip those groups, since
>> this could result in skipping the group which is closer to our inode.
>> I somehow couldn't recollect it completely.
>
> Please remind where the discussion took place? I must be missing that.

Alex, this discussion was during the ext4 weekly developer concall.
I can send you the details if you want to join for next week's call.

The summary of the discussion is that Ted was wondering what happens if
one thread is scanning the filesystem and triggering prefetch on the
groups, but doesn't find any loaded (so it skips the cr=0/1 passes
completely), and then does an allocation in its preferred group (assuming
there is space available there). What happens to allocations for other
inodes after this?

Presumably, the first thread's prefetch has loaded a bunch of groups, and
even if the second thread starts scanning for blocks at its preferred
group (a different one from the first thread), it will skip all of those
groups until it finds that the group(s) from the first inode allocation
are already in memory, and will proceed to allocate blocks there.

The question is whether this situation is affecting only a few inode
allocations for a short time after mount, or does this persist for a long
time? I think that it _should_ be only a short time, because these other
threads should all start prefetch on their preferred groups, so even if a
few inodes have their blocks allocated in the "wrong" group, it shouldn't
be a long term problem since the prefetched bitmaps will finish loading
and allow the blocks to be allocated, or skipped if group is fragmented.
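The warmup behavior described above can be illustrated with a toy timeline. Everything here is made up for illustration: prefetch of group g is simply assumed to complete at tick g, and the fallback policy is "first loaded group".

```c
#include <stdbool.h>

#define NGROUPS 8

/* Illustrative only: pretend prefetch of group g finishes at tick g,
 * so group 0 is available immediately and later groups come online
 * one tick at a time. */
static bool loaded_at(int group, int tick)
{
    return tick >= group;
}

/* Group actually used for an allocation whose preferred group is `goal`
 * at time `tick`: the goal itself once loaded, otherwise the first
 * already-loaded group (a "wrong" placement during warmup). */
static int alloc_group(int goal, int tick)
{
    if (loaded_at(goal, tick))
        return goal;
    for (int g = 0; g < NGROUPS; g++)
        if (loaded_at(g, tick))
            return g;
    return 0; /* unreachable: group 0 is always loaded */
}
```

Once `tick >= goal`, placements return to the preferred group, which matches the argument that misplaced allocations are confined to the short warmup window after mount.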


Cheers, Andreas



Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP

2020-05-20 08:43:31

by Alex Zhuravlev

Subject: Re: [PATCH 2/2] ext4: skip non-loaded groups at cr=0/1



> On 17 May 2020, at 10:55, Andreas Dilger <[email protected]> wrote:
>
> The question is whether this situation is affecting only a few inode
> allocations for a short time after mount, or does this persist for a long
> time? I think that it _should_ be only a short time, because these other
> threads should all start prefetch on their preferred groups, so even if a
> few inodes have their blocks allocated in the "wrong" group, it shouldn't
> be a long term problem since the prefetched bitmaps will finish loading
> and allow the blocks to be allocated, or skipped if group is fragmented.

Yes, that’s the idea - there is a short window when buddy data is being
populated. And for each “cluster” (not just a single group) prefetching
will be initiated by allocation.
It’s possible that some number of inodes will get “bad” blocks right
after mount.
If you think this is a bad scenario I can introduce a couple more things:
1) the prefetching thread discussed a few times already
2) let mballoc wait for the goal group to get ready - this is essentially
one more check in ext4_mb_good_group()
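A rough sketch of option 2, purely hypothetical: the predicate name and arguments below are invented for illustration and are not the actual mballoc change.

```c
#include <stdbool.h>

/* Hypothetical: decide whether an uninitialized group is worth blocking
 * on (doing the synchronous buddy init) rather than skipping it.
 * Only the allocation's goal group justifies the wait at cr=0/1;
 * everything else is deferred until the thorough cr>=2 passes. */
static bool worth_waiting(int cr, bool is_goal_group, bool buddy_loaded)
{
    if (buddy_loaded)
        return false;        /* nothing to wait for */
    if (is_goal_group)
        return true;         /* preserve locality for the goal group */
    return cr >= 2;          /* other groups: only on thorough passes */
}
```

The design intent is that waiting once, on a single group per allocation, bounds the extra latency during warmup while keeping the goal-group locality.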


Thanks, Alex

2020-05-20 19:34:36

by Andreas Dilger

Subject: Re: [PATCH 2/2] ext4: skip non-loaded groups at cr=0/1

On May 20, 2020, at 2:40 AM, Alex Zhuravlev <[email protected]> wrote:
>
>
>
>> On 17 May 2020, at 10:55, Andreas Dilger <[email protected]> wrote:
>>
>> The question is whether this situation is affecting only a few inode
>> allocations for a short time after mount, or does this persist for a long
>> time? I think that it _should_ be only a short time, because these other
>> threads should all start prefetch on their preferred groups, so even if a
>> few inodes have their blocks allocated in the "wrong" group, it shouldn't
>> be a long term problem since the prefetched bitmaps will finish loading
>> and allow the blocks to be allocated, or skipped if group is fragmented.
>
> Yes, that’s the idea - there is a short window when buddy data is being
> populated. And for each “cluster” (not just a single group) prefetching
> will be initiated by allocation.
>> It’s possible that some number of inodes will get “bad” blocks right
>> after mount.
>> If you think this is a bad scenario I can introduce a couple more things:
>> 1) the prefetching thread discussed a few times already
>> 2) let mballoc wait for the goal group to get ready - this is essentially
>> one more check in ext4_mb_good_group()

IMHO, this is an acceptable "cache warmup" behavior, not really different
than mballoc doing limited scanning when looking for any other allocation.
Since we already separate inode table blocks and data blocks into separate
groups due to flex_bg, I don't think any group is "better" than another,
so long as the allocations are avoiding worst-case fragmentation (i.e. a
series of one-block allocations).

Cheers, Andreas







2020-05-20 20:00:42

by Alex Zhuravlev

Subject: Re: [PATCH 2/2] ext4: skip non-loaded groups at cr=0/1



> On 20 May 2020, at 22:34, Andreas Dilger <[email protected]> wrote:
>
> On May 20, 2020, at 2:40 AM, Alex Zhuravlev <[email protected]> wrote:
>>
>>> On 17 May 2020, at 10:55, Andreas Dilger <[email protected]> wrote:
>>>
>>> The question is whether this situation is affecting only a few inode
>>> allocations for a short time after mount, or does this persist for a long
>>> time? I think that it _should_ be only a short time, because these other
>>> threads should all start prefetch on their preferred groups, so even if a
>>> few inodes have their blocks allocated in the "wrong" group, it shouldn't
>>> be a long term problem since the prefetched bitmaps will finish loading
>>> and allow the blocks to be allocated, or skipped if group is fragmented.
>>
>> Yes, that’s the idea - there is a short window when buddy data is being
>> populated. And for each “cluster” (not just a single group) prefetching
>> will be initiated by allocation.
>> It’s possible that some number of inodes will get “bad” blocks right
>> after mount.
>> If you think this is a bad scenario I can introduce a couple more things:
>> 1) the prefetching thread discussed a few times already
>> 2) let mballoc wait for the goal group to get ready - this is essentially
>> one more check in ext4_mb_good_group()
>
> IMHO, this is an acceptable "cache warmup" behavior, not really different
> than mballoc doing limited scanning when looking for any other allocation.
> Since we already separate inode table blocks and data blocks into separate
> groups due to flex_bg, I don't think any group is "better" than another,
> so long as the allocations are avoiding worst-case fragmentation (i.e. a
> series of one-block allocations).

I tend to agree, but I refreshed the patch to wait for the goal group
(one more check). Extra waiting for one group during warmup should be fine, IMO.

Thanks, Alex