2020-05-20 06:42:44

by Ritesh Harjani

[permalink] [raw]
Subject: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

There could be a race in ext4_mb_discard_group_preallocations(),
where the 1st thread may iterate through the group's bb_prealloc_list,
remove all the PAs, and add them to the function's local list head.
Now if a 2nd thread comes in to discard the group preallocations,
it will see that group->bb_prealloc_list is empty and will return 0.

Consider a case where we have only a small number of groups
(e.g. just group 0):
this may even return an -ENOSPC error from ext4_mb_new_blocks()
(from where we call ext4_mb_discard_group_preallocations()).
But that is wrong, since the 2nd thread should have waited for the
1st thread to release all the PAs and then retried the allocation,
given that the 1st thread was going to discard the PAs anyway.

The algorithm using this percpu seq counter is as follows:
1. We sample the percpu discard_pa_seq counter before trying for block
allocation in ext4_mb_new_blocks().
2. We increment this percpu discard_pa_seq counter when we either allocate
or free these blocks, i.e. while marking those blocks as used/free in
mb_mark_used()/mb_free_blocks().
3. We also increment this percpu seq counter when we successfully identify
that the bb_prealloc_list is not empty and hence proceed to discard
those PAs inside ext4_mb_discard_group_preallocations().

Now, to make sure that the regular fast path of block allocation is not
affected, as a small optimization we initially sample the percpu seq
counter only on the local cpu. Only when the block allocation fails and
the number of freed blocks found was 0 do we sample the percpu seq
counter across all cpus, using the function ext4_get_discard_pa_seq_sum()
below. This happens after making sure that all the PAs on
grp->bb_prealloc_list got freed, or that the list was already empty.

It could well be asked why we don't just check grp->bb_free to see
whether there are any free blocks to be allocated. Here are the two
concerns which were discussed:

1. If for some reason the blocks available in the group are not
appropriate for the allocation logic (say e.g. for
EXT4_MB_HINT_GOAL_ONLY, although this is not yet implemented), then
the retry logic may result in infinite looping, since grp->bb_free is
non-zero.

2. Also, before preallocation was clubbed with block allocation under
the same ext4_lock_group(), there were many races in which grp->bb_free
could not be relied upon.

Due to the above, this patch uses the discard_pa_seq logic to determine
whether we should retry the block allocation. Say there are n threads
trying for block allocation and none of them could allocate or discard
any blocks; then all n threads will fail the block allocation and
return an -ENOSPC error (since the seq counter for all of them will
match, as no block allocation/discard was done during that duration).

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/mballoc.c | 56 ++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 51 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b75408d72773..754ff9f65199 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,35 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
ext4_group_t group);
static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);

+/*
+ * The algorithm using this percpu seq counter goes below:
+ * 1. We sample the percpu discard_pa_seq counter before trying for block
+ * allocation in ext4_mb_new_blocks().
+ * 2. We increment this percpu discard_pa_seq counter when we either allocate
+ * or free these blocks i.e. while marking those blocks as used/free in
+ * mb_mark_used()/mb_free_blocks().
+ * 3. We also increment this percpu seq counter when we successfully identify
+ * that the bb_prealloc_list is not empty and hence proceed for discarding
+ * of those PAs inside ext4_mb_discard_group_preallocations().
+ *
+ * Now to make sure that the regular fast path of block allocation is not
+ * affected, as a small optimization we only sample the percpu seq counter
+ * on that cpu. Only when the block allocation fails and when freed blocks
+ * found were 0, that is when we sample percpu seq counter for all cpus using
+ * below function ext4_get_discard_pa_seq_sum(). This happens after making
+ * sure that all the PAs on grp->bb_prealloc_list got freed or if it's empty.
+ */
+static DEFINE_PER_CPU(u64, discard_pa_seq);
+static inline u64 ext4_get_discard_pa_seq_sum(void)
+{
+ int __cpu;
+ u64 __seq = 0;
+
+ for_each_possible_cpu(__cpu)
+ __seq += per_cpu(discard_pa_seq, __cpu);
+ return __seq;
+}
+
static inline void *mb_correct_addr_and_bit(int *bit, void *addr)
{
#if BITS_PER_LONG == 64
@@ -1462,6 +1491,7 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
mb_check_buddy(e4b);
mb_free_blocks_double(inode, e4b, first, count);

+ this_cpu_inc(discard_pa_seq);
e4b->bd_info->bb_free += count;
if (first < e4b->bd_info->bb_first_free)
e4b->bd_info->bb_first_free = first;
@@ -1603,6 +1633,7 @@ static int mb_mark_used(struct ext4_buddy *e4b, struct ext4_free_extent *ex)
mb_check_buddy(e4b);
mb_mark_used_double(e4b, start, len);

+ this_cpu_inc(discard_pa_seq);
e4b->bd_info->bb_free -= len;
if (e4b->bd_info->bb_first_free == start)
e4b->bd_info->bb_first_free += len;
@@ -3962,6 +3993,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
INIT_LIST_HEAD(&list);
repeat:
ext4_lock_group(sb, group);
+ this_cpu_inc(discard_pa_seq);
list_for_each_entry_safe(pa, tmp,
&grp->bb_prealloc_list, pa_group_list) {
spin_lock(&pa->pa_lock);
@@ -4544,14 +4576,26 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
}

static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
- struct ext4_allocation_context *ac)
+ struct ext4_allocation_context *ac, u64 *seq)
{
int freed;
+ u64 seq_retry = 0;
+ bool ret = false;

freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
- if (freed)
- return true;
- return false;
+ if (freed) {
+ ret = true;
+ goto out_dbg;
+ }
+ seq_retry = ext4_get_discard_pa_seq_sum();
+ if (seq_retry != *seq) {
+ *seq = seq_retry;
+ ret = true;
+ }
+
+out_dbg:
+ mb_debug(sb, "freed %d, retry ? %s\n", freed, ret ? "yes" : "no");
+ return ret;
}

/*
@@ -4568,6 +4612,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ext4_fsblk_t block = 0;
unsigned int inquota = 0;
unsigned int reserv_clstrs = 0;
+ u64 seq;

might_sleep();
sb = ar->inode->i_sb;
@@ -4630,6 +4675,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
}

ac->ac_op = EXT4_MB_HISTORY_PREALLOC;
+ seq = *this_cpu_ptr(&discard_pa_seq);
if (!ext4_mb_use_preallocated(ac)) {
ac->ac_op = EXT4_MB_HISTORY_ALLOC;
ext4_mb_normalize_request(ac, ar);
@@ -4666,7 +4712,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ar->len = ac->ac_b_ex.fe_len;
}
} else {
- if (ext4_mb_discard_preallocations_should_retry(sb, ac))
+ if (ext4_mb_discard_preallocations_should_retry(sb, ac, &seq))
goto repeat;
/*
* If block allocation fails then the pa allocated above
--
2.21.0


2020-06-03 06:49:37

by Marek Szyprowski

[permalink] [raw]
Subject: Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

Hi Ritesh,

On 20.05.2020 08:40, Ritesh Harjani wrote:
> There could be a race in function ext4_mb_discard_group_preallocations()
> where the 1st thread may iterate through group's bb_prealloc_list and
> remove all the PAs and add to function's local list head.
> Now if the 2nd thread comes in to discard the group preallocations,
> it will see that the group->bb_prealloc_list is empty and will return 0.
>
> [...]
>
> Signed-off-by: Ritesh Harjani <[email protected]>

This patch landed in yesterday's linux-next and causes following
WARNING/BUG on various Samsung Exynos-based boards:

 BUG: using smp_processor_id() in preemptible [00000000] code: logsave/552
 caller is ext4_mb_new_blocks+0x404/0x1300
 CPU: 3 PID: 552 Comm: logsave Tainted: G        W 5.7.0-next-20200602 #4
 Hardware name: Samsung Exynos (Flattened Device Tree)
 [<c011184c>] (unwind_backtrace) from [<c010d250>] (show_stack+0x10/0x14)
 [<c010d250>] (show_stack) from [<c05185fc>] (dump_stack+0xbc/0xe8)
 [<c05185fc>] (dump_stack) from [<c0ab689c>]
(check_preemption_disabled+0xec/0xf0)
 [<c0ab689c>] (check_preemption_disabled) from [<c03a7b38>]
(ext4_mb_new_blocks+0x404/0x1300)
 [<c03a7b38>] (ext4_mb_new_blocks) from [<c03780f4>]
(ext4_ext_map_blocks+0xc7c/0x10f4)
 [<c03780f4>] (ext4_ext_map_blocks) from [<c03902b4>]
(ext4_map_blocks+0x118/0x5a0)
 [<c03902b4>] (ext4_map_blocks) from [<c0394524>]
(mpage_map_and_submit_extent+0x134/0x9c0)
 [<c0394524>] (mpage_map_and_submit_extent) from [<c03958c8>]
(ext4_writepages+0xb18/0xcb0)
 [<c03958c8>] (ext4_writepages) from [<c02588ec>] (do_writepages+0x20/0x94)
 [<c02588ec>] (do_writepages) from [<c024c688>]
(__filemap_fdatawrite_range+0xac/0xcc)
 [<c024c688>] (__filemap_fdatawrite_range) from [<c024c700>]
(filemap_flush+0x28/0x30)
 [<c024c700>] (filemap_flush) from [<c037eedc>]
(ext4_release_file+0x70/0xac)
 [<c037eedc>] (ext4_release_file) from [<c02c3310>] (__fput+0xc4/0x234)
 [<c02c3310>] (__fput) from [<c014eb74>] (task_work_run+0x88/0xcc)
 [<c014eb74>] (task_work_run) from [<c010ca40>]
(do_work_pending+0x52c/0x5cc)
 [<c010ca40>] (do_work_pending) from [<c0100094>]
(slow_work_pending+0xc/0x20)
 Exception stack(0xec9c1fb0 to 0xec9c1ff8)
 1fa0:                                     00000000 0044969c 0000006c
00000000
 1fc0: 00000001 0045a014 00000241 00000006 00000000 be91abb4 be91abb0
0000000c
 1fe0: 00459fd4 be91ab90 00448ed4 b6e43444 60000050 00000003

Please let me know how I can help debug this issue. The above log is
from linux-next 20200602, compiled from exynos_defconfig and running on an
ARM 32bit Samsung Exynos4412-based Odroid U3 board; however, I don't think
this is an Exynos-specific issue. I have probably observed it because
exynos_defconfig has most of the debugging options enabled.

> ...

Best regards
--
Marek Szyprowski, PhD
Samsung R&D Institute Poland

2020-06-03 10:12:02

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

Hi Marek,

On 6/3/20 12:18 PM, Marek Szyprowski wrote:
> Hi Ritesh,
>
> On 20.05.2020 08:40, Ritesh Harjani wrote:
>> [...]
>
> This patch landed in yesterday's linux-next and causes following
> WARNING/BUG on various Samsung Exynos-based boards:
>
>  BUG: using smp_processor_id() in preemptible [00000000] code: logsave/552
>  caller is ext4_mb_new_blocks+0x404/0x1300

Yes, this is being discussed in the community.
I have submitted a patch which should help fix this warning msg.
Feel free to give this a try on your setup.

https://marc.info/?l=linux-ext4&m=159110574414645&w=2


-ritesh

2020-06-09 10:21:45

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

On Wed, Jun 03, 2020 at 03:40:16PM +0530, Ritesh Harjani wrote:
> Yes, this is being discussed in the community.
> I have submitted a patch which should help fix this warning msg.
> Feel free to give this a try on your setup.
>
> https://marc.info/?l=linux-ext4&m=159110574414645&w=2

I just triggered the same thing here too. Looking at your fix and not
even pretending to know what's going on with that percpu sequence
counting, I can't help but wonder why you wanna do:

seq = *raw_cpu_ptr(&discard_pa_seq);

instead of simply doing:

seq = this_cpu_read(discard_pa_seq);

?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2020-06-09 10:50:46

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCHv5 3/5] ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling



On 6/9/20 3:50 PM, Borislav Petkov wrote:
> On Wed, Jun 03, 2020 at 03:40:16PM +0530, Ritesh Harjani wrote:
>> Yes, this is being discussed in the community.
>> I have submitted a patch which should help fix this warning msg.
>> Feel free to give this a try on your setup.
>>
>> https://marc.info/?l=linux-ext4&m=159110574414645&w=2
>
> I just triggered the same thing here too. Looking at your fix and not
> even pretending to know what's going on with that percpu sequence
> counting, I can't help but wonder why you wanna do:
>
> seq = *raw_cpu_ptr(&discard_pa_seq);
>
> instead of simply doing:
>
> seq = this_cpu_read(discard_pa_seq);
>

That's correct. Thanks for pointing that out.
I guess in my development version of the code I had seq as a u64
pointer, which I later changed to a plain u64 variable, but I
continued using the pcpu ptr APIs for it.

Let me re-submit that patch with your Suggested-by tag and the
corresponding changes.

-riteshh