2021-09-29 01:13:31

by 许春光

[permalink] [raw]
Subject: [PATCH v2] ext4: fix a possible ABBA deadlock dued to busy PA

From: Chunguang Xu <[email protected]>

We found on older kernel (3.10) that in the scenario of insufficient
disk space, system may trigger an ABBA deadlock problem, it seems that
this problem still exists in latest kernel, try to fix it here. The
main process triggered by this problem is that task A occupies the PA
and waits for the jbd2 transaction finish, the jbd2 transaction waits
for the completion of task B's IO (plug_list), but task B waits for
the release of PA by task A to finish discard, which indirectly forms
an ABBA deadlock. The related calltrace is as follows:

Task A
vfs_write
ext4_mb_new_blocks()
ext4_mb_mark_diskspace_used() JBD2
jbd2_journal_get_write_access() -> jbd2_journal_commit_transaction()
->schedule() filemap_fdatawait()
| |
| Task B |
| do_unlinkat() |
| ext4_evict_inode() |
| jbd2_journal_begin_ordered_truncate() |
| filemap_fdatawrite_range() |
| ext4_mb_new_blocks() |
-ext4_mb_discard_group_preallocations() <-----

Here, try to cancel ext4_mb_discard_group_preallocations() internal
retry due to PA busy, and do a limited number of retries inside
ext4_mb_discard_preallocations(), which can circumvent the above
problems, but also has some advantages:

1. Since the PA is in a busy state, if other groups have free PAs,
keeping the current PA may help to reduce fragmentation.
2. Continue to traverse forward instead of waiting for the current
group PA to be released. In most scenarios, the PA discard time
can be reduced.

However, in the case of smaller free space, if only a few groups have
space, then due to multiple traversals of the group, it may increase
CPU overhead. But in contrast, I feel that the overall benefit is
better than the cost.

Signed-off-by: Chunguang Xu <[email protected]>
---
v2: reset busy to zero before goto repeat.

fs/ext4/mballoc.c | 35 ++++++++++++++++-------------------
1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 72bfac2..72de6c1 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4814,7 +4814,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
*/
static noinline_for_stack int
ext4_mb_discard_group_preallocations(struct super_block *sb,
- ext4_group_t group, int needed)
+ ext4_group_t group, int needed, int *busy)
{
struct ext4_group_info *grp = ext4_get_group_info(sb, group);
struct buffer_head *bitmap_bh = NULL;
@@ -4822,8 +4822,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
struct list_head list;
struct ext4_buddy e4b;
int err;
- int busy = 0;
- int free, free_total = 0;
+ int free = 0;

mb_debug(sb, "discard preallocation for group %u\n", group);
if (list_empty(&grp->bb_prealloc_list))
@@ -4850,15 +4849,13 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;

INIT_LIST_HEAD(&list);
-repeat:
- free = 0;
ext4_lock_group(sb, group);
list_for_each_entry_safe(pa, tmp,
&grp->bb_prealloc_list, pa_group_list) {
spin_lock(&pa->pa_lock);
if (atomic_read(&pa->pa_count)) {
spin_unlock(&pa->pa_lock);
- busy = 1;
+ *busy = 1;
continue;
}
if (pa->pa_deleted) {
@@ -4898,22 +4895,13 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}

- free_total += free;
-
- /* if we still need more blocks and some PAs were used, try again */
- if (free_total < needed && busy) {
- ext4_unlock_group(sb, group);
- cond_resched();
- busy = 0;
- goto repeat;
- }
ext4_unlock_group(sb, group);
ext4_mb_unload_buddy(&e4b);
put_bh(bitmap_bh);
out_dbg:
mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
- free_total, group, grp->bb_free);
- return free_total;
+ free, group, grp->bb_free);
+ return free;
}

/*
@@ -5455,13 +5443,22 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
{
ext4_group_t i, ngroups = ext4_get_groups_count(sb);
int ret;
- int freed = 0;
+ int freed = 0, busy = 0;
+ int retry = 0;

trace_ext4_mb_discard_preallocations(sb, needed);
+ repeat:
+ retry++;
for (i = 0; i < ngroups && needed > 0; i++) {
- ret = ext4_mb_discard_group_preallocations(sb, i, needed);
+ ret = ext4_mb_discard_group_preallocations(sb, i, needed, &busy);
freed += ret;
needed -= ret;
+ cond_resched();
+ }
+
+ if (needed > 0 && busy && retry < 3) {
+ busy = 0;
+ goto repeat;
}

return freed;
--
1.8.3.1


2021-11-11 10:03:43

by Yujie Liu

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: fix a possible ABBA deadlock dued to busy PA

Hi Chunguang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on tytso-ext4/dev]
[also build test WARNING on v5.15 next-20211109]
[cannot apply to tytso-fscrypt/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/brookxu/ext4-fix-a-possible-ABBA-deadlock-dued-to-busy-PA/20210929-091410
base: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: i386-randconfig-c001-20210929 (attached as .config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project dc6e8dfdfe7efecfda318d43a06fae18b40eb498)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/0day-ci/linux/commit/5e956ecab47126f1d065a1ac6e6d5077e02e6f87
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review brookxu/ext4-fix-a-possible-ABBA-deadlock-dued-to-busy-PA/20210929-091410
git checkout 5e956ecab47126f1d065a1ac6e6d5077e02e6f87
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=i386 clang-analyzer

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


clang-analyzer warnings: (new ones prefixed by >>)

>> fs/ext4/mballoc.c:4849:3: warning: Value stored to 'needed' is never read [clang-analyzer-deadcode.DeadStores]
needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;
^

vim +/needed +4849 fs/ext4/mballoc.c

c9de560ded61faa Alex Tomas 2008-01-29 4805
c9de560ded61faa Alex Tomas 2008-01-29 4806 /*
c9de560ded61faa Alex Tomas 2008-01-29 4807 * releases all preallocations in given group
c9de560ded61faa Alex Tomas 2008-01-29 4808 *
c9de560ded61faa Alex Tomas 2008-01-29 4809 * first, we need to decide discard policy:
c9de560ded61faa Alex Tomas 2008-01-29 4810 * - when do we discard
c9de560ded61faa Alex Tomas 2008-01-29 4811 * 1) ENOSPC
c9de560ded61faa Alex Tomas 2008-01-29 4812 * - how many do we discard
c9de560ded61faa Alex Tomas 2008-01-29 4813 * 1) how many requested
c9de560ded61faa Alex Tomas 2008-01-29 4814 */
4ddfef7b41aebbb Eric Sandeen 2008-04-29 4815 static noinline_for_stack int
4ddfef7b41aebbb Eric Sandeen 2008-04-29 4816 ext4_mb_discard_group_preallocations(struct super_block *sb,
5e956ecab47126f Chunguang Xu 2021-09-29 4817 ext4_group_t group, int needed, int *busy)
c9de560ded61faa Alex Tomas 2008-01-29 4818 {
c9de560ded61faa Alex Tomas 2008-01-29 4819 struct ext4_group_info *grp = ext4_get_group_info(sb, group);
c9de560ded61faa Alex Tomas 2008-01-29 4820 struct buffer_head *bitmap_bh = NULL;
c9de560ded61faa Alex Tomas 2008-01-29 4821 struct ext4_prealloc_space *pa, *tmp;
c9de560ded61faa Alex Tomas 2008-01-29 4822 struct list_head list;
c9de560ded61faa Alex Tomas 2008-01-29 4823 struct ext4_buddy e4b;
c9de560ded61faa Alex Tomas 2008-01-29 4824 int err;
5e956ecab47126f Chunguang Xu 2021-09-29 4825 int free = 0;
c9de560ded61faa Alex Tomas 2008-01-29 4826
d3df14535f4a5b5 Ritesh Harjani 2020-05-10 4827 mb_debug(sb, "discard preallocation for group %u\n", group);
c9de560ded61faa Alex Tomas 2008-01-29 4828 if (list_empty(&grp->bb_prealloc_list))
bbc4ec77e9f9c7a Ritesh Harjani 2020-05-10 4829 goto out_dbg;
c9de560ded61faa Alex Tomas 2008-01-29 4830
574ca174c97f790 Theodore Ts'o 2008-07-11 4831 bitmap_bh = ext4_read_block_bitmap(sb, group);
9008a58e5dcee01 Darrick J. Wong 2015-10-17 4832 if (IS_ERR(bitmap_bh)) {
9008a58e5dcee01 Darrick J. Wong 2015-10-17 4833 err = PTR_ERR(bitmap_bh);
54d3adbc29f0c7c Theodore Ts'o 2020-03-28 4834 ext4_error_err(sb, -err,
54d3adbc29f0c7c Theodore Ts'o 2020-03-28 4835 "Error %d reading block bitmap for %u",
9008a58e5dcee01 Darrick J. Wong 2015-10-17 4836 err, group);
bbc4ec77e9f9c7a Ritesh Harjani 2020-05-10 4837 goto out_dbg;
c9de560ded61faa Alex Tomas 2008-01-29 4838 }
c9de560ded61faa Alex Tomas 2008-01-29 4839
c9de560ded61faa Alex Tomas 2008-01-29 4840 err = ext4_mb_load_buddy(sb, group, &e4b);
ce89f46cb833f89 Aneesh Kumar K.V 2008-07-23 4841 if (err) {
9651e6b2e20648d Konstantin Khlebnikov 2017-05-21 4842 ext4_warning(sb, "Error %d loading buddy information for %u",
9651e6b2e20648d Konstantin Khlebnikov 2017-05-21 4843 err, group);
ce89f46cb833f89 Aneesh Kumar K.V 2008-07-23 4844 put_bh(bitmap_bh);
bbc4ec77e9f9c7a Ritesh Harjani 2020-05-10 4845 goto out_dbg;
ce89f46cb833f89 Aneesh Kumar K.V 2008-07-23 4846 }
c9de560ded61faa Alex Tomas 2008-01-29 4847
c9de560ded61faa Alex Tomas 2008-01-29 4848 if (needed == 0)
7137d7a48e2213e Theodore Ts'o 2011-09-09 @4849 needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;
c9de560ded61faa Alex Tomas 2008-01-29 4850
c9de560ded61faa Alex Tomas 2008-01-29 4851 INIT_LIST_HEAD(&list);
c9de560ded61faa Alex Tomas 2008-01-29 4852 ext4_lock_group(sb, group);
c9de560ded61faa Alex Tomas 2008-01-29 4853 list_for_each_entry_safe(pa, tmp,
c9de560ded61faa Alex Tomas 2008-01-29 4854 &grp->bb_prealloc_list, pa_group_list) {
c9de560ded61faa Alex Tomas 2008-01-29 4855 spin_lock(&pa->pa_lock);
c9de560ded61faa Alex Tomas 2008-01-29 4856 if (atomic_read(&pa->pa_count)) {
c9de560ded61faa Alex Tomas 2008-01-29 4857 spin_unlock(&pa->pa_lock);
5e956ecab47126f Chunguang Xu 2021-09-29 4858 *busy = 1;
c9de560ded61faa Alex Tomas 2008-01-29 4859 continue;
c9de560ded61faa Alex Tomas 2008-01-29 4860 }
c9de560ded61faa Alex Tomas 2008-01-29 4861 if (pa->pa_deleted) {
c9de560ded61faa Alex Tomas 2008-01-29 4862 spin_unlock(&pa->pa_lock);
c9de560ded61faa Alex Tomas 2008-01-29 4863 continue;
c9de560ded61faa Alex Tomas 2008-01-29 4864 }
c9de560ded61faa Alex Tomas 2008-01-29 4865
c9de560ded61faa Alex Tomas 2008-01-29 4866 /* seems this one can be freed ... */
27bc446e2def38d brookxu 2020-08-17 4867 ext4_mb_mark_pa_deleted(sb, pa);
c9de560ded61faa Alex Tomas 2008-01-29 4868
70022da804f0f3f Ye Bin 2020-09-16 4869 if (!free)
70022da804f0f3f Ye Bin 2020-09-16 4870 this_cpu_inc(discard_pa_seq);
70022da804f0f3f Ye Bin 2020-09-16 4871
c9de560ded61faa Alex Tomas 2008-01-29 4872 /* we can trust pa_free ... */
c9de560ded61faa Alex Tomas 2008-01-29 4873 free += pa->pa_free;
c9de560ded61faa Alex Tomas 2008-01-29 4874
c9de560ded61faa Alex Tomas 2008-01-29 4875 spin_unlock(&pa->pa_lock);
c9de560ded61faa Alex Tomas 2008-01-29 4876
c9de560ded61faa Alex Tomas 2008-01-29 4877 list_del(&pa->pa_group_list);
c9de560ded61faa Alex Tomas 2008-01-29 4878 list_add(&pa->u.pa_tmp_list, &list);
c9de560ded61faa Alex Tomas 2008-01-29 4879 }
c9de560ded61faa Alex Tomas 2008-01-29 4880
c9de560ded61faa Alex Tomas 2008-01-29 4881 /* now free all selected PAs */
c9de560ded61faa Alex Tomas 2008-01-29 4882 list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
c9de560ded61faa Alex Tomas 2008-01-29 4883
c9de560ded61faa Alex Tomas 2008-01-29 4884 /* remove from object (inode or locality group) */
c9de560ded61faa Alex Tomas 2008-01-29 4885 spin_lock(pa->pa_obj_lock);
c9de560ded61faa Alex Tomas 2008-01-29 4886 list_del_rcu(&pa->pa_inode_list);
c9de560ded61faa Alex Tomas 2008-01-29 4887 spin_unlock(pa->pa_obj_lock);
c9de560ded61faa Alex Tomas 2008-01-29 4888
cc0fb9ad7dbc5a1 Aneesh Kumar K.V 2009-03-27 4889 if (pa->pa_type == MB_GROUP_PA)
3e1e5f501632460 Eric Sandeen 2010-10-27 4890 ext4_mb_release_group_pa(&e4b, pa);
c9de560ded61faa Alex Tomas 2008-01-29 4891 else
3e1e5f501632460 Eric Sandeen 2010-10-27 4892 ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
c9de560ded61faa Alex Tomas 2008-01-29 4893
c9de560ded61faa Alex Tomas 2008-01-29 4894 list_del(&pa->u.pa_tmp_list);
c9de560ded61faa Alex Tomas 2008-01-29 4895 call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
c9de560ded61faa Alex Tomas 2008-01-29 4896 }
c9de560ded61faa Alex Tomas 2008-01-29 4897
c9de560ded61faa Alex Tomas 2008-01-29 4898 ext4_unlock_group(sb, group);
e39e07fdfd98be8 Jing Zhang 2010-05-14 4899 ext4_mb_unload_buddy(&e4b);
c9de560ded61faa Alex Tomas 2008-01-29 4900 put_bh(bitmap_bh);
bbc4ec77e9f9c7a Ritesh Harjani 2020-05-10 4901 out_dbg:
d3df14535f4a5b5 Ritesh Harjani 2020-05-10 4902 mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
5e956ecab47126f Chunguang Xu 2021-09-29 4903 free, group, grp->bb_free);
5e956ecab47126f Chunguang Xu 2021-09-29 4904 return free;
c9de560ded61faa Alex Tomas 2008-01-29 4905 }
c9de560ded61faa Alex Tomas 2008-01-29 4906

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
.config.gz (35.37 kB)

2021-11-22 17:50:23

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: fix a possible ABBA deadlock dued to busy PA

Hello,

it seems this patch has fallen through the cracks.

On Wed 29-09-21 09:12:25, brookxu wrote:
> From: Chunguang Xu <[email protected]>
>
> We found on older kernel (3.10) that in the scenario of insufficient
> disk space, system may trigger an ABBA deadlock problem, it seems that
> this problem still exists in latest kernel, try to fix it here. The
> main process triggered by this problem is that task A occupies the PA
> and waits for the jbd2 transaction finish, the jbd2 transaction waits
> for the completion of task B's IO (plug_list), but task B waits for
> the release of PA by task A to finish discard, which indirectly forms
> an ABBA deadlock. The related calltrace is as follows:
>
> Task A
> vfs_write
> ext4_mb_new_blocks()
> ext4_mb_mark_diskspace_used() JBD2
> jbd2_journal_get_write_access() -> jbd2_journal_commit_transaction()
> ->schedule() filemap_fdatawait()
> | |
> | Task B |
> | do_unlinkat() |
> | ext4_evict_inode() |
> | jbd2_journal_begin_ordered_truncate() |
> | filemap_fdatawrite_range() |
> | ext4_mb_new_blocks() |
> -ext4_mb_discard_group_preallocations() <-----
>
> Here, try to cancel ext4_mb_discard_group_preallocations() internal
> retry due to PA busy, and do a limited number of retries inside
> ext4_mb_discard_preallocations(), which can circumvent the above
> problems, but also has some advantages:
>
> 1. Since the PA is in a busy state, if other groups have free PAs,
> keeping the current PA may help to reduce fragmentation.
> 2. Continue to traverse forward instead of waiting for the current
> group PA to be released. In most scenarios, the PA discard time
> can be reduced.
>
> However, in the case of smaller free space, if only a few groups have
> space, then due to multiple traversals of the group, it may increase
> CPU overhead. But in contrast, I feel that the overall benefit is
> better than the cost.
>
> Signed-off-by: Chunguang Xu <[email protected]>

Thanks for the patch! I guess this is a reasonable middle-ground so feel
free to add:

Reviewed-by: Jan Kara <[email protected]>

I think Ritesh was last touching this logic so let's check he doesn't see
anything wrong with this change. Ritesh?

Honza


> ---
> v2: reset busy to zero before goto repeat.
>
> fs/ext4/mballoc.c | 35 ++++++++++++++++-------------------
> 1 file changed, 16 insertions(+), 19 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 72bfac2..72de6c1 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4814,7 +4814,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
> */
> static noinline_for_stack int
> ext4_mb_discard_group_preallocations(struct super_block *sb,
> - ext4_group_t group, int needed)
> + ext4_group_t group, int needed, int *busy)
> {
> struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> struct buffer_head *bitmap_bh = NULL;
> @@ -4822,8 +4822,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
> struct list_head list;
> struct ext4_buddy e4b;
> int err;
> - int busy = 0;
> - int free, free_total = 0;
> + int free = 0;
>
> mb_debug(sb, "discard preallocation for group %u\n", group);
> if (list_empty(&grp->bb_prealloc_list))
> @@ -4850,15 +4849,13 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
> needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;
>
> INIT_LIST_HEAD(&list);
> -repeat:
> - free = 0;
> ext4_lock_group(sb, group);
> list_for_each_entry_safe(pa, tmp,
> &grp->bb_prealloc_list, pa_group_list) {
> spin_lock(&pa->pa_lock);
> if (atomic_read(&pa->pa_count)) {
> spin_unlock(&pa->pa_lock);
> - busy = 1;
> + *busy = 1;
> continue;
> }
> if (pa->pa_deleted) {
> @@ -4898,22 +4895,13 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
> call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> }
>
> - free_total += free;
> -
> - /* if we still need more blocks and some PAs were used, try again */
> - if (free_total < needed && busy) {
> - ext4_unlock_group(sb, group);
> - cond_resched();
> - busy = 0;
> - goto repeat;
> - }
> ext4_unlock_group(sb, group);
> ext4_mb_unload_buddy(&e4b);
> put_bh(bitmap_bh);
> out_dbg:
> mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
> - free_total, group, grp->bb_free);
> - return free_total;
> + free, group, grp->bb_free);
> + return free;
> }
>
> /*
> @@ -5455,13 +5443,22 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
> {
> ext4_group_t i, ngroups = ext4_get_groups_count(sb);
> int ret;
> - int freed = 0;
> + int freed = 0, busy = 0;
> + int retry = 0;
>
> trace_ext4_mb_discard_preallocations(sb, needed);
> + repeat:
> + retry++;
> for (i = 0; i < ngroups && needed > 0; i++) {
> - ret = ext4_mb_discard_group_preallocations(sb, i, needed);
> + ret = ext4_mb_discard_group_preallocations(sb, i, needed, &busy);
> freed += ret;
> needed -= ret;
> + cond_resched();
> + }
> +
> + if (needed > 0 && busy && retry < 3) {
> + busy = 0;
> + goto repeat;
> }
>
> return freed;
> --
> 1.8.3.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-11-23 01:26:27

by 许春光

[permalink] [raw]
Subject: Re: [PATCH v2] ext4: fix a possible ABBA deadlock dued to busy PA

Hi Jan & riteshh:

Thanks for your time to review this patch. But kernel test robot has
reported an warning for the v2 version. I have fixed it and submitted
the v3 version, maybe we should goto v3 version, thanks all.

Jan Kara wrote on 2021/11/23 1:50:
> Hello,
>
> it seems this patch has fallen through the cracks.
>
> On Wed 29-09-21 09:12:25, brookxu wrote:
>> From: Chunguang Xu <[email protected]>
>>
>> We found on older kernel (3.10) that in the scenario of insufficient
>> disk space, system may trigger an ABBA deadlock problem, it seems that
>> this problem still exists in latest kernel, try to fix it here. The
>> main process triggered by this problem is that task A occupies the PA
>> and waits for the jbd2 transaction finish, the jbd2 transaction waits
>> for the completion of task B's IO (plug_list), but task B waits for
>> the release of PA by task A to finish discard, which indirectly forms
>> an ABBA deadlock. The related calltrace is as follows:
>>
>> Task A
>> vfs_write
>> ext4_mb_new_blocks()
>> ext4_mb_mark_diskspace_used() JBD2
>> jbd2_journal_get_write_access() -> jbd2_journal_commit_transaction()
>> ->schedule() filemap_fdatawait()
>> | |
>> | Task B |
>> | do_unlinkat() |
>> | ext4_evict_inode() |
>> | jbd2_journal_begin_ordered_truncate() |
>> | filemap_fdatawrite_range() |
>> | ext4_mb_new_blocks() |
>> -ext4_mb_discard_group_preallocations() <-----
>>
>> Here, try to cancel ext4_mb_discard_group_preallocations() internal
>> retry due to PA busy, and do a limited number of retries inside
>> ext4_mb_discard_preallocations(), which can circumvent the above
>> problems, but also has some advantages:
>>
>> 1. Since the PA is in a busy state, if other groups have free PAs,
>> keeping the current PA may help to reduce fragmentation.
>> 2. Continue to traverse forward instead of waiting for the current
>> group PA to be released. In most scenarios, the PA discard time
>> can be reduced.
>>
>> However, in the case of smaller free space, if only a few groups have
>> space, then due to multiple traversals of the group, it may increase
>> CPU overhead. But in contrast, I feel that the overall benefit is
>> better than the cost.
>>
>> Signed-off-by: Chunguang Xu <[email protected]>
>
> Thanks for the patch! I guess this is a reasonable middle-ground so feel
> free to add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> I think Ritesh was last touching this logic so let's check he doesn't see
> anything wrong with this change. Ritesh?
>
> Honza
>
>
>> ---
>> v2: reset busy to zero before goto repeat.
>>
>> fs/ext4/mballoc.c | 35 ++++++++++++++++-------------------
>> 1 file changed, 16 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
>> index 72bfac2..72de6c1 100644
>> --- a/fs/ext4/mballoc.c
>> +++ b/fs/ext4/mballoc.c
>> @@ -4814,7 +4814,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>> */
>> static noinline_for_stack int
>> ext4_mb_discard_group_preallocations(struct super_block *sb,
>> - ext4_group_t group, int needed)
>> + ext4_group_t group, int needed, int *busy)
>> {
>> struct ext4_group_info *grp = ext4_get_group_info(sb, group);
>> struct buffer_head *bitmap_bh = NULL;
>> @@ -4822,8 +4822,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>> struct list_head list;
>> struct ext4_buddy e4b;
>> int err;
>> - int busy = 0;
>> - int free, free_total = 0;
>> + int free = 0;
>>
>> mb_debug(sb, "discard preallocation for group %u\n", group);
>> if (list_empty(&grp->bb_prealloc_list))
>> @@ -4850,15 +4849,13 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>> needed = EXT4_CLUSTERS_PER_GROUP(sb) + 1;
>>
>> INIT_LIST_HEAD(&list);
>> -repeat:
>> - free = 0;
>> ext4_lock_group(sb, group);
>> list_for_each_entry_safe(pa, tmp,
>> &grp->bb_prealloc_list, pa_group_list) {
>> spin_lock(&pa->pa_lock);
>> if (atomic_read(&pa->pa_count)) {
>> spin_unlock(&pa->pa_lock);
>> - busy = 1;
>> + *busy = 1;
>> continue;
>> }
>> if (pa->pa_deleted) {
>> @@ -4898,22 +4895,13 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
>> call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
>> }
>>
>> - free_total += free;
>> -
>> - /* if we still need more blocks and some PAs were used, try again */
>> - if (free_total < needed && busy) {
>> - ext4_unlock_group(sb, group);
>> - cond_resched();
>> - busy = 0;
>> - goto repeat;
>> - }
>> ext4_unlock_group(sb, group);
>> ext4_mb_unload_buddy(&e4b);
>> put_bh(bitmap_bh);
>> out_dbg:
>> mb_debug(sb, "discarded (%d) blocks preallocated for group %u bb_free (%d)\n",
>> - free_total, group, grp->bb_free);
>> - return free_total;
>> + free, group, grp->bb_free);
>> + return free;
>> }
>>
>> /*
>> @@ -5455,13 +5443,22 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
>> {
>> ext4_group_t i, ngroups = ext4_get_groups_count(sb);
>> int ret;
>> - int freed = 0;
>> + int freed = 0, busy = 0;
>> + int retry = 0;
>>
>> trace_ext4_mb_discard_preallocations(sb, needed);
>> + repeat:
>> + retry++;
>> for (i = 0; i < ngroups && needed > 0; i++) {
>> - ret = ext4_mb_discard_group_preallocations(sb, i, needed);
>> + ret = ext4_mb_discard_group_preallocations(sb, i, needed, &busy);
>> freed += ret;
>> needed -= ret;
>> + cond_resched();
>> + }
>> +
>> + if (needed > 0 && busy && retry < 3) {
>> + busy = 0;
>> + goto repeat;
>> }
>>
>> return freed;
>> --
>> 1.8.3.1
>>