2020-05-10 07:09:24

by Ritesh Harjani

Subject: [RFCv4 0/6] Improve ext4 handling of ENOSPC with multi-threaded use-case

Hello All,

v3 -> v4:
1. Split the code cleanups and debug improvements out into a separate patch series.
2. Dropped the rcu_barrier() approach, since it caused some latency
in my testing of ENOSPC handling.
3. This patch series takes a different approach to improve the multi-threaded
ENOSPC handling in the ext4 mballoc code. The mail below gives more details.


Background
==========
Consider a case where your disk is close to full, but enough space still
remains for your multi-threaded application to run. Now when this
application's threads try to write in parallel (e.g. a sparse file followed
by mmap writes, or even fallocating multiple files), then with the current
ext4 multi-block allocator code the application may get an ENOSPC error in
some cases. Examining the disk space at this time, we see there is still
sufficient space for the application to continue running.

Additional info:
================
1. Our internal test team was easily able to reproduce this ENOSPC error on
an upstream kernel with a 2GB ext4 image and 64K blocksize. They didn't try
above 2GB and reported the issue directly to the dev team. On examining the
free space when the application got ENOSPC, more than 50% of the filesystem
was still free in some cases.

2. For debugging/development of these patches, I used the script below [1]
to trigger this issue quite frequently on a 64K blocksize setup with a
240MB ext4 image.


Summary of patches and problem with current design
==================================================
There are 3 main problems which these patches try to address, and hence
improve the ENOSPC handling in ext4's multi-block allocator code.

1. Patch-2: Earlier we checked whether a group is good (i.e. whether it has
enough free blocks to serve the request) without taking the group's lock.
This could result in a race where, if another thread is discarding the
group's prealloc list, the allocating thread will not consider those
about-to-be-freed blocks, will report that the group is not fit for
allocation, and thus may eventually fail with an ENOSPC error (see
sketch (1) after this list).

2. Patch-4: The discard-PA algorithm only scans the PA list to free up the
additional blocks which got added to the PA. This is done by the same
thread-A which at first couldn't allocate any blocks. But there is a window
where, once the blocks are allocated (say by some other thread-B
previously), we drop the group's lock and only then check whether some of
these blocks could be added to the prealloc list of the group from which we
allocated. Only after that do we take the lock again and add these
additional blocks allocated by thread-B to the PA list. So if thread-A
scans the PA list within this interval, there is a possibility that it
won't find any blocks added to the PA list and hence may return an ENOSPC
error.
Hence this patch adds those additional blocks to the PA list right after
the blocks are marked as used, with the same group spinlock held (see
sketch (2) after this list).

3. Patch-6: Introduces a per-cpu discard_pa_seq counter which is
incremented whenever there is a block allocation/free, or when the
discarding of any group's PA list is started. With this we can know when to
stop the retry logic and return an ENOSPC error if there is actually no
free space left.
There is an optimization in the block allocation fast path with this
approach: before starting the block allocation we only sample the per-cpu
seq count on the local cpu. Only when the allocation fails, and discard
couldn't free up any blocks from any group's PA list, do we sample the sum
of the per-cpu seq counters over all possible cpus to check whether we need
to retry (see sketch (3) after this list).
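
Sketch (1), for Patch-2: a minimal illustration of the locking change,
with simplified logic rather than the exact patch code:

	struct super_block *sb = ac->ac_sb;
	struct ext4_group_info *grp = ext4_get_group_info(sb, group);
	bool good;

	/*
	 * Old, racy check (no lock): a parallel
	 * ext4_mb_discard_group_preallocations() may be just about to
	 * return blocks to this group, and bb_free does not reflect
	 * them yet:
	 *
	 *	if (grp->bb_free < ac->ac_g_ex.fe_len)
	 *		return false;
	 *
	 * New (sketch): sample bb_free under ext4_lock_group(), which
	 * the alloc/discard paths also hold while updating bb_free.
	 */
	ext4_lock_group(sb, group);
	good = (grp->bb_free >= ac->ac_g_ex.fe_len);
	ext4_unlock_group(sb, group);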
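
Sketch (2), for Patch-4: the ordering change in rough form (helper and
field names are from mballoc, but this is not the literal diff):

	/* Old ordering: two separate lock sections. */
	ext4_lock_group(sb, group);
	mb_mark_used(&e4b, &ac->ac_b_ex);	/* blocks marked as used */
	ext4_unlock_group(sb, group);
	/*
	 * <-- window: thread-A may scan an empty PA list here and
	 *     wrongly conclude there is nothing left to discard.
	 */
	ext4_lock_group(sb, group);
	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
	ext4_unlock_group(sb, group);

	/*
	 * New ordering: both steps under one hold of the group lock, so
	 * the PA list is never observed without the just-added blocks.
	 */
	ext4_lock_group(sb, group);
	mb_mark_used(&e4b, &ac->ac_b_ex);
	list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
	ext4_unlock_group(sb, group);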
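
Sketch (3), for Patch-6: the percpu sequence counter pattern in isolation
(hypothetical helper names; the real patch wires this into mballoc's
allocation/discard/retry paths):

	static DEFINE_PER_CPU(u64, discard_pa_seq);

	/* Bumped on every block alloc/free and when a PA discard starts. */
	static inline void mb_discard_pa_seq_inc(void)
	{
		this_cpu_inc(discard_pa_seq);
	}

	/* Fast path: sample only the local cpu's counter before allocating. */
	static inline u64 mb_discard_pa_seq_sample(void)
	{
		return this_cpu_read(discard_pa_seq);
	}

	/*
	 * Slow path, taken only after both allocation and discard failed:
	 * sum over all possible cpus. If the sum has advanced past the
	 * value sampled before allocating, some other cpu freed/allocated
	 * blocks in between and a retry may succeed; otherwise there
	 * really is no space left and ENOSPC is accurate.
	 */
	static u64 mb_discard_pa_seq_sum(void)
	{
		u64 seq = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			seq += per_cpu(discard_pa_seq, cpu);
		return seq;
	}

This keeps the fast path touching only a cpu-local counter; the all-cpu
sum is paid only on the failure path, right before deciding between a
retry and returning ENOSPC.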


Testing:
========
Tested fstests ("-g auto") with the default blocksize of 4K and with
bs == PAGESIZE. No new failures were reported with this patch series.

NOTE:
1. This patch series is based on top of the mballoc code cleanup patch
series posted at [2].
2. Patch-2 & Patch-3 are intentionally kept separate for better reviewer
attention on what each patch is trying to address.

References:
===========
[v3]: https://patchwork.ozlabs.org/project/linux-ext4/cover/[email protected]/
[1]: https://github.com/riteshharjani/LinuxStudy/blob/master/tools/test_mballoc.sh
[2]: https://patchwork.ozlabs.org/project/linux-ext4/cover/[email protected]/


Ritesh Harjani (6):
ext4: mballoc: Refactor ext4_mb_good_group()
ext4: mballoc: Use ext4_lock_group() around calculations involving bb_free
ext4: mballoc: Optimize ext4_mb_good_group_nolock further if grp needs init
ext4: mballoc: Add blocks to PA list under same spinlock after allocating blocks
ext4: mballoc: Refactor ext4_mb_discard_preallocations()
ext4: mballoc: Introduce pcpu seqcnt for freeing PA to improve ENOSPC handling

fs/ext4/mballoc.c | 249 +++++++++++++++++++++++++++++++++-------------
1 file changed, 179 insertions(+), 70 deletions(-)

--
2.21.0


2020-05-10 07:09:28

by Ritesh Harjani

Subject: [RFCv4 1/6] ext4: mballoc: Refactor ext4_mb_good_group()

ext4_mb_good_group() definition was changed some time back
and now it even initializes the buddy cache (via ext4_mb_init_group()),
in case EXT4_MB_GRP_NEED_INIT() is true for a group.
Note that ext4_mb_init_group() could sleep and so must not be called
with a spinlock held.
This is fine as of now because ext4_mb_good_group() is called before
loading the buddy bitmap without ext4_lock_group() held,
and is called again after loading the bitmap, this time with
ext4_lock_group() held.
But this whole thing is still confusing.

So this patch refactors out ext4_mb_good_group_nolock(), which should be
called without holding ext4_lock_group().
Also, in further patches we hold the spinlock (ext4_lock_group()) while
doing any calculations which involve grp->bb_free or grp->bb_fragments.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/mballoc.c | 80 ++++++++++++++++++++++++++++++-----------------
1 file changed, 52 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 33a69424942c..da11a4a738bd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2066,15 +2066,16 @@ void ext4_mb_scan_aligned(struct ext4_allocation_context *ac,
}

/*
- * This is now called BEFORE we load the buddy bitmap.
+ * This is also called BEFORE we load the buddy bitmap.
* Returns either 1 or 0 indicating that the group is either suitable
- * for the allocation or not. In addition it can also return negative
- * error code when something goes wrong.
+ * for the allocation or not. This should be called with ext4_lock_group()
+ * held to protect grp->bb_free from getting changed due to another
+ * allocation/discard happening in parallel.
*/
-static int ext4_mb_good_group(struct ext4_allocation_context *ac,
+static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
ext4_group_t group, int cr)
{
- unsigned free, fragments;
+ ext4_grpblk_t free, fragments;
int flex_size = ext4_flex_bg_size(EXT4_SB(ac->ac_sb));
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);

@@ -2082,23 +2083,16 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,

free = grp->bb_free;
if (free == 0)
- return 0;
+ return false;
if (cr <= 2 && free < ac->ac_g_ex.fe_len)
- return 0;
+ return false;

if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
- return 0;
-
- /* We only do this if the grp has never been initialized */
- if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
- int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
- if (ret)
- return ret;
- }
+ return false;

fragments = grp->bb_fragments;
if (fragments == 0)
- return 0;
+ return false;

switch (cr) {
case 0:
@@ -2108,31 +2102,63 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
if ((ac->ac_flags & EXT4_MB_HINT_DATA) &&
(flex_size >= EXT4_FLEX_SIZE_DIR_ALLOC_SCHEME) &&
((group % flex_size) == 0))
- return 0;
+ return false;

if ((ac->ac_2order > ac->ac_sb->s_blocksize_bits+1) ||
(free / fragments) >= ac->ac_g_ex.fe_len)
- return 1;
+ return true;

if (grp->bb_largest_free_order < ac->ac_2order)
- return 0;
+ return false;

- return 1;
+ return true;
case 1:
if ((free / fragments) >= ac->ac_g_ex.fe_len)
- return 1;
+ return true;
break;
case 2:
if (free >= ac->ac_g_ex.fe_len)
- return 1;
+ return true;
break;
case 3:
- return 1;
+ return true;
default:
BUG();
}

- return 0;
+ return false;
+}
+
+/*
+ * This could return negative error code if something goes wrong
+ * during ext4_mb_init_group(). This should not be called with
+ * ext4_lock_group() held.
+ */
+static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
+ ext4_group_t group, int cr)
+{
+ struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
+ ext4_grpblk_t free;
+ int ret = 0;
+
+ free = grp->bb_free;
+ if (free == 0)
+ goto out;
+ if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+ goto out;
+ if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+ goto out;
+
+ /* We only do this if the grp has never been initialized */
+ if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
+ ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+ if (ret)
+ return ret;
+ }
+
+ ret = ext4_mb_good_group(ac, group, cr);
+out:
+ return ret;
}

static noinline_for_stack int
@@ -2220,7 +2246,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
group = 0;

/* This now checks without needing the buddy page */
- ret = ext4_mb_good_group(ac, group, cr);
+ ret = ext4_mb_good_group_nolock(ac, group, cr);
if (ret <= 0) {
if (!first_err)
first_err = ret;
@@ -2238,11 +2264,9 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
* block group
*/
ret = ext4_mb_good_group(ac, group, cr);
- if (ret <= 0) {
+ if (ret == 0) {
ext4_unlock_group(sb, group);
ext4_mb_unload_buddy(&e4b);
- if (!first_err)
- first_err = ret;
continue;
}

--
2.21.0

2020-05-10 07:09:31

by Ritesh Harjani

Subject: [RFCv4 3/6] ext4: mballoc: Optimize ext4_mb_good_group_nolock further if grp needs init

Currently, in case EXT4_MB_GRP_NEED_INIT(grp) is true, we first
check a few things like grp->bb_free etc. with the spinlock
(ext4_lock_group()) held. Then we drop the lock, only to initialize the
group's buddy cache, and then take the lock again and check
ext4_mb_good_group().

Once this step is done we return to ext4_mb_regular_allocator(), load
the buddy, and then take the lock yet again only to check
ext4_mb_good_group(), which was anyway done in the previous step.

I believe we can optimize one step here: if the group needs init,
we can check only a few things and return early, then recheck
ext4_mb_good_group() just once after loading the buddy cache.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/mballoc.c | 44 +++++++++++++++++++++-----------------------
1 file changed, 21 insertions(+), 23 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dcd05ff7c012..7d766dc34cdd 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2134,33 +2134,34 @@ static bool ext4_mb_good_group(struct ext4_allocation_context *ac,
* during ext4_mb_init_group(). This should not be called with
* ext4_lock_group() held.
*/
-static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
+static bool ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
ext4_group_t group, int cr)
{
struct ext4_group_info *grp = ext4_get_group_info(ac->ac_sb, group);
struct super_block *sb = ac->ac_sb;
ext4_grpblk_t free;
- int ret = 0;
+ bool ret = false, need_init = EXT4_MB_GRP_NEED_INIT(grp);

ext4_lock_group(sb, group);
- free = grp->bb_free;
- if (free == 0)
- goto out;
- if (cr <= 2 && free < ac->ac_g_ex.fe_len)
- goto out;
- if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
- goto out;
- ext4_unlock_group(sb, group);
-
- /* We only do this if the grp has never been initialized */
- if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
- ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
- if (ret)
- return ret;
+ /*
+ * If the group needs init then no need to call ext4_mb_init_group()
+ * after dropping the lock. It's better we check bb_free/other things
+ * here and if it meets the criteria then return true. Later we
+ * will anyway check for good group after loading the buddy cache
+ * which, if required will call ext4_mb_init_group() from within.
+ */
+ if (unlikely(need_init)) {
+ free = grp->bb_free;
+ if (free == 0)
+ goto out;
+ if (cr <= 2 && free < ac->ac_g_ex.fe_len)
+ goto out;
+ if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
+ goto out;
+ ret = true;
+ } else {
+ ret = ext4_mb_good_group(ac, group, cr);
}
-
- ext4_lock_group(sb, group);
- ret = ext4_mb_good_group(ac, group, cr);
out:
ext4_unlock_group(sb, group);
return ret;
@@ -2252,11 +2253,8 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)

/* This now checks without needing the buddy page */
ret = ext4_mb_good_group_nolock(ac, group, cr);
- if (ret <= 0) {
- if (!first_err)
- first_err = ret;
+ if (ret == 0)
continue;
- }

err = ext4_mb_load_buddy(sb, group, &e4b);
if (err)
--
2.21.0

2020-05-10 07:10:47

by Ritesh Harjani

Subject: [RFCv4 5/6] ext4: mballoc: Refactor ext4_mb_discard_preallocations()

Implement ext4_mb_discard_preallocations_should_retry(),
which we will need in later patches to add more logic,
like checking for a sequence number match, to decide whether
we should retry the block allocation or not.

There should be no functional change in this patch.

Signed-off-by: Ritesh Harjani <[email protected]>
---
fs/ext4/mballoc.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 574ce400a3b5..c18670924bbe 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4570,6 +4570,17 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
return freed;
}

+static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
+ struct ext4_allocation_context *ac)
+{
+ int freed;
+
+ freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
+ if (freed)
+ return true;
+ return false;
+}
+
/*
* Main entry point into mballoc to allocate blocks
* it tries to use preallocation first, then falls back
@@ -4578,7 +4589,6 @@ static int ext4_mb_discard_preallocations(struct super_block *sb, int needed)
ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
struct ext4_allocation_request *ar, int *errp)
{
- int freed;
struct ext4_allocation_context *ac = NULL;
struct ext4_sb_info *sbi;
struct super_block *sb;
@@ -4683,8 +4693,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
ar->len = ac->ac_b_ex.fe_len;
}
} else {
- freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
- if (freed)
+ if (ext4_mb_discard_preallocations_should_retry(sb, ac))
goto repeat;
/*
* If block allocation fails then the pa allocated above
--
2.21.0