2022-09-27 09:20:55

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v2 0/8] ext4: Convert inode preallocation list to an rbtree

This patch series aim to improve the performance and scalability of
inode preallocation by changing inode preallocation linked list to an
rbtree. I've ran xfstests quick on this series and plan to run auto group
as well to confirm we have no regressions.

** Shortcomings of existing implementation **

Right now, we add all the inode preallocations(PAs) to a per inode linked
list ei->i_prealloc_list. To prevent the list from growing infinitely
during heavy sparse workloads, the lenght of this list was capped at 512
and a trimming logic was added to trim the list whenever it grew over
this threshold, in patch 27bc446e2. This was discussed in detail in the
following lore thread [1].

[1] https://lore.kernel.org/all/[email protected]/

But from our testing, we noticed that the current implementation still
had issues with scalability as the performance degraded when the PAs
stored in the list grew. Most of the degradation was seen in
ext4_mb_normalize_request() and ext4_mb_use_preallocated() functions as
they iterated the inode PA list.

** Improvements in this patchset **

To counter the above shortcomings, this patch series modifies the inode
PA list to an rbtree, which:

- improves the performance of functions discussed above due to the
improved lookup speed.

- improves scalability by changing lookup complexity from O(n) to
O(logn). We no longer need the trimming logic as well.

As a result, the RCU implementation was needed to be changed since
lockless lookups of rbtrees do have some issues like skipping
subtrees. Hence, RCU was replaced with read write locks for inode
PAs. More information can be found in Patch 7 (that has the core
changes).

** Performance Numbers **

Performance numbers were collected with and without these patches, using an
nvme device. Details of tests/benchmarks used are as follows:

Test 1: 200,000 1KiB sparse writes using (fio)
Test 2: Fill 5GiB w/ random writes, 1KiB burst size using (fio)
Test 3: Test 2, but do 4 sequential writes before jumping to random
offset (fio)
Test 4: Fill 8GB FS w/ 2KiB files, 64 threads in parallel (fsmark)

+──────────+──────────────────+────────────────+──────────────────+──────────────────+
| | nodelalloc | delalloc |
+──────────+──────────────────+────────────────+──────────────────+──────────────────+
| | Unpatched | Patched | Unpatched | Patched |
+──────────+──────────────────+────────────────+──────────────────+──────────────────+
| Test 1 | 11.8 MB/s | 23.3 MB/s | 27.2 MB/s | 63.7 MB/s |
| Test 2 | 1617 MB/s | 1740 MB/s | 2223 MB/s | 2208 MB/s |
| Test 3 | 1715 MB/s | 1823 MB/s | 2346 MB/s | 2364 MB/s |
| Test 4 | 14284 files/sec | 14347 files/s | 13762 files/sec | 13882 files/sec |
+──────────+──────────────────+────────────────+──────────────────+──────────────────+

In test 1, we almost see 100 to 200% increase in performance due to the high number
of sparse writes highlighting the bottleneck in the unpatched kernel. Further, on running
"perf diff patched.data unpatched.data" for test 1, we see something as follows:

2.83% +29.67% [kernel.vmlinux] [k] _raw_spin_lock
...
+3.33% [ext4] [k] ext4_mb_normalize_request.constprop.30
0.25% +2.81% [ext4] [k] ext4_mb_use_preallocated

Here we can see that the biggest different is in the _raw_spin_lock() function
of unpatched kernel, that is called from `ext4_mb_normalize_request()` as seen
here:

32.47% fio [kernel.vmlinux] [k] _raw_spin_lock
|
---_raw_spin_lock
|
--32.22%--ext4_mb_normalize_request.constprop.30

This is comming from the spin_lock(&pa->pa_lock) that is called for
each PA that we iterate over, in ext4_mb_normalize_request(). Since in rbtrees,
we lookup log(n) PAs rather than n PAs, this spin lock is taken less frequently,
as evident in the perf.

Furthermore, we see some improvements in other tests however since they don't
exercise the PA traversal path as much as test 1, the improvements are relatively
smaller.

** Summary of patches **

- Patch 1-5: Abstractions/Minor optimizations
- Patch 6: Split common inode & locality group specific fields to a union
- Patch 7: Core changes to move inode PA logic from list to rbtree
- Patch 8: Remove the trim logic as it is not needed

** Changes since v2 **
- Added a function definition that was deleted during v2 rebase

** Changes since v1 **

- Rebased over ext4 dev branch which includes Jan's patchset [1]
that changed some code in mballoc.c

[1] https://lore.kernel.org/all/[email protected]/

Ojaswin Mujoo (8):
ext4: Stop searching if PA doesn't satisfy non-extent file
ext4: Refactor code related to freeing PAs
ext4: Refactor code in ext4_mb_normalize_request() and
ext4_mb_use_preallocated()
ext4: Move overlap assert logic into a separate function
ext4: Abstract out overlap fix/check logic in
ext4_mb_normalize_request()
ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union
ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list
ext4: Remove the logic to trim inode PAs

Documentation/admin-guide/ext4.rst | 3 -
fs/ext4/ext4.h | 5 +-
fs/ext4/mballoc.c | 437 ++++++++++++++++++-----------
fs/ext4/mballoc.h | 17 +-
fs/ext4/super.c | 4 +-
fs/ext4/sysfs.c | 2 -
6 files changed, 293 insertions(+), 175 deletions(-)

--
2.31.1


2022-09-27 09:20:55

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v3 3/8] ext4: Refactor code in ext4_mb_normalize_request() and ext4_mb_use_preallocated()

Change some variable names to be more consistent and
refactor some of the code to make it easier to read.

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 97 ++++++++++++++++++++++++-----------------------
1 file changed, 49 insertions(+), 48 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8be6f8765a6f..84950df709bb 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4000,7 +4000,8 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
loff_t orig_size __maybe_unused;
ext4_lblk_t start;
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
- struct ext4_prealloc_space *pa;
+ struct ext4_prealloc_space *tmp_pa;
+ ext4_lblk_t tmp_pa_start, tmp_pa_end;

/* do normalize only data requests, metadata requests
do not need preallocation */
@@ -4103,56 +4104,53 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,

/* check we don't cross already preallocated blocks */
rcu_read_lock();
- list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
- ext4_lblk_t pa_end;
-
- if (pa->pa_deleted)
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ if (tmp_pa->pa_deleted)
continue;
- spin_lock(&pa->pa_lock);
- if (pa->pa_deleted) {
- spin_unlock(&pa->pa_lock);
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted) {
+ spin_unlock(&tmp_pa->pa_lock);
continue;
}

- pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
- pa->pa_len);
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);

/* PA must not overlap original request */
- BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
- ac->ac_o_ex.fe_logical < pa->pa_lstart));
+ BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
+ ac->ac_o_ex.fe_logical < tmp_pa_start));

/* skip PAs this normalized request doesn't overlap with */
- if (pa->pa_lstart >= end || pa_end <= start) {
- spin_unlock(&pa->pa_lock);
+ if (tmp_pa_start >= end || tmp_pa_end <= start) {
+ spin_unlock(&tmp_pa->pa_lock);
continue;
}
- BUG_ON(pa->pa_lstart <= start && pa_end >= end);
+ BUG_ON(tmp_pa_start <= start && tmp_pa_end >= end);

/* adjust start or end to be adjacent to this pa */
- if (pa_end <= ac->ac_o_ex.fe_logical) {
- BUG_ON(pa_end < start);
- start = pa_end;
- } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
- BUG_ON(pa->pa_lstart > end);
- end = pa->pa_lstart;
+ if (tmp_pa_end <= ac->ac_o_ex.fe_logical) {
+ BUG_ON(tmp_pa_end < start);
+ start = tmp_pa_end;
+ } else if (tmp_pa_start > ac->ac_o_ex.fe_logical) {
+ BUG_ON(tmp_pa_start > end);
+ end = tmp_pa_start;
}
- spin_unlock(&pa->pa_lock);
+ spin_unlock(&tmp_pa->pa_lock);
}
rcu_read_unlock();
size = end - start;

/* XXX: extra loop to check we really don't overlap preallocations */
rcu_read_lock();
- list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
- ext4_lblk_t pa_end;
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted == 0) {
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);

- spin_lock(&pa->pa_lock);
- if (pa->pa_deleted == 0) {
- pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
- pa->pa_len);
- BUG_ON(!(start >= pa_end || end <= pa->pa_lstart));
+ BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
}
- spin_unlock(&pa->pa_lock);
+ spin_unlock(&tmp_pa->pa_lock);
}
rcu_read_unlock();

@@ -4362,7 +4360,8 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
int order, i;
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_locality_group *lg;
- struct ext4_prealloc_space *pa, *cpa = NULL;
+ struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
+ ext4_lblk_t tmp_pa_start, tmp_pa_end;
ext4_fsblk_t goal_block;

/* only data can be preallocated */
@@ -4371,18 +4370,20 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)

/* first, try per-file preallocation */
rcu_read_lock();
- list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {

/* all fields in this condition don't change,
* so we can skip locking for them */
- if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
- ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
- EXT4_C2B(sbi, pa->pa_len)))
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+ if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
+ ac->ac_o_ex.fe_logical >= tmp_pa_end)
continue;

/* non-extent files can't have physical blocks past 2^32 */
if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
- (pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) >
+ (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
EXT4_MAX_BLOCK_FILE_PHYS)) {
/*
* Since PAs don't overlap, we won't find any
@@ -4392,16 +4393,16 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
}

/* found preallocated blocks, use them */
- spin_lock(&pa->pa_lock);
- if (pa->pa_deleted == 0 && pa->pa_free) {
- atomic_inc(&pa->pa_count);
- ext4_mb_use_inode_pa(ac, pa);
- spin_unlock(&pa->pa_lock);
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free) {
+ atomic_inc(&tmp_pa->pa_count);
+ ext4_mb_use_inode_pa(ac, tmp_pa);
+ spin_unlock(&tmp_pa->pa_lock);
ac->ac_criteria = 10;
rcu_read_unlock();
return true;
}
- spin_unlock(&pa->pa_lock);
+ spin_unlock(&tmp_pa->pa_lock);
}
rcu_read_unlock();

@@ -4425,16 +4426,16 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
*/
for (i = order; i < PREALLOC_TB_SIZE; i++) {
rcu_read_lock();
- list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
+ list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[i],
pa_inode_list) {
- spin_lock(&pa->pa_lock);
- if (pa->pa_deleted == 0 &&
- pa->pa_free >= ac->ac_o_ex.fe_len) {
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted == 0 &&
+ tmp_pa->pa_free >= ac->ac_o_ex.fe_len) {

cpa = ext4_mb_check_group_pa(goal_block,
- pa, cpa);
+ tmp_pa, cpa);
}
- spin_unlock(&pa->pa_lock);
+ spin_unlock(&tmp_pa->pa_lock);
}
rcu_read_unlock();
}
--
2.31.1

2022-09-27 09:21:50

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v3 2/8] ext4: Refactor code related to freeing PAs

This patch makes the following changes:

* Rename ext4_mb_pa_free to ext4_mb_pa_put_free
to better reflect its purpose

* Add new ext4_mb_pa_free() which only handles freeing

* Refactor ext4_mb_pa_callback() to use ext4_mb_pa_free()

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 29 ++++++++++++++++++++---------
1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 2e3eb632a216..8be6f8765a6f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4531,16 +4531,21 @@ static void ext4_mb_mark_pa_deleted(struct super_block *sb,
}
}

-static void ext4_mb_pa_callback(struct rcu_head *head)
+static void inline ext4_mb_pa_free(struct ext4_prealloc_space *pa)
{
- struct ext4_prealloc_space *pa;
- pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
-
+ BUG_ON(!pa);
BUG_ON(atomic_read(&pa->pa_count));
BUG_ON(pa->pa_deleted == 0);
kmem_cache_free(ext4_pspace_cachep, pa);
}

+static void ext4_mb_pa_callback(struct rcu_head *head)
+{
+ struct ext4_prealloc_space *pa;
+ pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
+ ext4_mb_pa_free(pa);
+}
+
/*
* drops a reference to preallocated space descriptor
* if this was the last reference and the space is consumed
@@ -5067,14 +5072,20 @@ static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
return 0;
}

-static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
+static void ext4_mb_pa_put_free(struct ext4_allocation_context *ac)
{
struct ext4_prealloc_space *pa = ac->ac_pa;

BUG_ON(!pa);
ac->ac_pa = NULL;
WARN_ON(!atomic_dec_and_test(&pa->pa_count));
- kmem_cache_free(ext4_pspace_cachep, pa);
+ /*
+ * current function is only called due to an error or due to
+ * len of found blocks < len of requested blocks hence the PA has not
+ * been added to grp->bb_prealloc_list. So we don't need to lock it
+ */
+ pa->pa_deleted = 1;
+ ext4_mb_pa_free(pa);
}

#ifdef CONFIG_EXT4_DEBUG
@@ -5623,13 +5634,13 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
* So we have to free this pa here itself.
*/
if (*errp) {
- ext4_mb_pa_free(ac);
+ ext4_mb_pa_put_free(ac);
ext4_discard_allocated_blocks(ac);
goto errout;
}
if (ac->ac_status == AC_STATUS_FOUND &&
ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
- ext4_mb_pa_free(ac);
+ ext4_mb_pa_put_free(ac);
}
if (likely(ac->ac_status == AC_STATUS_FOUND)) {
*errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
@@ -5648,7 +5659,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
* If block allocation fails then the pa allocated above
* needs to be freed here itself.
*/
- ext4_mb_pa_free(ac);
+ ext4_mb_pa_put_free(ac);
*errp = -ENOSPC;
}

--
2.31.1

2022-09-27 09:23:14

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v3 4/8] ext4: Move overlap assert logic into a separate function

Abstract out the logic to double check for overlaps in normalize_pa to
a separate function. Since there has been no reports in past where we
have seen any overlaps which hits this bug_on(), in future we can
consider calling this function under "#ifdef AGGRESSIVE_CHECK" only.

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 84950df709bb..d1ce34888dcc 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3985,6 +3985,29 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
}

+static inline void
+ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
+ ext4_lblk_t start, ext4_lblk_t end)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+ struct ext4_prealloc_space *tmp_pa;
+ ext4_lblk_t tmp_pa_start, tmp_pa_end;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted == 0) {
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+ BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
+ }
+ spin_unlock(&tmp_pa->pa_lock);
+ }
+ rcu_read_unlock();
+}
+
/*
* Normalization means making request better in terms of
* size and alignment
@@ -4141,18 +4164,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
size = end - start;

/* XXX: extra loop to check we really don't overlap preallocations */
- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
- spin_lock(&tmp_pa->pa_lock);
- if (tmp_pa->pa_deleted == 0) {
- tmp_pa_start = tmp_pa->pa_lstart;
- tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
-
- BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
- }
- spin_unlock(&tmp_pa->pa_lock);
- }
- rcu_read_unlock();
+ ext4_mb_pa_assert_overlap(ac, start, end);

/*
* In this function "start" and "size" are normalized for better
--
2.31.1

2022-09-27 09:27:03

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

Currently, the kernel uses i_prealloc_list to hold all the inode
preallocations. This is known to cause degradation in performance in
workloads which perform large number of sparse writes on a single file.
This is mainly because functions like ext4_mb_normalize_request() and
ext4_mb_use_preallocated() iterate over this complete list, resulting in
slowdowns when large number of PAs are present.

Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for
the inode preallocation list and adding logic to continually trim the
list if it grows above the threshold, however our testing revealed that
a hardcoded value is not suitable for all kinds of workloads.

To optimize this, add an rbtree to the inode and hold the inode
preallocations in this rbtree. This will make iterating over inode PAs
faster and scale much better than a linked list. Additionally, we also
had to remove the LRU logic that was added during trimming of the list
(in ext4_mb_release_context()) as it will add extra overhead in rbtree.
The discards now happen in the lowest-logical-offset-first order.

** Locking notes **

With the introduction of rbtree to maintain inode PAs, we can't use RCU
to walk the tree for searching since it can result in partial traversals
which might miss some nodes(or entire subtrees) while discards happen
in parallel (which happens under a lock). Hence this patch converts the
ei->i_prealloc_lock spin_lock to rw_lock.

Almost all the codepaths that read/modify the PA rbtrees are protected
by the higher level inode->i_data_sem (except
ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the
only place we need lock protection is when one thread is reading
"searching" the PA rbtree (earlier protected under rcu_read_lock()) and
another is "deleting" the PAs in ext4_mb_discard_group_preallocations()
function (which iterates all the PAs using the grp->bb_prealloc_list and
deletes PAs from the tree without taking any inode lock (i_data_sem)).

So, this patch converts all rcu_read_lock/unlock() paths for inode list
PA to use read_lock() and all places where we were using
ei->i_prealloc_lock spinlock will now be using write_lock().

Note that this makes the fast path (searching of the right PA e.g.
ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use
read_lock() instead of rcu_read_lock/unlock(). Ths also will now block
due to slow discard path (ext4_mb_discard_group_preallocations()) which
uses write_lock().

But this is not as bad as it looks. This is because -

1. The slow path only occurs when the normal allocation failed and we
can say that we are low on disk space. One can argue this scenario
won't be much frequent.

2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock
for deleting every individual PA. This gives enough opportunity for
the fast path to acquire the read_lock for searching the PA inode
list.

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/ext4.h | 4 +-
fs/ext4/mballoc.c | 192 ++++++++++++++++++++++++++++++++--------------
fs/ext4/mballoc.h | 6 +-
fs/ext4/super.c | 4 +-
4 files changed, 140 insertions(+), 66 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3bf9a6926798..d54b972f1f0f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1120,8 +1120,8 @@ struct ext4_inode_info {

/* mballoc */
atomic_t i_prealloc_active;
- struct list_head i_prealloc_list;
- spinlock_t i_prealloc_lock;
+ struct rb_root i_prealloc_node;
+ rwlock_t i_prealloc_lock;

/* extents status tree */
struct ext4_es_tree i_es_tree;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b91710fe881f..cd19b9e84767 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3985,6 +3985,24 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
}

+/*
+ * This function returns the next element to look at during inode
+ * PA rbtree walk. We assume that we have held the inode PA rbtree lock
+ * (ei->i_prealloc_lock)
+ *
+ * new_start The start of the range we want to compare
+ * cur_start The existing start that we are comparing against
+ * node The node of the rb_tree
+ */
+static inline struct rb_node*
+ext4_mb_pa_rb_next_iter(int new_start, int cur_start, struct rb_node *node)
+{
+ if (new_start < cur_start)
+ return node->rb_left;
+ else
+ return node->rb_right;
+}
+
static inline void
ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
ext4_lblk_t start, ext4_lblk_t end)
@@ -3993,27 +4011,31 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_prealloc_space *tmp_pa;
ext4_lblk_t tmp_pa_start, tmp_pa_end;
+ struct rb_node *iter;

- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
- spin_lock(&tmp_pa->pa_lock);
- if (tmp_pa->pa_deleted == 0) {
- tmp_pa_start = tmp_pa->pa_lstart;
- tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+ read_lock(&ei->i_prealloc_lock);
+ iter = ei->i_prealloc_node.rb_node;
+ while (iter) {
+ tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);

+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted == 0)
BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
- }
spin_unlock(&tmp_pa->pa_lock);
+
+ iter = ext4_mb_pa_rb_next_iter(start, tmp_pa_start, iter);
}
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);
}
-
/*
* Given an allocation context "ac" and a range "start", "end", check
* and adjust boundaries if the range overlaps with any of the existing
* preallocatoins stored in the corresponding inode of the allocation context.
*
- *Parameters:
+ * Parameters:
* ac allocation context
* start start of the new range
* end end of the new range
@@ -4025,6 +4047,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_prealloc_space *tmp_pa;
+ struct rb_node *iter;
ext4_lblk_t new_start, new_end;
ext4_lblk_t tmp_pa_start, tmp_pa_end;

@@ -4032,19 +4055,29 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
new_end = *end;

/* check we don't cross already preallocated blocks */
- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
- if (tmp_pa->pa_deleted)
+ read_lock(&ei->i_prealloc_lock);
+ iter = ei->i_prealloc_node.rb_node;
+ while (iter) {
+ tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+ /*
+ * If pa is deleted, ignore overlaps and just iterate in rbtree
+ * based on tmp_pa_start
+ */
+ if (tmp_pa->pa_deleted) {
+ iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
continue;
+ }
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted) {
spin_unlock(&tmp_pa->pa_lock);
+ iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
continue;
}

- tmp_pa_start = tmp_pa->pa_lstart;
- tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
-
/* PA must not overlap original request */
BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
ac->ac_o_ex.fe_logical < tmp_pa_start));
@@ -4052,6 +4085,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
/* skip PAs this normalized request doesn't overlap with */
if (tmp_pa_start >= new_end || tmp_pa_end <= new_start) {
spin_unlock(&tmp_pa->pa_lock);
+ iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
continue;
}
BUG_ON(tmp_pa_start <= new_start && tmp_pa_end >= new_end);
@@ -4065,8 +4099,9 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
new_end = tmp_pa_start;
}
spin_unlock(&tmp_pa->pa_lock);
+ iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
}
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);

/* XXX: extra loop to check we really don't overlap preallocations */
ext4_mb_pa_assert_overlap(ac, new_start, new_end);
@@ -4193,7 +4228,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
ext4_mb_pa_adjust_overlap(ac, &start, &end);

size = end - start;
-
/*
* In this function "start" and "size" are normalized for better
* alignment and length such that we could preallocate more blocks.
@@ -4402,6 +4436,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
struct ext4_locality_group *lg;
struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
ext4_lblk_t tmp_pa_start, tmp_pa_end;
+ struct rb_node *iter;
ext4_fsblk_t goal_block;

/* only data can be preallocated */
@@ -4409,17 +4444,23 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
return false;

/* first, try per-file preallocation */
- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
+ read_lock(&ei->i_prealloc_lock);
+ iter = ei->i_prealloc_node.rb_node;
+ while (iter) {
+ tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);

/* all fields in this condition don't change,
* so we can skip locking for them */
tmp_pa_start = tmp_pa->pa_lstart;
tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);

+ /* original request start doesn't lie in this PA */
if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
- ac->ac_o_ex.fe_logical >= tmp_pa_end)
+ ac->ac_o_ex.fe_logical >= tmp_pa_end) {
+ iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
+ tmp_pa_start, iter);
continue;
+ }

/* non-extent files can't have physical blocks past 2^32 */
if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
@@ -4439,12 +4480,14 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
ext4_mb_use_inode_pa(ac, tmp_pa);
spin_unlock(&tmp_pa->pa_lock);
ac->ac_criteria = 10;
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);
return true;
}
spin_unlock(&tmp_pa->pa_lock);
+ iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
+ tmp_pa_start, iter);
}
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);

/* can we use group allocation? */
if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
@@ -4596,6 +4639,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
{
ext4_group_t grp;
ext4_fsblk_t grp_blk;
+ struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);

/* in this short window concurrent discard can set pa_deleted */
spin_lock(&pa->pa_lock);
@@ -4641,16 +4685,51 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
ext4_unlock_group(sb, grp);

if (pa->pa_type == MB_INODE_PA) {
- spin_lock(pa->pa_node_lock.inode_lock);
- list_del_rcu(&pa->pa_node.inode_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
+ write_lock(pa->pa_node_lock.inode_lock);
+ rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
+ write_unlock(pa->pa_node_lock.inode_lock);
+ ext4_mb_pa_free(pa);
} else {
spin_lock(pa->pa_node_lock.lg_lock);
list_del_rcu(&pa->pa_node.lg_list);
spin_unlock(pa->pa_node_lock.lg_lock);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
+}
+
+static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
+ int (*cmp)(struct rb_node *, struct rb_node *))
+{
+ struct rb_node **iter = &root->rb_node, *parent = NULL;
+
+ while (*iter) {
+ parent = *iter;
+ if (cmp(new, *iter) > 0)
+ iter = &((*iter)->rb_left);
+ else
+ iter = &((*iter)->rb_right);
+ }
+
+ rb_link_node(new, parent, iter);
+ rb_insert_color(new, root);
+}
+
+static int ext4_mb_pa_cmp(struct rb_node *new, struct rb_node *cur)
+{
+ ext4_grpblk_t cur_start, new_start;
+ struct ext4_prealloc_space *cur_pa = rb_entry(cur,
+ struct ext4_prealloc_space,
+ pa_node.inode_node);
+ struct ext4_prealloc_space *new_pa = rb_entry(new,
+ struct ext4_prealloc_space,
+ pa_node.inode_node);
+ cur_start = cur_pa->pa_lstart;
+ new_start = new_pa->pa_lstart;

- call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ if (new_start < cur_start)
+ return 1;
+ else
+ return -1;
}

/*
@@ -4716,7 +4795,6 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
spin_lock_init(&pa->pa_lock);
- INIT_LIST_HEAD(&pa->pa_node.inode_list);
INIT_LIST_HEAD(&pa->pa_group_list);
pa->pa_deleted = 0;
pa->pa_type = MB_INODE_PA;
@@ -4736,9 +4814,10 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)

list_add(&pa->pa_group_list, &grp->bb_prealloc_list);

- spin_lock(pa->pa_node_lock.inode_lock);
- list_add_rcu(&pa->pa_node.inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
+ write_lock(pa->pa_node_lock.inode_lock);
+ ext4_mb_rb_insert(&ei->i_prealloc_node, &pa->pa_node.inode_node,
+ ext4_mb_pa_cmp);
+ write_unlock(pa->pa_node_lock.inode_lock);
atomic_inc(&ei->i_prealloc_active);
}

@@ -4904,6 +4983,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
struct ext4_prealloc_space *pa, *tmp;
struct list_head list;
struct ext4_buddy e4b;
+ struct ext4_inode_info *ei;
int err;
int free = 0;

@@ -4967,18 +5047,21 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
list_del_rcu(&pa->pa_node.lg_list);
spin_unlock(pa->pa_node_lock.lg_lock);
} else {
- spin_lock(pa->pa_node_lock.inode_lock);
- list_del_rcu(&pa->pa_node.inode_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
+ write_lock(pa->pa_node_lock.inode_lock);
+ ei = EXT4_I(pa->pa_inode);
+ rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
+ write_unlock(pa->pa_node_lock.inode_lock);
}

- if (pa->pa_type == MB_GROUP_PA)
+ list_del(&pa->u.pa_tmp_list);
+
+ if (pa->pa_type == MB_GROUP_PA) {
ext4_mb_release_group_pa(&e4b, pa);
- else
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ } else {
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
-
- list_del(&pa->u.pa_tmp_list);
- call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ ext4_mb_pa_free(pa);
+ }
}

ext4_unlock_group(sb, group);
@@ -5008,6 +5091,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
ext4_group_t group = 0;
struct list_head list;
struct ext4_buddy e4b;
+ struct rb_node *iter;
int err;

if (!S_ISREG(inode->i_mode)) {
@@ -5030,17 +5114,18 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)

repeat:
/* first, collect all pa's in the inode */
- spin_lock(&ei->i_prealloc_lock);
- while (!list_empty(&ei->i_prealloc_list) && needed) {
- pa = list_entry(ei->i_prealloc_list.prev,
- struct ext4_prealloc_space, pa_node.inode_list);
+ write_lock(&ei->i_prealloc_lock);
+ for (iter = rb_first(&ei->i_prealloc_node); iter && needed; iter = rb_next(iter)) {
+ pa = rb_entry(iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
BUG_ON(pa->pa_node_lock.inode_lock != &ei->i_prealloc_lock);
+
spin_lock(&pa->pa_lock);
if (atomic_read(&pa->pa_count)) {
/* this shouldn't happen often - nobody should
* use preallocation while we're discarding it */
spin_unlock(&pa->pa_lock);
- spin_unlock(&ei->i_prealloc_lock);
+ write_unlock(&ei->i_prealloc_lock);
ext4_msg(sb, KERN_ERR,
"uh-oh! used pa while discarding");
WARN_ON(1);
@@ -5051,7 +5136,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
if (pa->pa_deleted == 0) {
ext4_mb_mark_pa_deleted(sb, pa);
spin_unlock(&pa->pa_lock);
- list_del_rcu(&pa->pa_node.inode_list);
+ rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
list_add(&pa->u.pa_tmp_list, &list);
needed--;
continue;
@@ -5059,7 +5144,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)

/* someone is deleting pa right now */
spin_unlock(&pa->pa_lock);
- spin_unlock(&ei->i_prealloc_lock);
+ write_unlock(&ei->i_prealloc_lock);

/* we have to wait here because pa_deleted
* doesn't mean pa is already unlinked from
@@ -5076,7 +5161,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
schedule_timeout_uninterruptible(HZ);
goto repeat;
}
- spin_unlock(&ei->i_prealloc_lock);
+ write_unlock(&ei->i_prealloc_lock);

list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
BUG_ON(pa->pa_type != MB_INODE_PA);
@@ -5108,7 +5193,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
put_bh(bitmap_bh);

list_del(&pa->u.pa_tmp_list);
- call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ ext4_mb_pa_free(pa);
}
}

@@ -5484,7 +5569,6 @@ static void ext4_mb_trim_inode_pa(struct inode *inode)
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
struct inode *inode = ac->ac_inode;
- struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_prealloc_space *pa = ac->ac_pa;
if (pa) {
@@ -5511,16 +5595,6 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
}
}

- if (pa->pa_type == MB_INODE_PA) {
- /*
- * treat per-inode prealloc list as a lru list, then try
- * to trim the least recently used PA.
- */
- spin_lock(pa->pa_node_lock.inode_lock);
- list_move(&pa->pa_node.inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
- }
-
ext4_mb_put_pa(ac, ac->ac_sb, pa);
}
if (ac->ac_bitmap_page)
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 398a6688c341..f8e8ee493867 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -115,7 +115,7 @@ struct ext4_free_data {

struct ext4_prealloc_space {
union {
- struct list_head inode_list; /* for inode PAs */
+ struct rb_node inode_node; /* for inode PA rbtree */
struct list_head lg_list; /* for lg PAs */
} pa_node;
struct list_head pa_group_list;
@@ -132,10 +132,10 @@ struct ext4_prealloc_space {
ext4_grpblk_t pa_free; /* how many blocks are free */
unsigned short pa_type; /* pa type. inode or group */
union {
- spinlock_t *inode_lock; /* locks the inode list holding this PA */
+ rwlock_t *inode_lock; /* locks the rbtree holding this PA */
spinlock_t *lg_lock; /* locks the lg list holding this PA */
} pa_node_lock;
- struct inode *pa_inode; /* hack, for history only */
+ struct inode *pa_inode; /* used to get the inode during group discard */
};

enum {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 9a66abcca1a8..5e4fd4ea65fc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1330,9 +1330,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)

inode_set_iversion(&ei->vfs_inode, 1);
spin_lock_init(&ei->i_raw_lock);
- INIT_LIST_HEAD(&ei->i_prealloc_list);
+ ei->i_prealloc_node = RB_ROOT;
atomic_set(&ei->i_prealloc_active, 0);
- spin_lock_init(&ei->i_prealloc_lock);
+ rwlock_init(&ei->i_prealloc_lock);
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
INIT_LIST_HEAD(&ei->i_es_list);
--
2.31.1

2022-09-27 09:33:02

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v3 6/8] ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union

** Splitting pa->pa_inode_list **

Currently, we use the same pa->pa_inode_list to add a pa to either
the inode preallocation list or the locality group preallocation list.
For better clarity, split this list into a union of 2 list_heads and use
either of the them based on the type of pa.

** Splitting pa->pa_obj_lock **

Currently, pa->pa_obj_lock is either assigned &ei->i_prealloc_lock for
inode PAs or lg_prealloc_lock for lg PAs, and is then used to lock the
lists containing these PAs. Make the distinction between the 2 PA types
clear by changing this lock to a union of 2 locks. Explicitly use the
pa_lock_node.inode_lock for inode PAs and pa_lock_node.lg_lock for lg
PAs.

This patch is required so that the locality group preallocation code
remains the same as in upcoming patches we are going to make changes to
inode preallocation code to move from list to rbtree based
implementation. This patch also makes it easier to review the upcoming
patches.

There are no functional changes in this patch.

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 76 +++++++++++++++++++++++++++--------------------
fs/ext4/mballoc.h | 10 +++++--
2 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dda9a72c81d9..b91710fe881f 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3995,7 +3995,7 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
ext4_lblk_t tmp_pa_start, tmp_pa_end;

rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted == 0) {
tmp_pa_start = tmp_pa->pa_lstart;
@@ -4033,7 +4033,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,

/* check we don't cross already preallocated blocks */
rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
if (tmp_pa->pa_deleted)
continue;
spin_lock(&tmp_pa->pa_lock);
@@ -4410,7 +4410,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)

/* first, try per-file preallocation */
rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {

/* all fields in this condition don't change,
* so we can skip locking for them */
@@ -4467,7 +4467,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
for (i = order; i < PREALLOC_TB_SIZE; i++) {
rcu_read_lock();
list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[i],
- pa_inode_list) {
+ pa_node.lg_list) {
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted == 0 &&
tmp_pa->pa_free >= ac->ac_o_ex.fe_len) {
@@ -4640,9 +4640,15 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);

- spin_lock(pa->pa_obj_lock);
- list_del_rcu(&pa->pa_inode_list);
- spin_unlock(pa->pa_obj_lock);
+ if (pa->pa_type == MB_INODE_PA) {
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_del_rcu(&pa->pa_node.inode_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
+ } else {
+ spin_lock(pa->pa_node_lock.lg_lock);
+ list_del_rcu(&pa->pa_node.lg_list);
+ spin_unlock(pa->pa_node_lock.lg_lock);
+ }

call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
@@ -4710,7 +4716,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
spin_lock_init(&pa->pa_lock);
- INIT_LIST_HEAD(&pa->pa_inode_list);
+ INIT_LIST_HEAD(&pa->pa_node.inode_list);
INIT_LIST_HEAD(&pa->pa_group_list);
pa->pa_deleted = 0;
pa->pa_type = MB_INODE_PA;
@@ -4725,14 +4731,14 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
ei = EXT4_I(ac->ac_inode);
grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);

- pa->pa_obj_lock = &ei->i_prealloc_lock;
+ pa->pa_node_lock.inode_lock = &ei->i_prealloc_lock;
pa->pa_inode = ac->ac_inode;

list_add(&pa->pa_group_list, &grp->bb_prealloc_list);

- spin_lock(pa->pa_obj_lock);
- list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_obj_lock);
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_add_rcu(&pa->pa_node.inode_list, &ei->i_prealloc_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
atomic_inc(&ei->i_prealloc_active);
}

@@ -4764,7 +4770,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
spin_lock_init(&pa->pa_lock);
- INIT_LIST_HEAD(&pa->pa_inode_list);
+ INIT_LIST_HEAD(&pa->pa_node.lg_list);
INIT_LIST_HEAD(&pa->pa_group_list);
pa->pa_deleted = 0;
pa->pa_type = MB_GROUP_PA;
@@ -4780,7 +4786,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
lg = ac->ac_lg;
BUG_ON(lg == NULL);

- pa->pa_obj_lock = &lg->lg_prealloc_lock;
+ pa->pa_node_lock.lg_lock = &lg->lg_prealloc_lock;
pa->pa_inode = NULL;

list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
@@ -4956,9 +4962,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {

/* remove from object (inode or locality group) */
- spin_lock(pa->pa_obj_lock);
- list_del_rcu(&pa->pa_inode_list);
- spin_unlock(pa->pa_obj_lock);
+ if (pa->pa_type == MB_GROUP_PA) {
+ spin_lock(pa->pa_node_lock.lg_lock);
+ list_del_rcu(&pa->pa_node.lg_list);
+ spin_unlock(pa->pa_node_lock.lg_lock);
+ } else {
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_del_rcu(&pa->pa_node.inode_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
+ }

if (pa->pa_type == MB_GROUP_PA)
ext4_mb_release_group_pa(&e4b, pa);
@@ -5021,8 +5033,8 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
spin_lock(&ei->i_prealloc_lock);
while (!list_empty(&ei->i_prealloc_list) && needed) {
pa = list_entry(ei->i_prealloc_list.prev,
- struct ext4_prealloc_space, pa_inode_list);
- BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
+ struct ext4_prealloc_space, pa_node.inode_list);
+ BUG_ON(pa->pa_node_lock.inode_lock != &ei->i_prealloc_lock);
spin_lock(&pa->pa_lock);
if (atomic_read(&pa->pa_count)) {
/* this shouldn't happen often - nobody should
@@ -5039,7 +5051,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
if (pa->pa_deleted == 0) {
ext4_mb_mark_pa_deleted(sb, pa);
spin_unlock(&pa->pa_lock);
- list_del_rcu(&pa->pa_inode_list);
+ list_del_rcu(&pa->pa_node.inode_list);
list_add(&pa->u.pa_tmp_list, &list);
needed--;
continue;
@@ -5331,7 +5343,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,

spin_lock(&lg->lg_prealloc_lock);
list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[order],
- pa_inode_list,
+ pa_node.lg_list,
lockdep_is_held(&lg->lg_prealloc_lock)) {
spin_lock(&pa->pa_lock);
if (atomic_read(&pa->pa_count)) {
@@ -5354,7 +5366,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
ext4_mb_mark_pa_deleted(sb, pa);
spin_unlock(&pa->pa_lock);

- list_del_rcu(&pa->pa_inode_list);
+ list_del_rcu(&pa->pa_node.lg_list);
list_add(&pa->u.pa_tmp_list, &discard_list);

total_entries--;
@@ -5415,7 +5427,7 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
/* Add the prealloc space to lg */
spin_lock(&lg->lg_prealloc_lock);
list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
- pa_inode_list,
+ pa_node.lg_list,
lockdep_is_held(&lg->lg_prealloc_lock)) {
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted) {
@@ -5424,8 +5436,8 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
}
if (!added && pa->pa_free < tmp_pa->pa_free) {
/* Add to the tail of the previous entry */
- list_add_tail_rcu(&pa->pa_inode_list,
- &tmp_pa->pa_inode_list);
+ list_add_tail_rcu(&pa->pa_node.lg_list,
+ &tmp_pa->pa_node.lg_list);
added = 1;
/*
* we want to count the total
@@ -5436,7 +5448,7 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
lg_prealloc_count++;
}
if (!added)
- list_add_tail_rcu(&pa->pa_inode_list,
+ list_add_tail_rcu(&pa->pa_node.lg_list,
&lg->lg_prealloc_list[order]);
spin_unlock(&lg->lg_prealloc_lock);

@@ -5492,9 +5504,9 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
* doesn't grow big.
*/
if (likely(pa->pa_free)) {
- spin_lock(pa->pa_obj_lock);
- list_del_rcu(&pa->pa_inode_list);
- spin_unlock(pa->pa_obj_lock);
+ spin_lock(pa->pa_node_lock.lg_lock);
+ list_del_rcu(&pa->pa_node.lg_list);
+ spin_unlock(pa->pa_node_lock.lg_lock);
ext4_mb_add_n_trim(ac);
}
}
@@ -5504,9 +5516,9 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
* treat per-inode prealloc list as a lru list, then try
* to trim the least recently used PA.
*/
- spin_lock(pa->pa_obj_lock);
- list_move(&pa->pa_inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_obj_lock);
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_move(&pa->pa_node.inode_list, &ei->i_prealloc_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
}

ext4_mb_put_pa(ac, ac->ac_sb, pa);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index dcda2a943cee..398a6688c341 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -114,7 +114,10 @@ struct ext4_free_data {
};

struct ext4_prealloc_space {
- struct list_head pa_inode_list;
+ union {
+ struct list_head inode_list; /* for inode PAs */
+ struct list_head lg_list; /* for lg PAs */
+ } pa_node;
struct list_head pa_group_list;
union {
struct list_head pa_tmp_list;
@@ -128,7 +131,10 @@ struct ext4_prealloc_space {
ext4_grpblk_t pa_len; /* len of preallocated chunk */
ext4_grpblk_t pa_free; /* how many blocks are free */
unsigned short pa_type; /* pa type. inode or group */
- spinlock_t *pa_obj_lock;
+ union {
+ spinlock_t *inode_lock; /* locks the inode list holding this PA */
+ spinlock_t *lg_lock; /* locks the lg list holding this PA */
+ } pa_node_lock;
struct inode *pa_inode; /* hack, for history only */
};

--
2.31.1

2022-09-27 09:36:32

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v3 5/8] ext4: Abstract out overlap fix/check logic in ext4_mb_normalize_request()

Abstract out the logic of fixing PA overlaps in ext4_mb_normalize_request to
improve readability of code. This also makes it easier to make changes
to the overlap logic in future.

There are no functional changes in this patch

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/mballoc.c | 110 +++++++++++++++++++++++++++++-----------------
1 file changed, 69 insertions(+), 41 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d1ce34888dcc..dda9a72c81d9 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4008,6 +4008,74 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
rcu_read_unlock();
}

+/*
+ * Given an allocation context "ac" and a range "start", "end", check
+ * and adjust boundaries if the range overlaps with any of the existing
+ * preallocatoins stored in the corresponding inode of the allocation context.
+ *
+ *Parameters:
+ * ac allocation context
+ * start start of the new range
+ * end end of the new range
+ */
+static inline void
+ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
+ ext4_lblk_t *start, ext4_lblk_t *end)
+{
+ struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
+ struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+ struct ext4_prealloc_space *tmp_pa;
+ ext4_lblk_t new_start, new_end;
+ ext4_lblk_t tmp_pa_start, tmp_pa_end;
+
+ new_start = *start;
+ new_end = *end;
+
+ /* check we don't cross already preallocated blocks */
+ rcu_read_lock();
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ if (tmp_pa->pa_deleted)
+ continue;
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted) {
+ spin_unlock(&tmp_pa->pa_lock);
+ continue;
+ }
+
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+ /* PA must not overlap original request */
+ BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
+ ac->ac_o_ex.fe_logical < tmp_pa_start));
+
+ /* skip PAs this normalized request doesn't overlap with */
+ if (tmp_pa_start >= new_end || tmp_pa_end <= new_start) {
+ spin_unlock(&tmp_pa->pa_lock);
+ continue;
+ }
+ BUG_ON(tmp_pa_start <= new_start && tmp_pa_end >= new_end);
+
+ /* adjust start or end to be adjacent to this pa */
+ if (tmp_pa_end <= ac->ac_o_ex.fe_logical) {
+ BUG_ON(tmp_pa_end < new_start);
+ new_start = tmp_pa_end;
+ } else if (tmp_pa_start > ac->ac_o_ex.fe_logical) {
+ BUG_ON(tmp_pa_start > new_end);
+ new_end = tmp_pa_start;
+ }
+ spin_unlock(&tmp_pa->pa_lock);
+ }
+ rcu_read_unlock();
+
+ /* XXX: extra loop to check we really don't overlap preallocations */
+ ext4_mb_pa_assert_overlap(ac, new_start, new_end);
+
+ *start = new_start;
+ *end = new_end;
+ return;
+}
+
/*
* Normalization means making request better in terms of
* size and alignment
@@ -4022,9 +4090,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
loff_t size, start_off;
loff_t orig_size __maybe_unused;
ext4_lblk_t start;
- struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
- struct ext4_prealloc_space *tmp_pa;
- ext4_lblk_t tmp_pa_start, tmp_pa_end;

/* do normalize only data requests, metadata requests
do not need preallocation */
@@ -4125,47 +4190,10 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,

end = start + size;

- /* check we don't cross already preallocated blocks */
- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
- if (tmp_pa->pa_deleted)
- continue;
- spin_lock(&tmp_pa->pa_lock);
- if (tmp_pa->pa_deleted) {
- spin_unlock(&tmp_pa->pa_lock);
- continue;
- }
-
- tmp_pa_start = tmp_pa->pa_lstart;
- tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
-
- /* PA must not overlap original request */
- BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
- ac->ac_o_ex.fe_logical < tmp_pa_start));
-
- /* skip PAs this normalized request doesn't overlap with */
- if (tmp_pa_start >= end || tmp_pa_end <= start) {
- spin_unlock(&tmp_pa->pa_lock);
- continue;
- }
- BUG_ON(tmp_pa_start <= start && tmp_pa_end >= end);
+ ext4_mb_pa_adjust_overlap(ac, &start, &end);

- /* adjust start or end to be adjacent to this pa */
- if (tmp_pa_end <= ac->ac_o_ex.fe_logical) {
- BUG_ON(tmp_pa_end < start);
- start = tmp_pa_end;
- } else if (tmp_pa_start > ac->ac_o_ex.fe_logical) {
- BUG_ON(tmp_pa_start > end);
- end = tmp_pa_start;
- }
- spin_unlock(&tmp_pa->pa_lock);
- }
- rcu_read_unlock();
size = end - start;

- /* XXX: extra loop to check we really don't overlap preallocations */
- ext4_mb_pa_assert_overlap(ac, start, end);
-
/*
* In this function "start" and "size" are normalized for better
* alignment and length such that we could preallocate more blocks.
--
2.31.1

2022-09-27 09:55:27

by Ojaswin Mujoo

[permalink] [raw]
Subject: [RFC v3 8/8] ext4: Remove the logic to trim inode PAs

Earlier, inode PAs were stored in a linked list. This caused a need to
periodically trim the list down inorder to avoid growing it to a very
large size, as this would severly affect performance during list
iteration.

Recent patches changed this list to an rbtree, and since the tree scales
up much better, we no longer need to have the trim functionality, hence
remove it.

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
---
Documentation/admin-guide/ext4.rst | 3 ---
fs/ext4/ext4.h | 1 -
fs/ext4/mballoc.c | 20 --------------------
fs/ext4/mballoc.h | 5 -----
fs/ext4/sysfs.c | 2 --
5 files changed, 31 deletions(-)

diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index 4c559e08d11e..5740d85439ff 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -489,9 +489,6 @@ Files in /sys/fs/ext4/<devname>:
multiple of this tuning parameter if the stripe size is not set in the
ext4 superblock

- mb_max_inode_prealloc
- The maximum length of per-inode ext4_prealloc_space list.
-
mb_max_to_scan
The maximum number of extents the multiblock allocator will search to
find the best extent.
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d54b972f1f0f..bca4b41cc192 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1612,7 +1612,6 @@ struct ext4_sb_info {
unsigned int s_mb_stats;
unsigned int s_mb_order2_reqs;
unsigned int s_mb_group_prealloc;
- unsigned int s_mb_max_inode_prealloc;
unsigned int s_max_dir_size_kb;
/* where last allocation was done - for stream allocation */
unsigned long s_mb_last_group;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index cd19b9e84767..57e1ec88477a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3420,7 +3420,6 @@ int ext4_mb_init(struct super_block *sb)
sbi->s_mb_stats = MB_DEFAULT_STATS;
sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
- sbi->s_mb_max_inode_prealloc = MB_DEFAULT_MAX_INODE_PREALLOC;
/*
* The default group preallocation is 512, which for 4k block
* sizes translates to 2 megabytes. However for bigalloc file
@@ -5546,29 +5545,11 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
return ;
}

-/*
- * if per-inode prealloc list is too long, trim some PA
- */
-static void ext4_mb_trim_inode_pa(struct inode *inode)
-{
- struct ext4_inode_info *ei = EXT4_I(inode);
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- int count, delta;
-
- count = atomic_read(&ei->i_prealloc_active);
- delta = (sbi->s_mb_max_inode_prealloc >> 2) + 1;
- if (count > sbi->s_mb_max_inode_prealloc + delta) {
- count -= sbi->s_mb_max_inode_prealloc;
- ext4_discard_preallocations(inode, count);
- }
-}
-
/*
* release all resource we used in allocation
*/
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
- struct inode *inode = ac->ac_inode;
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_prealloc_space *pa = ac->ac_pa;
if (pa) {
@@ -5604,7 +5585,6 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
mutex_unlock(&ac->ac_lg->lg_mutex);
ext4_mb_collect_stats(ac);
- ext4_mb_trim_inode_pa(inode);
return 0;
}

diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index f8e8ee493867..6d85ee8674a6 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -73,11 +73,6 @@
*/
#define MB_DEFAULT_GROUP_PREALLOC 512

-/*
- * maximum length of inode prealloc list
- */
-#define MB_DEFAULT_MAX_INODE_PREALLOC 512
-
/*
* Number of groups to search linearly before performing group scanning
* optimization.
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index d233c24ea342..f0d42cf44c71 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -214,7 +214,6 @@ EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
-EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
EXT4_RW_ATTR_SBI_UI(mb_max_linear_groups, s_mb_max_linear_groups);
EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
@@ -264,7 +263,6 @@ static struct attribute *ext4_attrs[] = {
ATTR_LIST(mb_order2_req),
ATTR_LIST(mb_stream_req),
ATTR_LIST(mb_group_prealloc),
- ATTR_LIST(mb_max_inode_prealloc),
ATTR_LIST(mb_max_linear_groups),
ATTR_LIST(max_writeback_mb_bump),
ATTR_LIST(extent_max_zeroout_kb),
--
2.31.1

2022-09-29 11:39:51

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 5/8] ext4: Abstract out overlap fix/check logic in ext4_mb_normalize_request()

On Tue 27-09-22 14:46:45, Ojaswin Mujoo wrote:
> Abstract out the logic of fixing PA overlaps in ext4_mb_normalize_request to
> improve readability of code. This also makes it easier to make changes
> to the overlap logic in future.
>
> There are no functional changes in this patch
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/mballoc.c | 110 +++++++++++++++++++++++++++++-----------------
> 1 file changed, 69 insertions(+), 41 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index d1ce34888dcc..dda9a72c81d9 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4008,6 +4008,74 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
> rcu_read_unlock();
> }
>
> +/*
> + * Given an allocation context "ac" and a range "start", "end", check
> + * and adjust boundaries if the range overlaps with any of the existing
> + * preallocatoins stored in the corresponding inode of the allocation context.
> + *
> + *Parameters:
> + * ac allocation context
> + * start start of the new range
> + * end end of the new range
> + */
> +static inline void
> +ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> + ext4_lblk_t *start, ext4_lblk_t *end)
> +{
> + struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_prealloc_space *tmp_pa;
> + ext4_lblk_t new_start, new_end;
> + ext4_lblk_t tmp_pa_start, tmp_pa_end;
> +
> + new_start = *start;
> + new_end = *end;
> +
> + /* check we don't cross already preallocated blocks */
> + rcu_read_lock();
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> + if (tmp_pa->pa_deleted)
> + continue;
> + spin_lock(&tmp_pa->pa_lock);
> + if (tmp_pa->pa_deleted) {
> + spin_unlock(&tmp_pa->pa_lock);
> + continue;
> + }
> +
> + tmp_pa_start = tmp_pa->pa_lstart;
> + tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> +
> + /* PA must not overlap original request */
> + BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
> + ac->ac_o_ex.fe_logical < tmp_pa_start));
> +
> + /* skip PAs this normalized request doesn't overlap with */
> + if (tmp_pa_start >= new_end || tmp_pa_end <= new_start) {
> + spin_unlock(&tmp_pa->pa_lock);
> + continue;
> + }
> + BUG_ON(tmp_pa_start <= new_start && tmp_pa_end >= new_end);
> +
> + /* adjust start or end to be adjacent to this pa */
> + if (tmp_pa_end <= ac->ac_o_ex.fe_logical) {
> + BUG_ON(tmp_pa_end < new_start);
> + new_start = tmp_pa_end;
> + } else if (tmp_pa_start > ac->ac_o_ex.fe_logical) {
> + BUG_ON(tmp_pa_start > new_end);
> + new_end = tmp_pa_start;
> + }
> + spin_unlock(&tmp_pa->pa_lock);
> + }
> + rcu_read_unlock();
> +
> + /* XXX: extra loop to check we really don't overlap preallocations */
> + ext4_mb_pa_assert_overlap(ac, new_start, new_end);
> +
> + *start = new_start;
> + *end = new_end;
> + return;
> +}
> +
> /*
> * Normalization means making request better in terms of
> * size and alignment
> @@ -4022,9 +4090,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> loff_t size, start_off;
> loff_t orig_size __maybe_unused;
> ext4_lblk_t start;
> - struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
> - struct ext4_prealloc_space *tmp_pa;
> - ext4_lblk_t tmp_pa_start, tmp_pa_end;
>
> /* do normalize only data requests, metadata requests
> do not need preallocation */
> @@ -4125,47 +4190,10 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>
> end = start + size;
>
> - /* check we don't cross already preallocated blocks */
> - rcu_read_lock();
> - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> - if (tmp_pa->pa_deleted)
> - continue;
> - spin_lock(&tmp_pa->pa_lock);
> - if (tmp_pa->pa_deleted) {
> - spin_unlock(&tmp_pa->pa_lock);
> - continue;
> - }
> -
> - tmp_pa_start = tmp_pa->pa_lstart;
> - tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> -
> - /* PA must not overlap original request */
> - BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
> - ac->ac_o_ex.fe_logical < tmp_pa_start));
> -
> - /* skip PAs this normalized request doesn't overlap with */
> - if (tmp_pa_start >= end || tmp_pa_end <= start) {
> - spin_unlock(&tmp_pa->pa_lock);
> - continue;
> - }
> - BUG_ON(tmp_pa_start <= start && tmp_pa_end >= end);
> + ext4_mb_pa_adjust_overlap(ac, &start, &end);
>
> - /* adjust start or end to be adjacent to this pa */
> - if (tmp_pa_end <= ac->ac_o_ex.fe_logical) {
> - BUG_ON(tmp_pa_end < start);
> - start = tmp_pa_end;
> - } else if (tmp_pa_start > ac->ac_o_ex.fe_logical) {
> - BUG_ON(tmp_pa_start > end);
> - end = tmp_pa_start;
> - }
> - spin_unlock(&tmp_pa->pa_lock);
> - }
> - rcu_read_unlock();
> size = end - start;
>
> - /* XXX: extra loop to check we really don't overlap preallocations */
> - ext4_mb_pa_assert_overlap(ac, start, end);
> -
> /*
> * In this function "start" and "size" are normalized for better
> * alignment and length such that we could preallocate more blocks.
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-29 11:41:41

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 3/8] ext4: Refactor code in ext4_mb_normalize_request() and ext4_mb_use_preallocated()

On Tue 27-09-22 14:46:43, Ojaswin Mujoo wrote:
> Change some variable names to be more consistent and
> refactor some of the code to make it easier to read.
>
> There are no functional changes in this patch
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

Looks good, although I have to say I don't find renaming pa -> tmp_pa
making the code any more readable. Anyways, feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/mballoc.c | 97 ++++++++++++++++++++++++-----------------------
> 1 file changed, 49 insertions(+), 48 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8be6f8765a6f..84950df709bb 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4000,7 +4000,8 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> loff_t orig_size __maybe_unused;
> ext4_lblk_t start;
> struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
> - struct ext4_prealloc_space *pa;
> + struct ext4_prealloc_space *tmp_pa;
> + ext4_lblk_t tmp_pa_start, tmp_pa_end;
>
> /* do normalize only data requests, metadata requests
> do not need preallocation */
> @@ -4103,56 +4104,53 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>
> /* check we don't cross already preallocated blocks */
> rcu_read_lock();
> - list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
> - ext4_lblk_t pa_end;
> -
> - if (pa->pa_deleted)
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> + if (tmp_pa->pa_deleted)
> continue;
> - spin_lock(&pa->pa_lock);
> - if (pa->pa_deleted) {
> - spin_unlock(&pa->pa_lock);
> + spin_lock(&tmp_pa->pa_lock);
> + if (tmp_pa->pa_deleted) {
> + spin_unlock(&tmp_pa->pa_lock);
> continue;
> }
>
> - pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
> - pa->pa_len);
> + tmp_pa_start = tmp_pa->pa_lstart;
> + tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
>
> /* PA must not overlap original request */
> - BUG_ON(!(ac->ac_o_ex.fe_logical >= pa_end ||
> - ac->ac_o_ex.fe_logical < pa->pa_lstart));
> + BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
> + ac->ac_o_ex.fe_logical < tmp_pa_start));
>
> /* skip PAs this normalized request doesn't overlap with */
> - if (pa->pa_lstart >= end || pa_end <= start) {
> - spin_unlock(&pa->pa_lock);
> + if (tmp_pa_start >= end || tmp_pa_end <= start) {
> + spin_unlock(&tmp_pa->pa_lock);
> continue;
> }
> - BUG_ON(pa->pa_lstart <= start && pa_end >= end);
> + BUG_ON(tmp_pa_start <= start && tmp_pa_end >= end);
>
> /* adjust start or end to be adjacent to this pa */
> - if (pa_end <= ac->ac_o_ex.fe_logical) {
> - BUG_ON(pa_end < start);
> - start = pa_end;
> - } else if (pa->pa_lstart > ac->ac_o_ex.fe_logical) {
> - BUG_ON(pa->pa_lstart > end);
> - end = pa->pa_lstart;
> + if (tmp_pa_end <= ac->ac_o_ex.fe_logical) {
> + BUG_ON(tmp_pa_end < start);
> + start = tmp_pa_end;
> + } else if (tmp_pa_start > ac->ac_o_ex.fe_logical) {
> + BUG_ON(tmp_pa_start > end);
> + end = tmp_pa_start;
> }
> - spin_unlock(&pa->pa_lock);
> + spin_unlock(&tmp_pa->pa_lock);
> }
> rcu_read_unlock();
> size = end - start;
>
> /* XXX: extra loop to check we really don't overlap preallocations */
> rcu_read_lock();
> - list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
> - ext4_lblk_t pa_end;
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> + spin_lock(&tmp_pa->pa_lock);
> + if (tmp_pa->pa_deleted == 0) {
> + tmp_pa_start = tmp_pa->pa_lstart;
> + tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
>
> - spin_lock(&pa->pa_lock);
> - if (pa->pa_deleted == 0) {
> - pa_end = pa->pa_lstart + EXT4_C2B(EXT4_SB(ac->ac_sb),
> - pa->pa_len);
> - BUG_ON(!(start >= pa_end || end <= pa->pa_lstart));
> + BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
> }
> - spin_unlock(&pa->pa_lock);
> + spin_unlock(&tmp_pa->pa_lock);
> }
> rcu_read_unlock();
>
> @@ -4362,7 +4360,8 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> int order, i;
> struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
> struct ext4_locality_group *lg;
> - struct ext4_prealloc_space *pa, *cpa = NULL;
> + struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
> + ext4_lblk_t tmp_pa_start, tmp_pa_end;
> ext4_fsblk_t goal_block;
>
> /* only data can be preallocated */
> @@ -4371,18 +4370,20 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
>
> /* first, try per-file preallocation */
> rcu_read_lock();
> - list_for_each_entry_rcu(pa, &ei->i_prealloc_list, pa_inode_list) {
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
>
> /* all fields in this condition don't change,
> * so we can skip locking for them */
> - if (ac->ac_o_ex.fe_logical < pa->pa_lstart ||
> - ac->ac_o_ex.fe_logical >= (pa->pa_lstart +
> - EXT4_C2B(sbi, pa->pa_len)))
> + tmp_pa_start = tmp_pa->pa_lstart;
> + tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> +
> + if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
> + ac->ac_o_ex.fe_logical >= tmp_pa_end)
> continue;
>
> /* non-extent files can't have physical blocks past 2^32 */
> if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
> - (pa->pa_pstart + EXT4_C2B(sbi, pa->pa_len) >
> + (tmp_pa->pa_pstart + EXT4_C2B(sbi, tmp_pa->pa_len) >
> EXT4_MAX_BLOCK_FILE_PHYS)) {
> /*
> * Since PAs don't overlap, we won't find any
> @@ -4392,16 +4393,16 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> }
>
> /* found preallocated blocks, use them */
> - spin_lock(&pa->pa_lock);
> - if (pa->pa_deleted == 0 && pa->pa_free) {
> - atomic_inc(&pa->pa_count);
> - ext4_mb_use_inode_pa(ac, pa);
> - spin_unlock(&pa->pa_lock);
> + spin_lock(&tmp_pa->pa_lock);
> + if (tmp_pa->pa_deleted == 0 && tmp_pa->pa_free) {
> + atomic_inc(&tmp_pa->pa_count);
> + ext4_mb_use_inode_pa(ac, tmp_pa);
> + spin_unlock(&tmp_pa->pa_lock);
> ac->ac_criteria = 10;
> rcu_read_unlock();
> return true;
> }
> - spin_unlock(&pa->pa_lock);
> + spin_unlock(&tmp_pa->pa_lock);
> }
> rcu_read_unlock();
>
> @@ -4425,16 +4426,16 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> */
> for (i = order; i < PREALLOC_TB_SIZE; i++) {
> rcu_read_lock();
> - list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[i],
> + list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[i],
> pa_inode_list) {
> - spin_lock(&pa->pa_lock);
> - if (pa->pa_deleted == 0 &&
> - pa->pa_free >= ac->ac_o_ex.fe_len) {
> + spin_lock(&tmp_pa->pa_lock);
> + if (tmp_pa->pa_deleted == 0 &&
> + tmp_pa->pa_free >= ac->ac_o_ex.fe_len) {
>
> cpa = ext4_mb_check_group_pa(goal_block,
> - pa, cpa);
> + tmp_pa, cpa);
> }
> - spin_unlock(&pa->pa_lock);
> + spin_unlock(&tmp_pa->pa_lock);
> }
> rcu_read_unlock();
> }
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-29 11:46:39

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 2/8] ext4: Refactor code related to freeing PAs

On Tue 27-09-22 14:46:42, Ojaswin Mujoo wrote:
> This patch makes the following changes:
>
> * Rename ext4_mb_pa_free to ext4_mb_pa_put_free
> to better reflect its purpose
>
> * Add new ext4_mb_pa_free() which only handles freeing
>
> * Refactor ext4_mb_pa_callback() to use ext4_mb_pa_free()
>
> There are no functional changes in this patch
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/mballoc.c | 29 ++++++++++++++++++++---------
> 1 file changed, 20 insertions(+), 9 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 2e3eb632a216..8be6f8765a6f 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4531,16 +4531,21 @@ static void ext4_mb_mark_pa_deleted(struct super_block *sb,
> }
> }
>
> -static void ext4_mb_pa_callback(struct rcu_head *head)
> +static void inline ext4_mb_pa_free(struct ext4_prealloc_space *pa)
> {
> - struct ext4_prealloc_space *pa;
> - pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
> -
> + BUG_ON(!pa);
> BUG_ON(atomic_read(&pa->pa_count));
> BUG_ON(pa->pa_deleted == 0);
> kmem_cache_free(ext4_pspace_cachep, pa);
> }
>
> +static void ext4_mb_pa_callback(struct rcu_head *head)
> +{
> + struct ext4_prealloc_space *pa;
> + pa = container_of(head, struct ext4_prealloc_space, u.pa_rcu);
> + ext4_mb_pa_free(pa);
> +}
> +
> /*
> * drops a reference to preallocated space descriptor
> * if this was the last reference and the space is consumed
> @@ -5067,14 +5072,20 @@ static int ext4_mb_pa_alloc(struct ext4_allocation_context *ac)
> return 0;
> }
>
> -static void ext4_mb_pa_free(struct ext4_allocation_context *ac)
> +static void ext4_mb_pa_put_free(struct ext4_allocation_context *ac)
> {
> struct ext4_prealloc_space *pa = ac->ac_pa;
>
> BUG_ON(!pa);
> ac->ac_pa = NULL;
> WARN_ON(!atomic_dec_and_test(&pa->pa_count));
> - kmem_cache_free(ext4_pspace_cachep, pa);
> + /*
> + * current function is only called due to an error or due to
> + * len of found blocks < len of requested blocks hence the PA has not
> + * been added to grp->bb_prealloc_list. So we don't need to lock it
> + */
> + pa->pa_deleted = 1;
> + ext4_mb_pa_free(pa);
> }
>
> #ifdef CONFIG_EXT4_DEBUG
> @@ -5623,13 +5634,13 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> * So we have to free this pa here itself.
> */
> if (*errp) {
> - ext4_mb_pa_free(ac);
> + ext4_mb_pa_put_free(ac);
> ext4_discard_allocated_blocks(ac);
> goto errout;
> }
> if (ac->ac_status == AC_STATUS_FOUND &&
> ac->ac_o_ex.fe_len >= ac->ac_f_ex.fe_len)
> - ext4_mb_pa_free(ac);
> + ext4_mb_pa_put_free(ac);
> }
> if (likely(ac->ac_status == AC_STATUS_FOUND)) {
> *errp = ext4_mb_mark_diskspace_used(ac, handle, reserv_clstrs);
> @@ -5648,7 +5659,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
> * If block allocation fails then the pa allocated above
> * needs to be freed here itself.
> */
> - ext4_mb_pa_free(ac);
> + ext4_mb_pa_put_free(ac);
> *errp = -ENOSPC;
> }
>
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-29 11:48:54

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 4/8] ext4: Move overlap assert logic into a separate function

On Tue 27-09-22 14:46:44, Ojaswin Mujoo wrote:
> Abstract out the logic to double check for overlaps in normalize_pa to
> a separate function. Since there has been no reports in past where we
> have seen any overlaps which hits this bug_on(), in future we can
> consider calling this function under "#ifdef AGGRESSIVE_CHECK" only.
>
> There are no functional changes in this patch
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

And I agree it might be interesting to move this code under appropriate
ifdef.

Honza

> ---
> fs/ext4/mballoc.c | 36 ++++++++++++++++++++++++------------
> 1 file changed, 24 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 84950df709bb..d1ce34888dcc 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3985,6 +3985,29 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
> mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
> }
>
> +static inline void
> +ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
> + ext4_lblk_t start, ext4_lblk_t end)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> + struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
> + struct ext4_prealloc_space *tmp_pa;
> + ext4_lblk_t tmp_pa_start, tmp_pa_end;
> +
> + rcu_read_lock();
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> + spin_lock(&tmp_pa->pa_lock);
> + if (tmp_pa->pa_deleted == 0) {
> + tmp_pa_start = tmp_pa->pa_lstart;
> + tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> +
> + BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
> + }
> + spin_unlock(&tmp_pa->pa_lock);
> + }
> + rcu_read_unlock();
> +}
> +
> /*
> * Normalization means making request better in terms of
> * size and alignment
> @@ -4141,18 +4164,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> size = end - start;
>
> /* XXX: extra loop to check we really don't overlap preallocations */
> - rcu_read_lock();
> - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> - spin_lock(&tmp_pa->pa_lock);
> - if (tmp_pa->pa_deleted == 0) {
> - tmp_pa_start = tmp_pa->pa_lstart;
> - tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> -
> - BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
> - }
> - spin_unlock(&tmp_pa->pa_lock);
> - }
> - rcu_read_unlock();
> + ext4_mb_pa_assert_overlap(ac, start, end);
>
> /*
> * In this function "start" and "size" are normalized for better
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-29 11:49:55

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 6/8] ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union

On Tue 27-09-22 14:46:46, Ojaswin Mujoo wrote:
> ** Splitting pa->pa_inode_list **
>
> Currently, we use the same pa->pa_inode_list to add a pa to either
> the inode preallocation list or the locality group preallocation list.
> For better clarity, split this list into a union of 2 list_heads and use
> either of the them based on the type of pa.
>
> ** Splitting pa->pa_obj_lock **
>
> Currently, pa->pa_obj_lock is either assigned &ei->i_prealloc_lock for
> inode PAs or lg_prealloc_lock for lg PAs, and is then used to lock the
> lists containing these PAs. Make the distinction between the 2 PA types
> clear by changing this lock to a union of 2 locks. Explicitly use the
> pa_lock_node.inode_lock for inode PAs and pa_lock_node.lg_lock for lg
> PAs.
>
> This patch is required so that the locality group preallocation code
> remains the same as in upcoming patches we are going to make changes to
> inode preallocation code to move from list to rbtree based
> implementation. This patch also makes it easier to review the upcoming
> patches.
>
> There are no functional changes in this patch.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

Nice intermediate step. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/mballoc.c | 76 +++++++++++++++++++++++++++--------------------
> fs/ext4/mballoc.h | 10 +++++--
> 2 files changed, 52 insertions(+), 34 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index dda9a72c81d9..b91710fe881f 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3995,7 +3995,7 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
> ext4_lblk_t tmp_pa_start, tmp_pa_end;
>
> rcu_read_lock();
> - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> spin_lock(&tmp_pa->pa_lock);
> if (tmp_pa->pa_deleted == 0) {
> tmp_pa_start = tmp_pa->pa_lstart;
> @@ -4033,7 +4033,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
>
> /* check we don't cross already preallocated blocks */
> rcu_read_lock();
> - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> if (tmp_pa->pa_deleted)
> continue;
> spin_lock(&tmp_pa->pa_lock);
> @@ -4410,7 +4410,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
>
> /* first, try per-file preallocation */
> rcu_read_lock();
> - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
> + list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
>
> /* all fields in this condition don't change,
> * so we can skip locking for them */
> @@ -4467,7 +4467,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> for (i = order; i < PREALLOC_TB_SIZE; i++) {
> rcu_read_lock();
> list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[i],
> - pa_inode_list) {
> + pa_node.lg_list) {
> spin_lock(&tmp_pa->pa_lock);
> if (tmp_pa->pa_deleted == 0 &&
> tmp_pa->pa_free >= ac->ac_o_ex.fe_len) {
> @@ -4640,9 +4640,15 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
> list_del(&pa->pa_group_list);
> ext4_unlock_group(sb, grp);
>
> - spin_lock(pa->pa_obj_lock);
> - list_del_rcu(&pa->pa_inode_list);
> - spin_unlock(pa->pa_obj_lock);
> + if (pa->pa_type == MB_INODE_PA) {
> + spin_lock(pa->pa_node_lock.inode_lock);
> + list_del_rcu(&pa->pa_node.inode_list);
> + spin_unlock(pa->pa_node_lock.inode_lock);
> + } else {
> + spin_lock(pa->pa_node_lock.lg_lock);
> + list_del_rcu(&pa->pa_node.lg_list);
> + spin_unlock(pa->pa_node_lock.lg_lock);
> + }
>
> call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> }
> @@ -4710,7 +4716,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> pa->pa_len = ac->ac_b_ex.fe_len;
> pa->pa_free = pa->pa_len;
> spin_lock_init(&pa->pa_lock);
> - INIT_LIST_HEAD(&pa->pa_inode_list);
> + INIT_LIST_HEAD(&pa->pa_node.inode_list);
> INIT_LIST_HEAD(&pa->pa_group_list);
> pa->pa_deleted = 0;
> pa->pa_type = MB_INODE_PA;
> @@ -4725,14 +4731,14 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
> ei = EXT4_I(ac->ac_inode);
> grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);
>
> - pa->pa_obj_lock = &ei->i_prealloc_lock;
> + pa->pa_node_lock.inode_lock = &ei->i_prealloc_lock;
> pa->pa_inode = ac->ac_inode;
>
> list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
>
> - spin_lock(pa->pa_obj_lock);
> - list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
> - spin_unlock(pa->pa_obj_lock);
> + spin_lock(pa->pa_node_lock.inode_lock);
> + list_add_rcu(&pa->pa_node.inode_list, &ei->i_prealloc_list);
> + spin_unlock(pa->pa_node_lock.inode_lock);
> atomic_inc(&ei->i_prealloc_active);
> }
>
> @@ -4764,7 +4770,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
> pa->pa_len = ac->ac_b_ex.fe_len;
> pa->pa_free = pa->pa_len;
> spin_lock_init(&pa->pa_lock);
> - INIT_LIST_HEAD(&pa->pa_inode_list);
> + INIT_LIST_HEAD(&pa->pa_node.lg_list);
> INIT_LIST_HEAD(&pa->pa_group_list);
> pa->pa_deleted = 0;
> pa->pa_type = MB_GROUP_PA;
> @@ -4780,7 +4786,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
> lg = ac->ac_lg;
> BUG_ON(lg == NULL);
>
> - pa->pa_obj_lock = &lg->lg_prealloc_lock;
> + pa->pa_node_lock.lg_lock = &lg->lg_prealloc_lock;
> pa->pa_inode = NULL;
>
> list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
> @@ -4956,9 +4962,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
> list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
>
> /* remove from object (inode or locality group) */
> - spin_lock(pa->pa_obj_lock);
> - list_del_rcu(&pa->pa_inode_list);
> - spin_unlock(pa->pa_obj_lock);
> + if (pa->pa_type == MB_GROUP_PA) {
> + spin_lock(pa->pa_node_lock.lg_lock);
> + list_del_rcu(&pa->pa_node.lg_list);
> + spin_unlock(pa->pa_node_lock.lg_lock);
> + } else {
> + spin_lock(pa->pa_node_lock.inode_lock);
> + list_del_rcu(&pa->pa_node.inode_list);
> + spin_unlock(pa->pa_node_lock.inode_lock);
> + }
>
> if (pa->pa_type == MB_GROUP_PA)
> ext4_mb_release_group_pa(&e4b, pa);
> @@ -5021,8 +5033,8 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
> spin_lock(&ei->i_prealloc_lock);
> while (!list_empty(&ei->i_prealloc_list) && needed) {
> pa = list_entry(ei->i_prealloc_list.prev,
> - struct ext4_prealloc_space, pa_inode_list);
> - BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
> + struct ext4_prealloc_space, pa_node.inode_list);
> + BUG_ON(pa->pa_node_lock.inode_lock != &ei->i_prealloc_lock);
> spin_lock(&pa->pa_lock);
> if (atomic_read(&pa->pa_count)) {
> /* this shouldn't happen often - nobody should
> @@ -5039,7 +5051,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
> if (pa->pa_deleted == 0) {
> ext4_mb_mark_pa_deleted(sb, pa);
> spin_unlock(&pa->pa_lock);
> - list_del_rcu(&pa->pa_inode_list);
> + list_del_rcu(&pa->pa_node.inode_list);
> list_add(&pa->u.pa_tmp_list, &list);
> needed--;
> continue;
> @@ -5331,7 +5343,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
>
> spin_lock(&lg->lg_prealloc_lock);
> list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[order],
> - pa_inode_list,
> + pa_node.lg_list,
> lockdep_is_held(&lg->lg_prealloc_lock)) {
> spin_lock(&pa->pa_lock);
> if (atomic_read(&pa->pa_count)) {
> @@ -5354,7 +5366,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
> ext4_mb_mark_pa_deleted(sb, pa);
> spin_unlock(&pa->pa_lock);
>
> - list_del_rcu(&pa->pa_inode_list);
> + list_del_rcu(&pa->pa_node.lg_list);
> list_add(&pa->u.pa_tmp_list, &discard_list);
>
> total_entries--;
> @@ -5415,7 +5427,7 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
> /* Add the prealloc space to lg */
> spin_lock(&lg->lg_prealloc_lock);
> list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
> - pa_inode_list,
> + pa_node.lg_list,
> lockdep_is_held(&lg->lg_prealloc_lock)) {
> spin_lock(&tmp_pa->pa_lock);
> if (tmp_pa->pa_deleted) {
> @@ -5424,8 +5436,8 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
> }
> if (!added && pa->pa_free < tmp_pa->pa_free) {
> /* Add to the tail of the previous entry */
> - list_add_tail_rcu(&pa->pa_inode_list,
> - &tmp_pa->pa_inode_list);
> + list_add_tail_rcu(&pa->pa_node.lg_list,
> + &tmp_pa->pa_node.lg_list);
> added = 1;
> /*
> * we want to count the total
> @@ -5436,7 +5448,7 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
> lg_prealloc_count++;
> }
> if (!added)
> - list_add_tail_rcu(&pa->pa_inode_list,
> + list_add_tail_rcu(&pa->pa_node.lg_list,
> &lg->lg_prealloc_list[order]);
> spin_unlock(&lg->lg_prealloc_lock);
>
> @@ -5492,9 +5504,9 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
> * doesn't grow big.
> */
> if (likely(pa->pa_free)) {
> - spin_lock(pa->pa_obj_lock);
> - list_del_rcu(&pa->pa_inode_list);
> - spin_unlock(pa->pa_obj_lock);
> + spin_lock(pa->pa_node_lock.lg_lock);
> + list_del_rcu(&pa->pa_node.lg_list);
> + spin_unlock(pa->pa_node_lock.lg_lock);
> ext4_mb_add_n_trim(ac);
> }
> }
> @@ -5504,9 +5516,9 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
> * treat per-inode prealloc list as a lru list, then try
> * to trim the least recently used PA.
> */
> - spin_lock(pa->pa_obj_lock);
> - list_move(&pa->pa_inode_list, &ei->i_prealloc_list);
> - spin_unlock(pa->pa_obj_lock);
> + spin_lock(pa->pa_node_lock.inode_lock);
> + list_move(&pa->pa_node.inode_list, &ei->i_prealloc_list);
> + spin_unlock(pa->pa_node_lock.inode_lock);
> }
>
> ext4_mb_put_pa(ac, ac->ac_sb, pa);
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index dcda2a943cee..398a6688c341 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -114,7 +114,10 @@ struct ext4_free_data {
> };
>
> struct ext4_prealloc_space {
> - struct list_head pa_inode_list;
> + union {
> + struct list_head inode_list; /* for inode PAs */
> + struct list_head lg_list; /* for lg PAs */
> + } pa_node;
> struct list_head pa_group_list;
> union {
> struct list_head pa_tmp_list;
> @@ -128,7 +131,10 @@ struct ext4_prealloc_space {
> ext4_grpblk_t pa_len; /* len of preallocated chunk */
> ext4_grpblk_t pa_free; /* how many blocks are free */
> unsigned short pa_type; /* pa type. inode or group */
> - spinlock_t *pa_obj_lock;
> + union {
> + spinlock_t *inode_lock; /* locks the inode list holding this PA */
> + spinlock_t *lg_lock; /* locks the lg list holding this PA */
> + } pa_node_lock;
> struct inode *pa_inode; /* hack, for history only */
> };
>
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-29 12:53:54

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

On Tue 27-09-22 14:46:47, Ojaswin Mujoo wrote:
> Currently, the kernel uses i_prealloc_list to hold all the inode
> preallocations. This is known to cause degradation in performance in
> workloads which perform large number of sparse writes on a single file.
> This is mainly because functions like ext4_mb_normalize_request() and
> ext4_mb_use_preallocated() iterate over this complete list, resulting in
> slowdowns when large number of PAs are present.
>
> Patch 27bc446e2 partially fixed this by enforcing a limit of 512 for
> the inode preallocation list and adding logic to continually trim the
> list if it grows above the threshold, however our testing revealed that
> a hardcoded value is not suitable for all kinds of workloads.
>
> To optimize this, add an rbtree to the inode and hold the inode
> preallocations in this rbtree. This will make iterating over inode PAs
> faster and scale much better than a linked list. Additionally, we also
> had to remove the LRU logic that was added during trimming of the list
> (in ext4_mb_release_context()) as it will add extra overhead in rbtree.
> The discards now happen in the lowest-logical-offset-first order.
>
> ** Locking notes **
>
> With the introduction of rbtree to maintain inode PAs, we can't use RCU
> to walk the tree for searching since it can result in partial traversals
> which might miss some nodes(or entire subtrees) while discards happen
> in parallel (which happens under a lock). Hence this patch converts the
> ei->i_prealloc_lock spin_lock to rw_lock.
>
> Almost all the codepaths that read/modify the PA rbtrees are protected
> by the higher level inode->i_data_sem (except
> ext4_mb_discard_group_preallocations() and ext4_clear_inode()) IIUC, the
> only place we need lock protection is when one thread is reading
> "searching" the PA rbtree (earlier protected under rcu_read_lock()) and
> another is "deleting" the PAs in ext4_mb_discard_group_preallocations()
> function (which iterates all the PAs using the grp->bb_prealloc_list and
> deletes PAs from the tree without taking any inode lock (i_data_sem)).
>
> So, this patch converts all rcu_read_lock/unlock() paths for inode list
> PA to use read_lock() and all places where we were using
> ei->i_prealloc_lock spinlock will now be using write_lock().
>
> Note that this makes the fast path (searching of the right PA e.g.
> ext4_mb_use_preallocated() or ext4_mb_normalize_request()), now use
> read_lock() instead of rcu_read_lock/unlock(). Ths also will now block
> due to slow discard path (ext4_mb_discard_group_preallocations()) which
> uses write_lock().
>
> But this is not as bad as it looks. This is because -
>
> 1. The slow path only occurs when the normal allocation failed and we
> can say that we are low on disk space. One can argue this scenario
> won't be much frequent.
>
> 2. ext4_mb_discard_group_preallocations(), locks and unlocks the rwlock
> for deleting every individual PA. This gives enough opportunity for
> the fast path to acquire the read_lock for searching the PA inode
> list.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

I've found couple of smaller issues. See below.

> ---
> fs/ext4/ext4.h | 4 +-
> fs/ext4/mballoc.c | 192 ++++++++++++++++++++++++++++++++--------------
> fs/ext4/mballoc.h | 6 +-
> fs/ext4/super.c | 4 +-
> 4 files changed, 140 insertions(+), 66 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3bf9a6926798..d54b972f1f0f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1120,8 +1120,8 @@ struct ext4_inode_info {
>
> /* mballoc */
> atomic_t i_prealloc_active;
> - struct list_head i_prealloc_list;
> - spinlock_t i_prealloc_lock;
> + struct rb_root i_prealloc_node;
> + rwlock_t i_prealloc_lock;
>
> /* extents status tree */
> struct ext4_es_tree i_es_tree;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index b91710fe881f..cd19b9e84767 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3985,6 +3985,24 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
> mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
> }
>
> +/*
> + * This function returns the next element to look at during inode
> + * PA rbtree walk. We assume that we have held the inode PA rbtree lock
> + * (ei->i_prealloc_lock)
> + *
> + * new_start The start of the range we want to compare
> + * cur_start The existing start that we are comparing against
> + * node The node of the rb_tree
> + */
> +static inline struct rb_node*
> +ext4_mb_pa_rb_next_iter(int new_start, int cur_start, struct rb_node *node)

These need to be ext4_lblk_t, not int.

> +{
> + if (new_start < cur_start)
> + return node->rb_left;
> + else
> + return node->rb_right;
> +}
> +

> @@ -4032,19 +4055,29 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> new_end = *end;
>
> /* check we don't cross already preallocated blocks */
> - rcu_read_lock();
> - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> - if (tmp_pa->pa_deleted)
> + read_lock(&ei->i_prealloc_lock);
> + iter = ei->i_prealloc_node.rb_node;
> + while (iter) {

Perhaps this would be nicer as a for-cycle? Like:

for (iter = ei->i_prealloc_node.rb_node; iter;
iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter))

> + tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
> + pa_node.inode_node);
> + tmp_pa_start = tmp_pa->pa_lstart;
> + tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> +
> + /*
> + * If pa is deleted, ignore overlaps and just iterate in rbtree
> + * based on tmp_pa_start
> + */
> + if (tmp_pa->pa_deleted) {
> + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> continue;
> + }
> spin_lock(&tmp_pa->pa_lock);
> if (tmp_pa->pa_deleted) {
> spin_unlock(&tmp_pa->pa_lock);
> + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> continue;
> }
>
> - tmp_pa_start = tmp_pa->pa_lstart;
> - tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> -
> /* PA must not overlap original request */
> BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
> ac->ac_o_ex.fe_logical < tmp_pa_start));
> @@ -4052,6 +4085,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> /* skip PAs this normalized request doesn't overlap with */
> if (tmp_pa_start >= new_end || tmp_pa_end <= new_start) {
> spin_unlock(&tmp_pa->pa_lock);
> + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> continue;
> }
> BUG_ON(tmp_pa_start <= new_start && tmp_pa_end >= new_end);
> @@ -4065,8 +4099,9 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> new_end = tmp_pa_start;
> }
> spin_unlock(&tmp_pa->pa_lock);
> + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> }
> - rcu_read_unlock();
> + read_unlock(&ei->i_prealloc_lock);

....

> @@ -4409,17 +4444,23 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> return false;
>
> /* first, try per-file preallocation */
> - rcu_read_lock();
> - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> + read_lock(&ei->i_prealloc_lock);
> + iter = ei->i_prealloc_node.rb_node;
> + while (iter) {

Again, for-cycle would look more natural here.

> + tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);
>
> /* all fields in this condition don't change,
> * so we can skip locking for them */
> tmp_pa_start = tmp_pa->pa_lstart;
> tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
>
> + /* original request start doesn't lie in this PA */
> if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
> - ac->ac_o_ex.fe_logical >= tmp_pa_end)
> + ac->ac_o_ex.fe_logical >= tmp_pa_end) {
> + iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
> + tmp_pa_start, iter);
> continue;
> + }
>
> /* non-extent files can't have physical blocks past 2^32 */
> if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
> @@ -4439,12 +4480,14 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> ext4_mb_use_inode_pa(ac, tmp_pa);
> spin_unlock(&tmp_pa->pa_lock);
> ac->ac_criteria = 10;
> - rcu_read_unlock();
> + read_unlock(&ei->i_prealloc_lock);
> return true;
> }
> spin_unlock(&tmp_pa->pa_lock);
> + iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
> + tmp_pa_start, iter);
> }
> - rcu_read_unlock();
> + read_unlock(&ei->i_prealloc_lock);
>
> /* can we use group allocation? */
> if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
> @@ -4596,6 +4639,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
> {
> ext4_group_t grp;
> ext4_fsblk_t grp_blk;
> + struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
>
> /* in this short window concurrent discard can set pa_deleted */
> spin_lock(&pa->pa_lock);
> @@ -4641,16 +4685,51 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
> ext4_unlock_group(sb, grp);
>
> if (pa->pa_type == MB_INODE_PA) {
> - spin_lock(pa->pa_node_lock.inode_lock);
> - list_del_rcu(&pa->pa_node.inode_list);
> - spin_unlock(pa->pa_node_lock.inode_lock);
> + write_lock(pa->pa_node_lock.inode_lock);
> + rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
> + write_unlock(pa->pa_node_lock.inode_lock);
> + ext4_mb_pa_free(pa);
> } else {
> spin_lock(pa->pa_node_lock.lg_lock);
> list_del_rcu(&pa->pa_node.lg_list);
> spin_unlock(pa->pa_node_lock.lg_lock);
> + call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> }
> +}
> +
> +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> + int (*cmp)(struct rb_node *, struct rb_node *))
> +{

Given this has only one callsite, why not just inline ext4_mb_pa_cmp()
directly into this function?

> + struct rb_node **iter = &root->rb_node, *parent = NULL;
> +
> + while (*iter) {
> + parent = *iter;
> + if (cmp(new, *iter) > 0)
> + iter = &((*iter)->rb_left);
> + else
> + iter = &((*iter)->rb_right);
> + }
> +
> + rb_link_node(new, parent, iter);
> + rb_insert_color(new, root);
> +}
> +
> +static int ext4_mb_pa_cmp(struct rb_node *new, struct rb_node *cur)
> +{
> + ext4_grpblk_t cur_start, new_start;

This should be ext4_lblk_t to match with pa->pa_lstart...

> + struct ext4_prealloc_space *cur_pa = rb_entry(cur,
> + struct ext4_prealloc_space,
> + pa_node.inode_node);
> + struct ext4_prealloc_space *new_pa = rb_entry(new,
> + struct ext4_prealloc_space,
> + pa_node.inode_node);
> + cur_start = cur_pa->pa_lstart;
> + new_start = new_pa->pa_lstart;
>
> - call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> + if (new_start < cur_start)
> + return 1;
> + else
> + return -1;
> }

Here and in ext4_mb_rb_insert() the comparison seems to be reversed (thus
effectively canceling out) but it is still confusing. Usually if we have
cmp(a,b) functions then if a < b we return -1, if a > b we return 1.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-09-29 12:57:29

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 8/8] ext4: Remove the logic to trim inode PAs

On Tue 27-09-22 14:46:48, Ojaswin Mujoo wrote:
> Earlier, inode PAs were stored in a linked list. This caused a need to
> periodically trim the list down inorder to avoid growing it to a very
> large size, as this would severly affect performance during list
> iteration.
>
> Recent patches changed this list to an rbtree, and since the tree scales
> up much better, we no longer need to have the trim functionality, hence
> remove it.
>
> Signed-off-by: Ojaswin Mujoo <[email protected]>
> Reviewed-by: Ritesh Harjani (IBM) <[email protected]>

I'm kind of wondering: Now there won't be performance issues with much
more inode PAs but probably we don't want to let them grow completely out
of control? E.g. I can imagine that if we'd have 1 billion of inode PAs
attached to an inode, things would get wonky both in terms of memory
consumption and also in terms of CPU time spent for the cases where we
still do iterate all of the PAs... Is there anything which keeps inode PAs
reasonably bounded?

Honza

> ---
> Documentation/admin-guide/ext4.rst | 3 ---
> fs/ext4/ext4.h | 1 -
> fs/ext4/mballoc.c | 20 --------------------
> fs/ext4/mballoc.h | 5 -----
> fs/ext4/sysfs.c | 2 --
> 5 files changed, 31 deletions(-)
>
> diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
> index 4c559e08d11e..5740d85439ff 100644
> --- a/Documentation/admin-guide/ext4.rst
> +++ b/Documentation/admin-guide/ext4.rst
> @@ -489,9 +489,6 @@ Files in /sys/fs/ext4/<devname>:
> multiple of this tuning parameter if the stripe size is not set in the
> ext4 superblock
>
> - mb_max_inode_prealloc
> - The maximum length of per-inode ext4_prealloc_space list.
> -
> mb_max_to_scan
> The maximum number of extents the multiblock allocator will search to
> find the best extent.
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index d54b972f1f0f..bca4b41cc192 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1612,7 +1612,6 @@ struct ext4_sb_info {
> unsigned int s_mb_stats;
> unsigned int s_mb_order2_reqs;
> unsigned int s_mb_group_prealloc;
> - unsigned int s_mb_max_inode_prealloc;
> unsigned int s_max_dir_size_kb;
> /* where last allocation was done - for stream allocation */
> unsigned long s_mb_last_group;
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index cd19b9e84767..57e1ec88477a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3420,7 +3420,6 @@ int ext4_mb_init(struct super_block *sb)
> sbi->s_mb_stats = MB_DEFAULT_STATS;
> sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
> sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
> - sbi->s_mb_max_inode_prealloc = MB_DEFAULT_MAX_INODE_PREALLOC;
> /*
> * The default group preallocation is 512, which for 4k block
> * sizes translates to 2 megabytes. However for bigalloc file
> @@ -5546,29 +5545,11 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
> return ;
> }
>
> -/*
> - * if per-inode prealloc list is too long, trim some PA
> - */
> -static void ext4_mb_trim_inode_pa(struct inode *inode)
> -{
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> - int count, delta;
> -
> - count = atomic_read(&ei->i_prealloc_active);
> - delta = (sbi->s_mb_max_inode_prealloc >> 2) + 1;
> - if (count > sbi->s_mb_max_inode_prealloc + delta) {
> - count -= sbi->s_mb_max_inode_prealloc;
> - ext4_discard_preallocations(inode, count);
> - }
> -}
> -
> /*
> * release all resource we used in allocation
> */
> static int ext4_mb_release_context(struct ext4_allocation_context *ac)
> {
> - struct inode *inode = ac->ac_inode;
> struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> struct ext4_prealloc_space *pa = ac->ac_pa;
> if (pa) {
> @@ -5604,7 +5585,6 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
> if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC)
> mutex_unlock(&ac->ac_lg->lg_mutex);
> ext4_mb_collect_stats(ac);
> - ext4_mb_trim_inode_pa(inode);
> return 0;
> }
>
> diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
> index f8e8ee493867..6d85ee8674a6 100644
> --- a/fs/ext4/mballoc.h
> +++ b/fs/ext4/mballoc.h
> @@ -73,11 +73,6 @@
> */
> #define MB_DEFAULT_GROUP_PREALLOC 512
>
> -/*
> - * maximum length of inode prealloc list
> - */
> -#define MB_DEFAULT_MAX_INODE_PREALLOC 512
> -
> /*
> * Number of groups to search linearly before performing group scanning
> * optimization.
> diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
> index d233c24ea342..f0d42cf44c71 100644
> --- a/fs/ext4/sysfs.c
> +++ b/fs/ext4/sysfs.c
> @@ -214,7 +214,6 @@ EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
> EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
> EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
> EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
> -EXT4_RW_ATTR_SBI_UI(mb_max_inode_prealloc, s_mb_max_inode_prealloc);
> EXT4_RW_ATTR_SBI_UI(mb_max_linear_groups, s_mb_max_linear_groups);
> EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
> EXT4_ATTR(trigger_fs_error, 0200, trigger_test_error);
> @@ -264,7 +263,6 @@ static struct attribute *ext4_attrs[] = {
> ATTR_LIST(mb_order2_req),
> ATTR_LIST(mb_stream_req),
> ATTR_LIST(mb_group_prealloc),
> - ATTR_LIST(mb_max_inode_prealloc),
> ATTR_LIST(mb_max_linear_groups),
> ATTR_LIST(max_writeback_mb_bump),
> ATTR_LIST(extent_max_zeroout_kb),
> --
> 2.31.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-10-03 11:41:25

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [RFC v3 7/8] ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list

Hi Jan,

Thank you for the review. I've added some comments inline.

On Thu, Sep 29, 2022 at 02:39:26PM +0200, Jan Kara wrote:
> On Tue 27-09-22 14:46:47, Ojaswin Mujoo wrote:
>
> I've found couple of smaller issues. See below.
>
> > ---
> > fs/ext4/ext4.h | 4 +-
> > fs/ext4/mballoc.c | 192 ++++++++++++++++++++++++++++++++--------------
> > fs/ext4/mballoc.h | 6 +-
> > fs/ext4/super.c | 4 +-
> > 4 files changed, 140 insertions(+), 66 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 3bf9a6926798..d54b972f1f0f 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1120,8 +1120,8 @@ struct ext4_inode_info {
> >
> > /* mballoc */
> > atomic_t i_prealloc_active;
> > - struct list_head i_prealloc_list;
> > - spinlock_t i_prealloc_lock;
> > + struct rb_root i_prealloc_node;
> > + rwlock_t i_prealloc_lock;
> >
> > /* extents status tree */
> > struct ext4_es_tree i_es_tree;
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index b91710fe881f..cd19b9e84767 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -3985,6 +3985,24 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
> > mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
> > }
> >
> > +/*
> > + * This function returns the next element to look at during inode
> > + * PA rbtree walk. We assume that we have held the inode PA rbtree lock
> > + * (ei->i_prealloc_lock)
> > + *
> > + * new_start The start of the range we want to compare
> > + * cur_start The existing start that we are comparing against
> > + * node The node of the rb_tree
> > + */
> > +static inline struct rb_node*
> > +ext4_mb_pa_rb_next_iter(int new_start, int cur_start, struct rb_node *node)
>
> These need to be ext4_lblk_t, not int.

Noted. Will fix in next version.
>
> > +{
> > + if (new_start < cur_start)
> > + return node->rb_left;
> > + else
> > + return node->rb_right;
> > +}
> > +
>
> > @@ -4032,19 +4055,29 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> > new_end = *end;
> >
> > /* check we don't cross already preallocated blocks */
> > - rcu_read_lock();
> > - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> > - if (tmp_pa->pa_deleted)
> > + read_lock(&ei->i_prealloc_lock);
> > + iter = ei->i_prealloc_node.rb_node;
> > + while (iter) {
>
> Perhaps this would be nicer as a for-cycle? Like:
>
> for (iter = ei->i_prealloc_node.rb_node; iter;
> iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter))
>

Right, I agree. Will do.
> > + tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
> > + pa_node.inode_node);
> > + tmp_pa_start = tmp_pa->pa_lstart;
> > + tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> > +
> > + /*
> > + * If pa is deleted, ignore overlaps and just iterate in rbtree
> > + * based on tmp_pa_start
> > + */
> > + if (tmp_pa->pa_deleted) {
> > + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> > continue;
> > + }
> > spin_lock(&tmp_pa->pa_lock);
> > if (tmp_pa->pa_deleted) {
> > spin_unlock(&tmp_pa->pa_lock);
> > + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> > continue;
> > }
> >
> > - tmp_pa_start = tmp_pa->pa_lstart;
> > - tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> > -
> > /* PA must not overlap original request */
> > BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
> > ac->ac_o_ex.fe_logical < tmp_pa_start));
> > @@ -4052,6 +4085,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> > /* skip PAs this normalized request doesn't overlap with */
> > if (tmp_pa_start >= new_end || tmp_pa_end <= new_start) {
> > spin_unlock(&tmp_pa->pa_lock);
> > + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> > continue;
> > }
> > BUG_ON(tmp_pa_start <= new_start && tmp_pa_end >= new_end);
> > @@ -4065,8 +4099,9 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
> > new_end = tmp_pa_start;
> > }
> > spin_unlock(&tmp_pa->pa_lock);
> > + iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter);
> > }
> > - rcu_read_unlock();
> > + read_unlock(&ei->i_prealloc_lock);
>
> ....
>
> > @@ -4409,17 +4444,23 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> > return false;
> >
> > /* first, try per-file preallocation */
> > - rcu_read_lock();
> > - list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
> > + read_lock(&ei->i_prealloc_lock);
> > + iter = ei->i_prealloc_node.rb_node;
> > + while (iter) {
>
> Again, for-cycle would look more natural here.
>
> > + tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);
> >
> > /* all fields in this condition don't change,
> > * so we can skip locking for them */
> > tmp_pa_start = tmp_pa->pa_lstart;
> > tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
> >
> > + /* original request start doesn't lie in this PA */
> > if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
> > - ac->ac_o_ex.fe_logical >= tmp_pa_end)
> > + ac->ac_o_ex.fe_logical >= tmp_pa_end) {
> > + iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
> > + tmp_pa_start, iter);
> > continue;
> > + }
> >
> > /* non-extent files can't have physical blocks past 2^32 */
> > if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)) &&
> > @@ -4439,12 +4480,14 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
> > ext4_mb_use_inode_pa(ac, tmp_pa);
> > spin_unlock(&tmp_pa->pa_lock);
> > ac->ac_criteria = 10;
> > - rcu_read_unlock();
> > + read_unlock(&ei->i_prealloc_lock);
> > return true;
> > }
> > spin_unlock(&tmp_pa->pa_lock);
> > + iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical,
> > + tmp_pa_start, iter);
> > }
> > - rcu_read_unlock();
> > + read_unlock(&ei->i_prealloc_lock);
> >
> > /* can we use group allocation? */
> > if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
> > @@ -4596,6 +4639,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
> > {
> > ext4_group_t grp;
> > ext4_fsblk_t grp_blk;
> > + struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
> >
> > /* in this short window concurrent discard can set pa_deleted */
> > spin_lock(&pa->pa_lock);
> > @@ -4641,16 +4685,51 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
> > ext4_unlock_group(sb, grp);
> >
> > if (pa->pa_type == MB_INODE_PA) {
> > - spin_lock(pa->pa_node_lock.inode_lock);
> > - list_del_rcu(&pa->pa_node.inode_list);
> > - spin_unlock(pa->pa_node_lock.inode_lock);
> > + write_lock(pa->pa_node_lock.inode_lock);
> > + rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
> > + write_unlock(pa->pa_node_lock.inode_lock);
> > + ext4_mb_pa_free(pa);
> > } else {
> > spin_lock(pa->pa_node_lock.lg_lock);
> > list_del_rcu(&pa->pa_node.lg_list);
> > spin_unlock(pa->pa_node_lock.lg_lock);
> > + call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> > }
> > +}
> > +
> > +static void ext4_mb_rb_insert(struct rb_root *root, struct rb_node *new,
> > + int (*cmp)(struct rb_node *, struct rb_node *))
> > +{
>
> Given this has only one callsite, why not just inline ext4_mb_pa_cmp()
> directly into this function?
>
> > + struct rb_node **iter = &root->rb_node, *parent = NULL;
> > +
> > + while (*iter) {
> > + parent = *iter;
> > + if (cmp(new, *iter) > 0)
> > + iter = &((*iter)->rb_left);
> > + else
> > + iter = &((*iter)->rb_right);
> > + }
> > +
> > + rb_link_node(new, parent, iter);
> > + rb_insert_color(new, root);
> > +}
> > +
> > +static int ext4_mb_pa_cmp(struct rb_node *new, struct rb_node *cur)
> > +{
> > + ext4_grpblk_t cur_start, new_start;
>
> This should be ext4_lblk_t to match with pa->pa_lstart...

Noted, thanks.
>
> > + struct ext4_prealloc_space *cur_pa = rb_entry(cur,
> > + struct ext4_prealloc_space,
> > + pa_node.inode_node);
> > + struct ext4_prealloc_space *new_pa = rb_entry(new,
> > + struct ext4_prealloc_space,
> > + pa_node.inode_node);
> > + cur_start = cur_pa->pa_lstart;
> > + new_start = new_pa->pa_lstart;
> >
> > - call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
> > + if (new_start < cur_start)
> > + return 1;
> > + else
> > + return -1;
> > }
>
> Here and in ext4_mb_rb_insert() the comparison seems to be reversed (thus
> effectively canceling out) but it is still confusing. Usually if we have
> cmp(a,b) functions then if a < b we return -1, if a > b we return 1.

Hmm so for ext4_mb_rb_insert(), it was already defined when I started
with the pathset so I just reused it with a new comparator
ext4_mb_pa_cmp(). While rebasing, I noticed that ext4_mb_rb_insert()
function was removed since we didn't need the rbtree after your changes
to CR1, so I just added it as it is.

But you are correct, we should modify ext4_mb_rb_insert to make the
return values more intuitive. I'll fix this.
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

Regards,
ojaswin

2022-10-06 07:04:06

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [RFC v3 8/8] ext4: Remove the logic to trim inode PAs

On Thu, Sep 29, 2022 at 02:53:11PM +0200, Jan Kara wrote:
> On Tue 27-09-22 14:46:48, Ojaswin Mujoo wrote:
> > Earlier, inode PAs were stored in a linked list. This caused a need to
> > periodically trim the list down inorder to avoid growing it to a very
> > large size, as this would severly affect performance during list
> > iteration.
> >
> > Recent patches changed this list to an rbtree, and since the tree scales
> > up much better, we no longer need to have the trim functionality, hence
> > remove it.
> >
> > Signed-off-by: Ojaswin Mujoo <[email protected]>
> > Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
>
> I'm kind of wondering: Now there won't be performance issues with much
> more inode PAs but probably we don't want to let them grow completely out
> of control? E.g. I can imagine that if we'd have 1 billion of inode PAs
> attached to an inode, things would get wonky both in terms of memory
> consumption and also in terms of CPU time spent for the cases where we
> still do iterate all of the PAs... Is there anything which keeps inode PAs
> reasonably bounded?
>
> Honza
>
Hi Jan,

Sorry for the delay in response, I was on leave for the last few days.

So as per my understanding, after this patch, the only path where we
would need to traverse all the PAs is the ext4_discard_preallocations()
call where we discard all the PAs of an inode one by one (eg when
closing the file etc). Such a discard is a colder path as we don't
usually expect to do it as often as say allocating blocks to an inode.

Originally, the limit was added in this patch [1] because of the time
lost in O(N) traversal in the allocation path (ext4_mb_use_preallocated
and ext4_mb_normalize_request). Since the rbtree addressed this
scalability issue we had decided to remove the trim logic in this
patchset.

[1]
https://lore.kernel.org/all/[email protected]/

That being said, I do agree that there should be some way to limit the
PAs from taking up an unreasonable amount of buddy space, memory and CPU
cycles in use cases like database files and disk files of long running
VMs. Previously the limit was 512 PAs per inode and trim was happening
in an LRU fashion, which is not very straightforward to implement in
trees.

Another approach is rather than having a hard limit, we can throttle the
PAs based on some parameter like total active PAs in FS or FSUtil% of
the PAs but we might need to take care of fairness so one inode is not
holding all the PAs while others get throttled.

Anyways, I think the trimming part would need some brainstorming to get
right so just wondering if we could keep that as part of a separate
patchset and remove the trimming logic for now since rbtree has
addressed the scalability concerns in allocation path.

Do let me know your thoughts on this.

Regards,
Ojaswin

2022-10-06 09:10:06

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC v3 8/8] ext4: Remove the logic to trim inode PAs

On Thu 06-10-22 12:25:00, Ojaswin Mujoo wrote:
> On Thu, Sep 29, 2022 at 02:53:11PM +0200, Jan Kara wrote:
> > On Tue 27-09-22 14:46:48, Ojaswin Mujoo wrote:
> > > Earlier, inode PAs were stored in a linked list. This caused a need to
> > > periodically trim the list down inorder to avoid growing it to a very
> > > large size, as this would severly affect performance during list
> > > iteration.
> > >
> > > Recent patches changed this list to an rbtree, and since the tree scales
> > > up much better, we no longer need to have the trim functionality, hence
> > > remove it.
> > >
> > > Signed-off-by: Ojaswin Mujoo <[email protected]>
> > > Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
> >
> > I'm kind of wondering: Now there won't be performance issues with much
> > more inode PAs but probably we don't want to let them grow completely out
> > of control? E.g. I can imagine that if we'd have 1 billion of inode PAs
> > attached to an inode, things would get wonky both in terms of memory
> > consumption and also in terms of CPU time spent for the cases where we
> > still do iterate all of the PAs... Is there anything which keeps inode PAs
> > reasonably bounded?
> >
> > Honza
> >
> Hi Jan,
>
> Sorry for the delay in response, I was on leave for the last few days.
>
> So as per my understanding, after this patch, the only path where we
> would need to traverse all the PAs is the ext4_discard_preallocations()
> call where we discard all the PAs of an inode one by one (eg when
> closing the file etc). Such a discard is a colder path as we don't
> usually expect to do it as often as say allocating blocks to an inode.
>
> Originally, the limit was added in this patch [1] because of the time
> lost in O(N) traversal in the allocation path (ext4_mb_use_preallocated
> and ext4_mb_normalize_request). Since the rbtree addressed this
> scalability issue we had decided to remove the trim logic in this
> patchset.
>
> [1]
> https://lore.kernel.org/all/[email protected]/

I agree the O(N) traversal is not in any performance sensitive path.

> That being said, I do agree that there should be some way to limit the
> PAs from taking up an unreasonable amount of buddy space, memory and CPU
> cycles in use cases like database files and disk files of long running
> VMs. Previously the limit was 512 PAs per inode and trim was happening
> in an LRU fashion, which is not very straightforward to implement in
> trees.
>
> Another approach is rather than having a hard limit, we can throttle the
> PAs based on some parameter like total active PAs in FS or FSUtil% of
> the PAs but we might need to take care of fairness so one inode is not
> holding all the PAs while others get throttled.
>
> Anyways, I think the trimming part would need some brainstorming to get
> right so just wondering if we could keep that as part of a separate
> patchset and remove the trimming logic for now since rbtree has
> addressed the scalability concerns in allocation path.

I agree the fact it took until 2020 for someone to notice inode PAs can
be cumulating enough for full scan to matter on block allocation means that
this is not a pressing issue. So I'm OK postponing it for now since I also
don't have a great idea how to best trim excessive preallocations.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-10-06 10:05:10

by Ojaswin Mujoo

[permalink] [raw]
Subject: Re: [RFC v3 8/8] ext4: Remove the logic to trim inode PAs

On Thu, Oct 06, 2022 at 10:59:58AM +0200, Jan Kara wrote:
> On Thu 06-10-22 12:25:00, Ojaswin Mujoo wrote:
> > On Thu, Sep 29, 2022 at 02:53:11PM +0200, Jan Kara wrote:
> > > On Tue 27-09-22 14:46:48, Ojaswin Mujoo wrote:
> > > > Earlier, inode PAs were stored in a linked list. This caused a need to
> > > > periodically trim the list down inorder to avoid growing it to a very
> > > > large size, as this would severly affect performance during list
> > > > iteration.
> > > >
> > > > Recent patches changed this list to an rbtree, and since the tree scales
> > > > up much better, we no longer need to have the trim functionality, hence
> > > > remove it.
> > > >
> > > > Signed-off-by: Ojaswin Mujoo <[email protected]>
> > > > Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
> > >
> > > I'm kind of wondering: Now there won't be performance issues with much
> > > more inode PAs but probably we don't want to let them grow completely out
> > > of control? E.g. I can imagine that if we'd have 1 billion of inode PAs
> > > attached to an inode, things would get wonky both in terms of memory
> > > consumption and also in terms of CPU time spent for the cases where we
> > > still do iterate all of the PAs... Is there anything which keeps inode PAs
> > > reasonably bounded?
> > >
> > > Honza
> > >
> > Hi Jan,
> >
> > Sorry for the delay in response, I was on leave for the last few days.
> >
> > So as per my understanding, after this patch, the only path where we
> > would need to traverse all the PAs is the ext4_discard_preallocations()
> > call where we discard all the PAs of an inode one by one (eg when
> > closing the file etc). Such a discard is a colder path as we don't
> > usually expect to do it as often as say allocating blocks to an inode.
> >
> > Originally, the limit was added in this patch [1] because of the time
> > lost in O(N) traversal in the allocation path (ext4_mb_use_preallocated
> > and ext4_mb_normalize_request). Since the rbtree addressed this
> > scalability issue we had decided to remove the trim logic in this
> > patchset.
> >
> > [1]
> > https://lore.kernel.org/all/[email protected]/
>
> I agree the O(N) traversal is not in any performance sensitive path.
>
> > That being said, I do agree that there should be some way to limit the
> > PAs from taking up an unreasonable amount of buddy space, memory and CPU
> > cycles in use cases like database files and disk files of long running
> > VMs. Previously the limit was 512 PAs per inode and trim was happening
> > in an LRU fashion, which is not very straightforward to implement in
> > trees.
> >
> > Another approach is rather than having a hard limit, we can throttle the
> > PAs based on some parameter like total active PAs in FS or FSUtil% of
> > the PAs but we might need to take care of fairness so one inode is not
> > holding all the PAs while others get throttled.
> >
> > Anyways, I think the trimming part would need some brainstorming to get
> > right so just wondering if we could keep that as part of a separate
> > patchset and remove the trimming logic for now since rbtree has
> > addressed the scalability concerns in allocation path.
>
> I agree the fact it took until 2020 for someone to notice inode PAs can
> be cumulating enough for full scan to matter on block allocation means that
> this is not a pressing issue. So I'm OK postponing it for now since I also
> don't have a great idea how to best trim excessive preallocations.
>
> Honza
Right, so I think I'll post a [PATCH v1] with the changes you suggested
and keep this patch as it is for now.

Thanks,
Ojaswin
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR