LinuxLists.cc - [PATCH v2 0/8] ext4: Convert inode preallocation list to an rbtree

2022-10-14 20:40:02

Subject: [PATCH v2 0/8] ext4: Convert inode preallocation list to an rbtree

This patch series aim to improve the performance and scalability of
inode preallocation by changing inode preallocation linked list to an
rbtree. I've ran xfstests quick on this series and plan to run auto group
as well to confirm we have no regressions.

** Shortcomings of existing implementation **

Right now, we add all the inode preallocations(PAs) to a per inode linked
list ei->i_prealloc_list. To prevent the list from growing infinitely
during heavy sparse workloads, the length of this list was capped at 512
and a trimming logic was added to trim the list whenever it grew over
this threshold, in patch 27bc446e2. This was discussed in detail in the
following lore thread [1].

[1] https://lore.kernel.org/all/[email protected]/

But from our testing, we noticed that the current implementation still
had issues with scalability as the performance degraded when the PAs
stored in the list grew. Most of the degradation was seen in
ext4_mb_normalize_request() and ext4_mb_use_preallocated() functions as
they iterated the inode PA list.

** Improvements in this patchset **

To counter the above shortcomings, this patch series modifies the inode
PA list to an rbtree, which:

- improves the performance of functions discussed above due to the
improved lookup speed.

- improves scalability by changing lookup complexity from O(n) to
O(logn). We no longer need the trimming logic as well.

As a result, the RCU implementation was needed to be changed since
lockless lookups of rbtrees do have some issues like skipping
subtrees. Hence, RCU was replaced with read write locks for inode
PAs. More information can be found in Patch 7 (that has the core
changes).

** Performance Numbers **

Performance numbers were collected with and without these patches, using an
nvme device. Details of tests/benchmarks used are as follows:

Test 1: 200,000 1KiB sparse writes using (fio)
Test 2: Fill 5GiB w/ random writes, 1KiB burst size using (fio)
Test 3: Test 2, but do 4 sequential writes before jumping to random
offset (fio)
Test 4: Fill 8GB FS w/ 2KiB files, 64 threads in parallel (fsmark)

+──────────+──────────────────+────────────────+──────────────────+──────────────────+
| | nodelalloc | delalloc |
+──────────+──────────────────+────────────────+──────────────────+──────────────────+
| | Unpatched | Patched | Unpatched | Patched |
+──────────+──────────────────+────────────────+──────────────────+──────────────────+
| Test 1 | 11.8 MB/s | 23.3 MB/s | 27.2 MB/s | 63.7 MB/s |
| Test 2 | 1617 MB/s | 1740 MB/s | 2223 MB/s | 2208 MB/s |
| Test 3 | 1715 MB/s | 1823 MB/s | 2346 MB/s | 2364 MB/s |
| Test 4 | 14284 files/sec | 14347 files/s | 13762 files/sec | 13882 files/sec |
+──────────+──────────────────+────────────────+──────────────────+──────────────────+

In test 1, we almost see 100 to 200% increase in performance due to the high number
of sparse writes highlighting the bottleneck in the unpatched kernel. Further, on running
"perf diff patched.data unpatched.data" for test 1, we see something as follows:

2.83% +29.67% [kernel.vmlinux] [k] _raw_spin_lock
...
+3.33% [ext4] [k] ext4_mb_normalize_request.constprop.30
0.25% +2.81% [ext4] [k] ext4_mb_use_preallocated

Here we can see that the biggest different is in the _raw_spin_lock() function
of unpatched kernel, that is called from `ext4_mb_normalize_request()` as seen
here:

32.47% fio [kernel.vmlinux] [k] _raw_spin_lock
|
---_raw_spin_lock
|
--32.22%--ext4_mb_normalize_request.constprop.30

This is coming from the spin_lock(&pa->pa_lock) that is called for
each PA that we iterate over, in ext4_mb_normalize_request(). Since in rbtrees,
we lookup log(n) PAs rather than n PAs, this spin lock is taken less frequently,
as evident in the perf.

Furthermore, we see some improvements in other tests however since they don't
exercise the PA traversal path as much as test 1, the improvements are relatively
smaller.

** Summary of patches **

- Patch 1-5: Abstractions/Minor optimizations
- Patch 6: Split common inode & locality group specific fields to a union
- Patch 7: Core changes to move inode PA logic from list to rbtree
- Patch 8: Remove the trim logic as it is not needed

** Changes since PATCH v1 **
- fixed styling issue
- merged ext4_mb_rb_insert() and ext4_mb_pa_cmp()

** Changes since RFC v3 **
- Changed while loops to for loops in patch 7
- Fixed some data types
- Made rbtree comparison logic more intuitive. The
rbtree insertion function still kept separate from
comparison function for reusability.

** Changes since RFC v2 **
- Added a function definition that was deleted during v2 rebase

** Changes since RFC v1 **
- Rebased over ext4 dev branch which includes Jan's patchset [1]
that changed some code in mballoc.c

[1] https://lore.kernel.org/all/[email protected]/

*** BLURB HERE ***

Ojaswin Mujoo (8):
ext4: Stop searching if PA doesn't satisfy non-extent file
ext4: Refactor code related to freeing PAs
ext4: Refactor code in ext4_mb_normalize_request() and
ext4_mb_use_preallocated()
ext4: Move overlap assert logic into a separate function
ext4: Abstract out overlap fix/check logic in
ext4_mb_normalize_request()
ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union
ext4: Use rbtrees to manage PAs instead of inode i_prealloc_list
ext4: Remove the logic to trim inode PAs

Documentation/admin-guide/ext4.rst | 3 -
fs/ext4/ext4.h | 5 +-
fs/ext4/mballoc.c | 418 ++++++++++++++++++-----------
fs/ext4/mballoc.h | 17 +-
fs/ext4/super.c | 4 +-
fs/ext4/sysfs.c | 2 -
6 files changed, 274 insertions(+), 175 deletions(-)

--
2.31.1

2022-10-14 20:42:11

by Ojaswin Mujoo

[permalink] [raw]

Subject: [PATCH v2 6/8] ext4: Convert pa->pa_inode_list and pa->pa_obj_lock into a union

** Splitting pa->pa_inode_list **

Currently, we use the same pa->pa_inode_list to add a pa to either
the inode preallocation list or the locality group preallocation list.
For better clarity, split this list into a union of 2 list_heads and use
either of the them based on the type of pa.

** Splitting pa->pa_obj_lock **

Currently, pa->pa_obj_lock is either assigned &ei->i_prealloc_lock for
inode PAs or lg_prealloc_lock for lg PAs, and is then used to lock the
lists containing these PAs. Make the distinction between the 2 PA types
clear by changing this lock to a union of 2 locks. Explicitly use the
pa_lock_node.inode_lock for inode PAs and pa_lock_node.lg_lock for lg
PAs.

This patch is required so that the locality group preallocation code
remains the same as in upcoming patches we are going to make changes to
inode preallocation code to move from list to rbtree based
implementation. This patch also makes it easier to review the upcoming
patches.

There are no functional changes in this patch.

Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/mballoc.c | 76 +++++++++++++++++++++++++++--------------------
fs/ext4/mballoc.h | 10 +++++--
2 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a76fc19ed8d5..70344bb16008 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3994,7 +3994,7 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
ext4_lblk_t tmp_pa_start, tmp_pa_end;

rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted == 0) {
tmp_pa_start = tmp_pa->pa_lstart;
@@ -4032,7 +4032,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,

/* check we don't cross already preallocated blocks */
rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
if (tmp_pa->pa_deleted)
continue;
spin_lock(&tmp_pa->pa_lock);
@@ -4409,7 +4409,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)

/* first, try per-file preallocation */
rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_inode_list) {
+ list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {

/* all fields in this condition don't change,
* so we can skip locking for them */
@@ -4466,7 +4466,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
for (i = order; i < PREALLOC_TB_SIZE; i++) {
rcu_read_lock();
list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[i],
- pa_inode_list) {
+ pa_node.lg_list) {
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted == 0 &&
tmp_pa->pa_free >= ac->ac_o_ex.fe_len) {
@@ -4640,9 +4640,15 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
list_del(&pa->pa_group_list);
ext4_unlock_group(sb, grp);

- spin_lock(pa->pa_obj_lock);
- list_del_rcu(&pa->pa_inode_list);
- spin_unlock(pa->pa_obj_lock);
+ if (pa->pa_type == MB_INODE_PA) {
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_del_rcu(&pa->pa_node.inode_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
+ } else {
+ spin_lock(pa->pa_node_lock.lg_lock);
+ list_del_rcu(&pa->pa_node.lg_list);
+ spin_unlock(pa->pa_node_lock.lg_lock);
+ }

call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
}
@@ -4710,7 +4716,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
spin_lock_init(&pa->pa_lock);
- INIT_LIST_HEAD(&pa->pa_inode_list);
+ INIT_LIST_HEAD(&pa->pa_node.inode_list);
INIT_LIST_HEAD(&pa->pa_group_list);
pa->pa_deleted = 0;
pa->pa_type = MB_INODE_PA;
@@ -4725,14 +4731,14 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
ei = EXT4_I(ac->ac_inode);
grp = ext4_get_group_info(sb, ac->ac_b_ex.fe_group);

- pa->pa_obj_lock = &ei->i_prealloc_lock;
+ pa->pa_node_lock.inode_lock = &ei->i_prealloc_lock;
pa->pa_inode = ac->ac_inode;

list_add(&pa->pa_group_list, &grp->bb_prealloc_list);

- spin_lock(pa->pa_obj_lock);
- list_add_rcu(&pa->pa_inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_obj_lock);
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_add_rcu(&pa->pa_node.inode_list, &ei->i_prealloc_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
atomic_inc(&ei->i_prealloc_active);
}

@@ -4764,7 +4770,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
spin_lock_init(&pa->pa_lock);
- INIT_LIST_HEAD(&pa->pa_inode_list);
+ INIT_LIST_HEAD(&pa->pa_node.lg_list);
INIT_LIST_HEAD(&pa->pa_group_list);
pa->pa_deleted = 0;
pa->pa_type = MB_GROUP_PA;
@@ -4780,7 +4786,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
lg = ac->ac_lg;
BUG_ON(lg == NULL);

- pa->pa_obj_lock = &lg->lg_prealloc_lock;
+ pa->pa_node_lock.lg_lock = &lg->lg_prealloc_lock;
pa->pa_inode = NULL;

list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
@@ -4956,9 +4962,15 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {

/* remove from object (inode or locality group) */
- spin_lock(pa->pa_obj_lock);
- list_del_rcu(&pa->pa_inode_list);
- spin_unlock(pa->pa_obj_lock);
+ if (pa->pa_type == MB_GROUP_PA) {
+ spin_lock(pa->pa_node_lock.lg_lock);
+ list_del_rcu(&pa->pa_node.lg_list);
+ spin_unlock(pa->pa_node_lock.lg_lock);
+ } else {
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_del_rcu(&pa->pa_node.inode_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
+ }

if (pa->pa_type == MB_GROUP_PA)
ext4_mb_release_group_pa(&e4b, pa);
@@ -5021,8 +5033,8 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
spin_lock(&ei->i_prealloc_lock);
while (!list_empty(&ei->i_prealloc_list) && needed) {
pa = list_entry(ei->i_prealloc_list.prev,
- struct ext4_prealloc_space, pa_inode_list);
- BUG_ON(pa->pa_obj_lock != &ei->i_prealloc_lock);
+ struct ext4_prealloc_space, pa_node.inode_list);
+ BUG_ON(pa->pa_node_lock.inode_lock != &ei->i_prealloc_lock);
spin_lock(&pa->pa_lock);
if (atomic_read(&pa->pa_count)) {
/* this shouldn't happen often - nobody should
@@ -5039,7 +5051,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
if (pa->pa_deleted == 0) {
ext4_mb_mark_pa_deleted(sb, pa);
spin_unlock(&pa->pa_lock);
- list_del_rcu(&pa->pa_inode_list);
+ list_del_rcu(&pa->pa_node.inode_list);
list_add(&pa->u.pa_tmp_list, &list);
needed--;
continue;
@@ -5331,7 +5343,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,

spin_lock(&lg->lg_prealloc_lock);
list_for_each_entry_rcu(pa, &lg->lg_prealloc_list[order],
- pa_inode_list,
+ pa_node.lg_list,
lockdep_is_held(&lg->lg_prealloc_lock)) {
spin_lock(&pa->pa_lock);
if (atomic_read(&pa->pa_count)) {
@@ -5354,7 +5366,7 @@ ext4_mb_discard_lg_preallocations(struct super_block *sb,
ext4_mb_mark_pa_deleted(sb, pa);
spin_unlock(&pa->pa_lock);

- list_del_rcu(&pa->pa_inode_list);
+ list_del_rcu(&pa->pa_node.lg_list);
list_add(&pa->u.pa_tmp_list, &discard_list);

total_entries--;
@@ -5415,7 +5427,7 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
/* Add the prealloc space to lg */
spin_lock(&lg->lg_prealloc_lock);
list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
- pa_inode_list,
+ pa_node.lg_list,
lockdep_is_held(&lg->lg_prealloc_lock)) {
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted) {
@@ -5424,8 +5436,8 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
}
if (!added && pa->pa_free < tmp_pa->pa_free) {
/* Add to the tail of the previous entry */
- list_add_tail_rcu(&pa->pa_inode_list,
- &tmp_pa->pa_inode_list);
+ list_add_tail_rcu(&pa->pa_node.lg_list,
+ &tmp_pa->pa_node.lg_list);
added = 1;
/*
* we want to count the total
@@ -5436,7 +5448,7 @@ static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
lg_prealloc_count++;
}
if (!added)
- list_add_tail_rcu(&pa->pa_inode_list,
+ list_add_tail_rcu(&pa->pa_node.lg_list,
&lg->lg_prealloc_list[order]);
spin_unlock(&lg->lg_prealloc_lock);

@@ -5492,9 +5504,9 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
* doesn't grow big.
*/
if (likely(pa->pa_free)) {
- spin_lock(pa->pa_obj_lock);
- list_del_rcu(&pa->pa_inode_list);
- spin_unlock(pa->pa_obj_lock);
+ spin_lock(pa->pa_node_lock.lg_lock);
+ list_del_rcu(&pa->pa_node.lg_list);
+ spin_unlock(pa->pa_node_lock.lg_lock);
ext4_mb_add_n_trim(ac);
}
}
@@ -5504,9 +5516,9 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
* treat per-inode prealloc list as a lru list, then try
* to trim the least recently used PA.
*/
- spin_lock(pa->pa_obj_lock);
- list_move(&pa->pa_inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_obj_lock);
+ spin_lock(pa->pa_node_lock.inode_lock);
+ list_move(&pa->pa_node.inode_list, &ei->i_prealloc_list);
+ spin_unlock(pa->pa_node_lock.inode_lock);
}

ext4_mb_put_pa(ac, ac->ac_sb, pa);
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index dcda2a943cee..398a6688c341 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -114,7 +114,10 @@ struct ext4_free_data {
};

struct ext4_prealloc_space {
- struct list_head pa_inode_list;
+ union {
+ struct list_head inode_list; /* for inode PAs */
+ struct list_head lg_list; /* for lg PAs */
+ } pa_node;
struct list_head pa_group_list;
union {
struct list_head pa_tmp_list;
@@ -128,7 +131,10 @@ struct ext4_prealloc_space {
ext4_grpblk_t pa_len; /* len of preallocated chunk */
ext4_grpblk_t pa_free; /* how many blocks are free */
unsigned short pa_type; /* pa type. inode or group */
- spinlock_t *pa_obj_lock;
+ union {
+ spinlock_t *inode_lock; /* locks the inode list holding this PA */
+ spinlock_t *lg_lock; /* locks the lg list holding this PA */
+ } pa_node_lock;
struct inode *pa_inode; /* hack, for history only */
};

--
2.31.1

2022-11-13 13:10:21

by Ojaswin Mujoo

[permalink] [raw]

Subject: Re: [PATCH v2 0/8] ext4: Convert inode preallocation list to an rbtree

Hello,

Just a gentle ping, let me know if there are any reviews or suggestions
regarding this patchset.

Thank you!
ojaswin

On Sat, Oct 15, 2022 at 02:06:22AM +0530, Ojaswin Mujoo wrote:
> This patch series aim to improve the performance and scalability of
> inode preallocation by changing inode preallocation linked list to an
> rbtree. I've ran xfstests quick on this series and plan to run auto group
> as well to confirm we have no regressions.