Currently, the kernel uses i_prealloc_list, a linked list, to hold
all the inode preallocations (PAs). This is known to degrade
performance in workloads that perform a large number of sparse writes
to a single file. This is mainly because functions like
ext4_mb_normalize_request() and ext4_mb_use_preallocated() iterate
over the complete list, causing slowdowns when a large number of PAs
are present.
Commit 27bc446e2 partially fixed this by enforcing a limit of 512 on
the inode preallocation list and adding logic to continually trim the
list if it grows above that threshold. However, our testing revealed
that a hardcoded limit is not suitable for all kinds of workloads.
To optimize this, add an rbtree to the inode and hold the inode
preallocations in it. This makes iterating over inode PAs faster and
scales much better than a linked list. Additionally, we also had to
remove the LRU logic that was used when trimming the list (in
ext4_mb_release_context()), as maintaining it would add extra overhead
to the rbtree. Discards now happen in lowest-logical-offset-first
order.
** Locking notes **
With the introduction of an rbtree to maintain the inode PAs, we can
no longer use RCU to walk the tree during a search, since that can
result in partial traversals that miss some nodes (or entire subtrees)
while discards happen in parallel (under a lock). Hence, this patch
converts the ei->i_prealloc_lock spinlock to an rwlock.
Almost all the code paths that read or modify the PA rbtree are
protected by the higher-level inode->i_data_sem (the exceptions being
ext4_mb_discard_group_preallocations() and ext4_clear_inode()). The
only place we need lock protection is when one thread is searching the
PA rbtree (previously done under rcu_read_lock()) while another is
deleting PAs in ext4_mb_discard_group_preallocations(), which iterates
over all the PAs via grp->bb_prealloc_list and erases them from the
tree without taking any inode lock (i_data_sem).
So, this patch converts all rcu_read_lock/unlock() paths for the inode
PA tree to use read_lock(), and all places that used the
ei->i_prealloc_lock spinlock now use write_lock().
Note that this makes the fast path (searching for the right PA, e.g.
in ext4_mb_use_preallocated() or ext4_mb_normalize_request()) use
read_lock() instead of rcu_read_lock/unlock(), so it can now block on
the slow discard path (ext4_mb_discard_group_preallocations()), which
takes write_lock().
But this is not as bad as it looks, because:
1. The slow path only runs when normal allocation has failed, which
   means we are low on disk space. One can argue this scenario won't
   be very frequent.
2. ext4_mb_discard_group_preallocations() locks and unlocks the rwlock
   separately for each PA it deletes. This gives the fast path ample
   opportunity to acquire the read_lock and search the inode PA tree.
Signed-off-by: Ojaswin Mujoo <[email protected]>
Reviewed-by: Ritesh Harjani (IBM) <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +-
fs/ext4/mballoc.c | 168 ++++++++++++++++++++++++++++++----------------
fs/ext4/mballoc.h | 6 +-
fs/ext4/super.c | 4 +-
4 files changed, 118 insertions(+), 64 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9a3521e95f00..c23be3b45442 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1120,8 +1120,8 @@ struct ext4_inode_info {
/* mballoc */
atomic_t i_prealloc_active;
- struct list_head i_prealloc_list;
- spinlock_t i_prealloc_lock;
+ struct rb_root i_prealloc_node;
+ rwlock_t i_prealloc_lock;
/* extents status tree */
struct ext4_es_tree i_es_tree;
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 70344bb16008..9bbe2e099d81 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3984,6 +3984,24 @@ static void ext4_mb_normalize_group_request(struct ext4_allocation_context *ac)
mb_debug(sb, "goal %u blocks for locality group\n", ac->ac_g_ex.fe_len);
}
+/*
+ * This function returns the next element to look at during inode
+ * PA rbtree walk. We assume that we have held the inode PA rbtree lock
+ * (ei->i_prealloc_lock)
+ *
+ * new_start The start of the range we want to compare
+ * cur_start The existing start that we are comparing against
+ * node The node of the rb_tree
+ */
+static inline struct rb_node*
+ext4_mb_pa_rb_next_iter(ext4_lblk_t new_start, ext4_lblk_t cur_start, struct rb_node *node)
+{
+ if (new_start < cur_start)
+ return node->rb_left;
+ else
+ return node->rb_right;
+}
+
static inline void
ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
ext4_lblk_t start, ext4_lblk_t end)
@@ -3992,27 +4010,31 @@ ext4_mb_pa_assert_overlap(struct ext4_allocation_context *ac,
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_prealloc_space *tmp_pa;
ext4_lblk_t tmp_pa_start, tmp_pa_end;
+ struct rb_node *iter;
- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
- spin_lock(&tmp_pa->pa_lock);
- if (tmp_pa->pa_deleted == 0) {
- tmp_pa_start = tmp_pa->pa_lstart;
- tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+ read_lock(&ei->i_prealloc_lock);
+ iter = ei->i_prealloc_node.rb_node;
+ while (iter) {
+ tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+ spin_lock(&tmp_pa->pa_lock);
+ if (tmp_pa->pa_deleted == 0)
BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
- }
spin_unlock(&tmp_pa->pa_lock);
+
+ iter = ext4_mb_pa_rb_next_iter(start, tmp_pa_start, iter);
}
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);
}
-
/*
* Given an allocation context "ac" and a range "start", "end", check
* and adjust boundaries if the range overlaps with any of the existing
* preallocatoins stored in the corresponding inode of the allocation context.
*
- *Parameters:
+ * Parameters:
* ac allocation context
* start start of the new range
* end end of the new range
@@ -4024,6 +4046,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_prealloc_space *tmp_pa;
+ struct rb_node *iter;
ext4_lblk_t new_start, new_end;
ext4_lblk_t tmp_pa_start, tmp_pa_end;
@@ -4031,19 +4054,27 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
new_end = *end;
/* check we don't cross already preallocated blocks */
- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
+ read_lock(&ei->i_prealloc_lock);
+ for (iter = ei->i_prealloc_node.rb_node; iter;
+ iter = ext4_mb_pa_rb_next_iter(new_start, tmp_pa_start, iter)) {
+ tmp_pa = rb_entry(iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ tmp_pa_start = tmp_pa->pa_lstart;
+ tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+
+ /*
+ * If pa is deleted, ignore overlaps and just iterate in rbtree
+ * based on tmp_pa_start
+ */
if (tmp_pa->pa_deleted)
continue;
+
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted) {
spin_unlock(&tmp_pa->pa_lock);
continue;
}
- tmp_pa_start = tmp_pa->pa_lstart;
- tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
-
/* PA must not overlap original request */
BUG_ON(!(ac->ac_o_ex.fe_logical >= tmp_pa_end ||
ac->ac_o_ex.fe_logical < tmp_pa_start));
@@ -4065,7 +4096,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
}
spin_unlock(&tmp_pa->pa_lock);
}
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);
/* XXX: extra loop to check we really don't overlap preallocations */
ext4_mb_pa_assert_overlap(ac, new_start, new_end);
@@ -4192,7 +4223,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
ext4_mb_pa_adjust_overlap(ac, &start, &end);
size = end - start;
-
/*
* In this function "start" and "size" are normalized for better
* alignment and length such that we could preallocate more blocks.
@@ -4401,6 +4431,7 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
struct ext4_locality_group *lg;
struct ext4_prealloc_space *tmp_pa, *cpa = NULL;
ext4_lblk_t tmp_pa_start, tmp_pa_end;
+ struct rb_node *iter;
ext4_fsblk_t goal_block;
/* only data can be preallocated */
@@ -4408,14 +4439,17 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
return false;
/* first, try per-file preallocation */
- rcu_read_lock();
- list_for_each_entry_rcu(tmp_pa, &ei->i_prealloc_list, pa_node.inode_list) {
+ read_lock(&ei->i_prealloc_lock);
+ for (iter = ei->i_prealloc_node.rb_node; iter;
+ iter = ext4_mb_pa_rb_next_iter(ac->ac_o_ex.fe_logical, tmp_pa_start, iter)) {
+ tmp_pa = rb_entry(iter, struct ext4_prealloc_space, pa_node.inode_node);
/* all fields in this condition don't change,
* so we can skip locking for them */
tmp_pa_start = tmp_pa->pa_lstart;
tmp_pa_end = tmp_pa->pa_lstart + EXT4_C2B(sbi, tmp_pa->pa_len);
+ /* original request start doesn't lie in this PA */
if (ac->ac_o_ex.fe_logical < tmp_pa_start ||
ac->ac_o_ex.fe_logical >= tmp_pa_end)
continue;
@@ -4438,12 +4472,12 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
ext4_mb_use_inode_pa(ac, tmp_pa);
spin_unlock(&tmp_pa->pa_lock);
ac->ac_criteria = 10;
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);
return true;
}
spin_unlock(&tmp_pa->pa_lock);
}
- rcu_read_unlock();
+ read_unlock(&ei->i_prealloc_lock);
/* can we use group allocation? */
if (!(ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC))
@@ -4596,6 +4630,7 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
{
ext4_group_t grp;
ext4_fsblk_t grp_blk;
+ struct ext4_inode_info *ei = EXT4_I(ac->ac_inode);
/* in this short window concurrent discard can set pa_deleted */
spin_lock(&pa->pa_lock);
@@ -4641,16 +4676,41 @@ static void ext4_mb_put_pa(struct ext4_allocation_context *ac,
ext4_unlock_group(sb, grp);
if (pa->pa_type == MB_INODE_PA) {
- spin_lock(pa->pa_node_lock.inode_lock);
- list_del_rcu(&pa->pa_node.inode_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
+ write_lock(pa->pa_node_lock.inode_lock);
+ rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
+ write_unlock(pa->pa_node_lock.inode_lock);
+ ext4_mb_pa_free(pa);
} else {
spin_lock(pa->pa_node_lock.lg_lock);
list_del_rcu(&pa->pa_node.lg_list);
spin_unlock(pa->pa_node_lock.lg_lock);
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ }
+}
+
+static void ext4_mb_pa_rb_insert(struct rb_root *root, struct rb_node *new)
+{
+ struct rb_node **iter = &root->rb_node, *parent = NULL;
+ struct ext4_prealloc_space *iter_pa, *new_pa;
+ ext4_lblk_t iter_start, new_start;
+
+ while (*iter) {
+ iter_pa = rb_entry(*iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ new_pa = rb_entry(new, struct ext4_prealloc_space,
+ pa_node.inode_node);
+ iter_start = iter_pa->pa_lstart;
+ new_start = new_pa->pa_lstart;
+
+ parent = *iter;
+ if (new_start < iter_start)
+ iter = &((*iter)->rb_left);
+ else
+ iter = &((*iter)->rb_right);
}
- call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ rb_link_node(new, parent, iter);
+ rb_insert_color(new, root);
}
/*
@@ -4716,7 +4776,6 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
pa->pa_len = ac->ac_b_ex.fe_len;
pa->pa_free = pa->pa_len;
spin_lock_init(&pa->pa_lock);
- INIT_LIST_HEAD(&pa->pa_node.inode_list);
INIT_LIST_HEAD(&pa->pa_group_list);
pa->pa_deleted = 0;
pa->pa_type = MB_INODE_PA;
@@ -4736,9 +4795,9 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
list_add(&pa->pa_group_list, &grp->bb_prealloc_list);
- spin_lock(pa->pa_node_lock.inode_lock);
- list_add_rcu(&pa->pa_node.inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
+ write_lock(pa->pa_node_lock.inode_lock);
+ ext4_mb_pa_rb_insert(&ei->i_prealloc_node, &pa->pa_node.inode_node);
+ write_unlock(pa->pa_node_lock.inode_lock);
atomic_inc(&ei->i_prealloc_active);
}
@@ -4904,6 +4963,7 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
struct ext4_prealloc_space *pa, *tmp;
struct list_head list;
struct ext4_buddy e4b;
+ struct ext4_inode_info *ei;
int err;
int free = 0;
@@ -4967,18 +5027,21 @@ ext4_mb_discard_group_preallocations(struct super_block *sb,
list_del_rcu(&pa->pa_node.lg_list);
spin_unlock(pa->pa_node_lock.lg_lock);
} else {
- spin_lock(pa->pa_node_lock.inode_lock);
- list_del_rcu(&pa->pa_node.inode_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
+ write_lock(pa->pa_node_lock.inode_lock);
+ ei = EXT4_I(pa->pa_inode);
+ rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
+ write_unlock(pa->pa_node_lock.inode_lock);
}
- if (pa->pa_type == MB_GROUP_PA)
+ list_del(&pa->u.pa_tmp_list);
+
+ if (pa->pa_type == MB_GROUP_PA) {
ext4_mb_release_group_pa(&e4b, pa);
- else
+ call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ } else {
ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa);
-
- list_del(&pa->u.pa_tmp_list);
- call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ ext4_mb_pa_free(pa);
+ }
}
ext4_unlock_group(sb, group);
@@ -5008,6 +5071,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
ext4_group_t group = 0;
struct list_head list;
struct ext4_buddy e4b;
+ struct rb_node *iter;
int err;
if (!S_ISREG(inode->i_mode)) {
@@ -5030,17 +5094,18 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
repeat:
/* first, collect all pa's in the inode */
- spin_lock(&ei->i_prealloc_lock);
- while (!list_empty(&ei->i_prealloc_list) && needed) {
- pa = list_entry(ei->i_prealloc_list.prev,
- struct ext4_prealloc_space, pa_node.inode_list);
+ write_lock(&ei->i_prealloc_lock);
+ for (iter = rb_first(&ei->i_prealloc_node); iter && needed; iter = rb_next(iter)) {
+ pa = rb_entry(iter, struct ext4_prealloc_space,
+ pa_node.inode_node);
BUG_ON(pa->pa_node_lock.inode_lock != &ei->i_prealloc_lock);
+
spin_lock(&pa->pa_lock);
if (atomic_read(&pa->pa_count)) {
/* this shouldn't happen often - nobody should
* use preallocation while we're discarding it */
spin_unlock(&pa->pa_lock);
- spin_unlock(&ei->i_prealloc_lock);
+ write_unlock(&ei->i_prealloc_lock);
ext4_msg(sb, KERN_ERR,
"uh-oh! used pa while discarding");
WARN_ON(1);
@@ -5051,7 +5116,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
if (pa->pa_deleted == 0) {
ext4_mb_mark_pa_deleted(sb, pa);
spin_unlock(&pa->pa_lock);
- list_del_rcu(&pa->pa_node.inode_list);
+ rb_erase(&pa->pa_node.inode_node, &ei->i_prealloc_node);
list_add(&pa->u.pa_tmp_list, &list);
needed--;
continue;
@@ -5059,7 +5124,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
/* someone is deleting pa right now */
spin_unlock(&pa->pa_lock);
- spin_unlock(&ei->i_prealloc_lock);
+ write_unlock(&ei->i_prealloc_lock);
/* we have to wait here because pa_deleted
* doesn't mean pa is already unlinked from
@@ -5076,7 +5141,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
schedule_timeout_uninterruptible(HZ);
goto repeat;
}
- spin_unlock(&ei->i_prealloc_lock);
+ write_unlock(&ei->i_prealloc_lock);
list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
BUG_ON(pa->pa_type != MB_INODE_PA);
@@ -5108,7 +5173,7 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
put_bh(bitmap_bh);
list_del(&pa->u.pa_tmp_list);
- call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
+ ext4_mb_pa_free(pa);
}
}
@@ -5484,7 +5549,6 @@ static void ext4_mb_trim_inode_pa(struct inode *inode)
static int ext4_mb_release_context(struct ext4_allocation_context *ac)
{
struct inode *inode = ac->ac_inode;
- struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
struct ext4_prealloc_space *pa = ac->ac_pa;
if (pa) {
@@ -5511,16 +5575,6 @@ static int ext4_mb_release_context(struct ext4_allocation_context *ac)
}
}
- if (pa->pa_type == MB_INODE_PA) {
- /*
- * treat per-inode prealloc list as a lru list, then try
- * to trim the least recently used PA.
- */
- spin_lock(pa->pa_node_lock.inode_lock);
- list_move(&pa->pa_node.inode_list, &ei->i_prealloc_list);
- spin_unlock(pa->pa_node_lock.inode_lock);
- }
-
ext4_mb_put_pa(ac, ac->ac_sb, pa);
}
if (ac->ac_bitmap_page)
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 398a6688c341..f8e8ee493867 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -115,7 +115,7 @@ struct ext4_free_data {
struct ext4_prealloc_space {
union {
- struct list_head inode_list; /* for inode PAs */
+ struct rb_node inode_node; /* for inode PA rbtree */
struct list_head lg_list; /* for lg PAs */
} pa_node;
struct list_head pa_group_list;
@@ -132,10 +132,10 @@ struct ext4_prealloc_space {
ext4_grpblk_t pa_free; /* how many blocks are free */
unsigned short pa_type; /* pa type. inode or group */
union {
- spinlock_t *inode_lock; /* locks the inode list holding this PA */
+ rwlock_t *inode_lock; /* locks the rbtree holding this PA */
spinlock_t *lg_lock; /* locks the lg list holding this PA */
} pa_node_lock;
- struct inode *pa_inode; /* hack, for history only */
+ struct inode *pa_inode; /* used to get the inode during group discard */
};
enum {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d733db8a0b02..c75869419621 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1324,9 +1324,9 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
inode_set_iversion(&ei->vfs_inode, 1);
spin_lock_init(&ei->i_raw_lock);
- INIT_LIST_HEAD(&ei->i_prealloc_list);
+ ei->i_prealloc_node = RB_ROOT;
atomic_set(&ei->i_prealloc_active, 0);
- spin_lock_init(&ei->i_prealloc_lock);
+ rwlock_init(&ei->i_prealloc_lock);
ext4_es_init_tree(&ei->i_es_tree);
rwlock_init(&ei->i_es_lock);
INIT_LIST_HEAD(&ei->i_es_list);
--
2.31.1
This commit (determined via bisection) seems to be causing a reliable
failure in the ext4/ext3 configuration when running generic/269:
% kvm-xfstests -c ext4/ext3 generic/269
...
BEGIN TEST ext3 (1 test): Ext4 4k block emulating ext3 Mon Nov 28 21:39:35 EST 2022
DEVICE: /dev/vdd
EXT_MKFS_OPTIONS: -O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_file,^die
EXT_MOUNT_OPTIONS: -o block_validity,nodelalloc
FSTYP -- ext4
PLATFORM -- Linux/x86_64 kvm-xfstests 6.1.0-rc4-xfstests-00018-g1c85d4890f15 #8492
MKFS_OPTIONS -- -F -q -O ^extents,^flex_bg,^uninit_bg,^64bit,^metadata_csum,^huge_filc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity,nodelalloc /dev/vdc /vdc
generic/269 23s ... [21:39:35][ 3.085973] run fstests generic/269 at 2022-11-28 215
[ 14.931680] ------------[ cut here ]------------
[ 14.931902] kernel BUG at fs/ext4/mballoc.c:4025!
[ 14.932137] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 14.932366] CPU: 1 PID: 2677 Comm: fsstress Not tainted 6.1.0-rc4-xfstests-00018-g19
[ 14.932756] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.0-debian-4
[ 14.933169] RIP: 0010:ext4_mb_pa_adjust_overlap.constprop.0+0x18e/0x1c0
[ 14.933457] Code: 66 54 8b 48 54 89 4c 24 04 e8 ae 92 9f 00 41 8b 46 40 85 c0 75 bc4
[ 14.934270] RSP: 0018:ffffc90003aeb868 EFLAGS: 00010283
[ 14.934499] RAX: 0000000000000000 RBX: 00000000000000fc RCX: 0000000000000000
[ 14.934830] RDX: 0000000000000001 RSI: ffffc90003aeb8d4 RDI: 0000000000000001
[ 14.935146] RBP: 0000000000000200 R08: 0000000000008000 R09: 0000000000000001
[ 14.935447] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000103
[ 14.935744] R13: 0000000000000101 R14: ffff8880073370e0 R15: ffff888007337118
[ 14.936043] FS: 00007f94eda0b740(0000) GS:ffff88807dd00000(0000) knlGS:000000000000
[ 14.936390] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.936634] CR2: 000055ba905a0448 CR3: 000000001092c005 CR4: 0000000000770ee0
[ 14.936932] PKRU: 55555554
[ 14.937048] Call Trace:
[ 14.937190] <TASK>
[ 14.937285] ext4_mb_normalize_request.constprop.0+0x1e9/0x440
[ 14.937534] ext4_mb_new_blocks+0x3a2/0x560
[ 14.937715] ext4_alloc_branch+0x21e/0x350
[ 14.937892] ext4_ind_map_blocks+0x322/0x750
[ 14.938076] ext4_map_blocks+0x380/0x6e0
[ 14.938260] _ext4_get_block+0xb2/0x120
[ 14.938426] ext4_block_write_begin+0x13c/0x3d0
[ 14.938624] ? _ext4_get_block+0x120/0x120
[ 14.938801] ext4_write_begin+0x1c1/0x570
[ 14.938973] generic_perform_write+0xcf/0x220
[ 14.939175] ext4_buffered_write_iter+0x84/0x140
[ 14.939377] do_iter_readv_writev+0xf0/0x150
[ 14.939562] do_iter_write+0x80/0x150
[ 14.939722] vfs_writev+0xed/0x1f0
[ 14.939871] do_writev+0x73/0x100
[ 14.940016] do_syscall_64+0x37/0x90
[ 14.940186] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 14.940403] RIP: 0033:0x7f94edb02da3
[ 14.940559] Code: 8b 15 f1 90 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f8
[ 14.941341] RSP: 002b:00007ffc5e82d0d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000014
[ 14.941659] RAX: ffffffffffffffda RBX: 0000000000000036 RCX: 00007f94edb02da3
[ 14.941961] RDX: 0000000000000356 RSI: 000055ba901c1240 RDI: 0000000000000003
[ 14.942290] RBP: 0000000000000003 R08: 000055ba901cf240 R09: 00007f94edbccbe0
[ 14.942596] R10: 0000000000000080 R11: 0000000000000246 R12: 000000000000062a
[ 14.942902] R13: 0000000000000356 R14: 000055ba901c1240 R15: 000000000000b529
[ 14.943219] </TASK>
[ 14.943326] ---[ end trace 0000000000000000 ]---
Looking at the stack trace it looks like we're hitting this BUG_ON:
spin_lock(&tmp_pa->pa_lock);
if (tmp_pa->pa_deleted == 0)
BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
spin_unlock(&tmp_pa->pa_lock);
... in the inline function ext4_mb_pa_assert_overlap(), called from
ext4_mb_pa_adjust_overlap().
The generic/269 test runs fsstress with an ENOSPC hitter as an
antagonist process. The ext3 configuration disables delayed
allocation, which means that fsstress allocates blocks at write time
(instead of at dirty page writeback time).
Could you take a look? Thanks!
- Ted
On Mon, Nov 28, 2022 at 10:06:34PM -0500, Theodore Ts'o wrote:
> This commit (determined via bisecion) seems to be causing a reliable
> failure using the ext4/ext3 configuration when running generic/269:
>
> [ ... test output and stack trace trimmed ... ]
>
> Looking at the stack trace it looks like we're hitting this BUG_ON:
>
> spin_lock(&tmp_pa->pa_lock);
> if (tmp_pa->pa_deleted == 0)
> BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
> spin_unlock(&tmp_pa->pa_lock);
>
> ... in the inline function ext4_mb_pa_assert_overlap(), called from
> ext4_mb_pa_adjust_overlap().
>
> The generic/269 test runs fsstress with an ENOSPC hitter as an
> antagonist process. The ext3 configuration disables delayed
> allocation, which means that fsstress allocates blocks at write
> time (instead of at dirty page writeback time).
>
> Could you take a look? Thanks!
Hi Ted,
Thanks for pointing this out, I'll have a look into this.
PS: I'm on vacation so might be a bit slow to update on this.
Regards,
Ojaswin
>
> - Ted
On Tue, Nov 29, 2022 at 08:07:50PM +0530, Ojaswin Mujoo wrote:
> On Mon, Nov 28, 2022 at 10:06:34PM -0500, Theodore Ts'o wrote:
> > Looking at the stack trace it looks like we're hitting this BUG_ON:
> >
> > spin_lock(&tmp_pa->pa_lock);
> > if (tmp_pa->pa_deleted == 0)
> > BUG_ON(!(start >= tmp_pa_end || end <= tmp_pa_start));
> > spin_unlock(&tmp_pa->pa_lock);
> >
> > ... in the inline function ext4_mb_pa_assert_overlap(), called from
> > ext4_mb_pa_adjust_overlap().
> >
> > The generic/269 test runs fsstress with an ENOSPC hitter as an
> > antagonist process. The ext3 configuration disables delayed
> > allocation, which means that fsstress allocates blocks at write
> > time (instead of at dirty page writeback time).
> >
> > Could you take a look? Thanks!
> Hi Ted,
>
> Thanks for pointing this out, I'll have a look into this.
>
> PS: I'm on vacation so might be a bit slow to update on this.
>
> Regards,
> Ojaswin
> >
> > - Ted
Hi Ted,
Apologies for the delay on this, caused by the new year holidays and
replication trouble. I was unable to replicate the failure at my end
for some time before ultimately reproducing it in gce-xfstests. I
have sent a v3 [1] that fixes the issue by introducing an optimization
for deleted PAs encountered during rbtree traversal in functions like
ext4_mb_pa_adjust_overlap() and ext4_mb_use_preallocated(). Let me
quote the commit message, which explains the issue and the fix we
proposed:
---- snip ----
In ext4_mb_pa_adjust_overlap(), it is possible for an allocating
thread to come across a PA that has been marked deleted via
ext4_mb_discard_group_preallocations() but not yet removed from the
inode PA rbtree. In such a case, we ignore any overlap with this PA
node and simply move forward in the rbtree by comparing the logical
start of the to-be-inserted PA with that of the deleted PA node.
Although this usually doesn't cause an issue and we can move forward
in the tree, in one special case, i.e. if the range of the deleted PA
lies completely inside the normalized range, we might need to traverse
both subtrees under such a deleted PA.
To simplify this special scenario, and also as an optimization,
undelete the PA if the allocating thread determines that it might be
needed in the near future. This results in the following changes:
- ext4_mb_use_preallocated(): Use a deleted PA if original request lies in it.
- ext4_mb_pa_adjust_overlap(): Undelete a PA if it is deleted but there
is an overlap with the normalized range.
- ext4_mb_discard_group_preallocations(): Roll back the delete
  operation if the allocation path undeletes a PA before it is erased
  from the inode PA rbtree.
Since this covers the special case discussed above, we will always
undelete the PA when we encounter it, and we can then adjust for
overlap and traverse the PA rbtree without any issues.
---- snip ----
The above optimization is now straightforward because we completely
lock the rbtree during traversals and modifications. Earlier, with the
list, the locking would have been tricky due to the use of RCU.
[1]
https://lore.kernel.org/linux-ext4/[email protected]/