> On Apr 7, 2021, at 5:16 AM, riteshh <[email protected]> wrote:
>>
>> On 21/04/07 03:01PM, Wen Yang wrote:
>>> From: Wen Yang <[email protected]>
>>>
>>> The kworker has occupied 100% of the CPU for several days:
>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>> 68086 root 20 0 0 0 0 R 100.0 0.0 9718:18 kworker/u64:11
>>>
>>> And the stack obtained through sysrq is as follows:
>>> [20613144.850426] task: ffff8800b5e08000 task.stack: ffffc9001342c000
>>> [20613144.850438] Call Trace:
>>> [20613144.850439]
[<ffffffffa0244209>]ext4_mb_new_blocks+0x429/0x550 [ext4]
>>> [20613144.850439] [<ffffffffa02389ae>]
ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
>>> [20613144.850441] [<ffffffffa0204b52>] ext4_map_blocks+0x172/0x620
[ext4]
>>> [20613144.850442] [<ffffffffa0208675>] ext4_writepages+0x7e5/0xf00
[ext4]
>>> [20613144.850443] [<ffffffff811c487e>] do_writepages+0x1e/0x30
>>> [20613144.850444] [<ffffffff81280265>]
__writeback_single_inode+0x45/0x320
>>> [20613144.850444] [<ffffffff81280ab2>] writeback_sb_inodes+0x272/0x600
>>> [20613144.850445] [<ffffffff81280ed2>] __writeback_inodes_wb+0x92/0xc0
>>> [20613144.850445] [<ffffffff81281238>] wb_writeback+0x268/0x300
>>> [20613144.850446] [<ffffffff812819f4>] wb_workfn+0xb4/0x380
>>> [20613144.850447] [<ffffffff810a5dc9>] process_one_work+0x189/0x420
>>> [20613144.850447] [<ffffffff810a60ae>] worker_thread+0x4e/0x4b0
>>>
>>> The cpu resources of the cloud server are precious, and the server
>>> cannot be restarted after running for a long time, so a configuration
>>> parameter is added to prevent this endless loop.
>>
>> Strange, if there is a endless loop here. Then I would definitely see
>> if there is any accounting problem in pa->pa_count. Otherwise busy=1
>> should not be set everytime. ext4_mb_show_pa() function may help
debug this.
>>
>> If yes, then that means there always exists either a file preallocation
>> or a group preallocation. Maybe it is possible, in some use case.
>> Others may know of such use case, if any.
> If this code is broken, then it doesn't make sense to me that we would
> leave it in the "run forever" state after the patch, and require a sysfs
> tunable to be set to have a properly working system?
> Is there anything particularly strange about the workload/system that
> might cause this? Filesystem is very full, memory is very low, etc?
Hi Ritesh and Andreas,
Thank you for your reply. Since there is still a faulty machine, we have
analyzed it again and found it is indeed a very special case:
crash> struct ext4_group_info ffff8813bb5f72d0
struct ext4_group_info {
bb_state = 0,
bb_free_root = {
rb_node = 0x0
},
bb_first_free = 1681,
bb_free = 0,
bb_fragments = 0,
bb_largest_free_order = -1,
bb_prealloc_list = {
next = 0xffff880268291d78,
prev = 0xffff880268291d78 ---> *** The list is empty
},
alloc_sem = {
count = {
counter = 0
},
wait_list = {
next = 0xffff8813bb5f7308,
prev = 0xffff8813bb5f7308
},
wait_lock = {
raw_lock = {
{
val = {
counter = 0
},
{
locked = 0 '\000',
pending = 0 '\000'
},
{
locked_pending = 0,
tail = 0
}
}
}
},
osq = {
tail = {
counter = 0
}
},
owner = 0x0
},
bb_counters = 0xffff8813bb5f7328
}
crash>
crash> list 0xffff880268291d78 -l ext4_prealloc_space.pa_group_list -s
ext4_prealloc_space.pa_count
ffff880268291d78
pa_count = {
counter = 1 ---> ****pa->pa_count
}
ffff8813bb5f72f0
pa_count = {
counter = -30701
}
crash> struct -xo ext4_prealloc_space
struct ext4_prealloc_space {
[0x0] struct list_head pa_inode_list;
[0x10] struct list_head pa_group_list;
union {
struct list_head pa_tmp_list;
struct callback_head pa_rcu;
[0x20] } u;
[0x30] spinlock_t pa_lock;
[0x34] atomic_t pa_count;
[0x38] unsigned int pa_deleted;
[0x40] ext4_fsblk_t pa_pstart;
[0x48] ext4_lblk_t pa_lstart;
[0x4c] ext4_grpblk_t pa_len;
[0x50] ext4_grpblk_t pa_free;
[0x54] unsigned short pa_type;
[0x58] spinlock_t *pa_obj_lock;
[0x60] struct inode *pa_inode;
}
SIZE: 0x68
crash> rd 0xffff880268291d68 20
ffff880268291d68: ffff881822f8a4c8 ffff881822f8a4c8 ..."......."....
ffff880268291d78: ffff8813bb5f72f0 ffff8813bb5f72f0 .r_......r_.....
ffff880268291d88: 0000000000001000 ffff880db2371000 ..........7.....
ffff880268291d98: 0000000100000001 0000000000000000 ................
ffff880268291da8: 0000000000029c39 0000001700000c41 9.......A.......
ffff880268291db8: ffff000000000016 ffff881822f8a4d8 ..........."....
ffff880268291dc8: ffff881822f8a268 ffff880268291af8 h.."......)h....
ffff880268291dd8: ffff880268291dd0 ffffea00069a07c0 ..)h............
ffff880268291de8: 00000000004d5bb7 0000000000001000 .[M.............
ffff880268291df8: ffff8801a681f000 0000000000000000 ................
Before "goto repeat", it is necessary to check whether
grp->bb_prealloc_list is empty.
This patch may fix it.
Please kindly give us your comments and suggestions.
Thanks.
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 99bf091..8082e2d 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4290,7 +4290,7 @@ static void ext4_mb_new_preallocation(struct
ext4_allocation_context *ac)
free_total += free;
/* if we still need more blocks and some PAs were used, try
again */
- if (free_total < needed && busy) {
+ if (free_total < needed && busy &&
!list_empty(&grp->bb_prealloc_list)) {
ext4_unlock_group(sb, group);
cond_resched();
busy = 0;
--
Best wishes,
Wen
On Apr 8, 2021, at 12:50 PM, Wen Yang <[email protected]> wrote:
>
> Hi Ritesh and Andreas,
>
> Thank you for your reply. Since there is still a faulty machine, we have analyzed it again and found it is indeed a very special case:
>
>
> crash> struct ext4_group_info ffff8813bb5f72d0
> struct ext4_group_info {
> bb_state = 0,
> bb_free_root = {
> rb_node = 0x0
> },
> bb_first_free = 1681,
> bb_free = 0,
> bb_fragments = 0,
> bb_largest_free_order = -1,
> bb_prealloc_list = {
> next = 0xffff880268291d78,
> prev = 0xffff880268291d78 ---> *** The list is empty
> },
> alloc_sem = {
> count = {
> counter = 0
> },
> wait_list = {
> next = 0xffff8813bb5f7308,
> prev = 0xffff8813bb5f7308
> },
> wait_lock = {
> raw_lock = {
> {
> val = {
> counter = 0
> },
> {
> locked = 0 '\000',
> pending = 0 '\000'
> },
> {
> locked_pending = 0,
> tail = 0
> }
> }
> }
> },
> osq = {
> tail = {
> counter = 0
> }
> },
> owner = 0x0
> },
> bb_counters = 0xffff8813bb5f7328
> }
> crash>
>
>
> crash> list 0xffff880268291d78 -l ext4_prealloc_space.pa_group_list -s ext4_prealloc_space.pa_count
> ffff880268291d78
> pa_count = {
> counter = 1 ---> ****pa->pa_count
> }
> ffff8813bb5f72f0
> pa_count = {
> counter = -30701
> }
>
>
> crash> struct -xo ext4_prealloc_space
> struct ext4_prealloc_space {
> [0x0] struct list_head pa_inode_list;
> [0x10] struct list_head pa_group_list;
> union {
> struct list_head pa_tmp_list;
> struct callback_head pa_rcu;
> [0x20] } u;
> [0x30] spinlock_t pa_lock;
> [0x34] atomic_t pa_count;
> [0x38] unsigned int pa_deleted;
> [0x40] ext4_fsblk_t pa_pstart;
> [0x48] ext4_lblk_t pa_lstart;
> [0x4c] ext4_grpblk_t pa_len;
> [0x50] ext4_grpblk_t pa_free;
> [0x54] unsigned short pa_type;
> [0x58] spinlock_t *pa_obj_lock;
> [0x60] struct inode *pa_inode;
> }
> SIZE: 0x68
>
>
> Before "goto repeat", it is necessary to check whether grp->bb_prealloc_list is empty. This patch may fix it.
> Please kindly give us your comments and suggestions.
> Thanks.
This patch definitely looks more reasonable than the previous one, but
I don't think it is correct?
It isn't clear how the system could get into this state, or stay there.
If bb_prealloc_list is empty, then "busy" should be 0, since busy = 1
is only set inside "list_for_each_entry_safe(bb_prealloc_list)", and
that implies the list is *not* empty? The ext4_lock_group() is held
over this code, so the list should not be changing in this case, and
even if the list changed, it _should_ only affect the loop one time.
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 99bf091..8082e2d 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -4290,7 +4290,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
> free_total += free;
>
> /* if we still need more blocks and some PAs were used, try again */
> - if (free_total < needed && busy) {
> + if (free_total < needed && busy && !list_empty(&grp->bb_prealloc_list)) {
> ext4_unlock_group(sb, group);
> cond_resched();
> busy = 0;
> goto repeat;
Minor suggestion - moving "busy = 0" right after "repeat:" and removing
the "int busy = 0" initializer would make this code a bit more clear.
Cheers, Andreas