The kworker has occupied 100% of the CPU for several days:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
68086 root 20 0 0 0 0 R 100.0 0.0 9718:18 kworker/u64:11
The stack obtained via sysrq is as follows:
[20613144.850426] task: ffff8800b5e08000 task.stack: ffffc9001342c000
[20613144.850427] RIP: 0010:[<ffffffffa0240743>]  [<ffffffffa0240743>] ext4_mb_discard_group_preallocations+0x1b3/0x480 [ext4]
...
[20613144.850435] Stack:
[20613144.850435] ffff881942d6a6e8 ffff8813bb5f72d0 00000001a02427cf 0000000000000140
[20613144.850436] ffff880f80618000 0000000000000000 ffffc9001342f770 ffffc9001342f770
[20613144.850437] ffffea0056360dc0 ffff88158d837000 ffffea0045155f80 ffff88114557e000
[20613144.850438] Call Trace:
[20613144.850439] [<ffffffffa0244209>] ext4_mb_new_blocks+0x429/0x550 [ext4]
[20613144.850439] [<ffffffffa02389ae>] ext4_ext_map_blocks+0xb5e/0xf30 [ext4]
[20613144.850440] [<ffffffff811c0001>] ? numa_zonelist_order_handler+0xa1/0x1c0
[20613144.850441] [<ffffffffa0204b52>] ext4_map_blocks+0x172/0x620 [ext4]
[20613144.850441] [<ffffffffa020835d>] ? ext4_writepages+0x4cd/0xf00 [ext4]
[20613144.850442] [<ffffffffa0208675>] ext4_writepages+0x7e5/0xf00 [ext4]
[20613144.850442] [<ffffffff811c1410>] ? wb_position_ratio+0x1f0/0x1f0
[20613144.850443] [<ffffffff811c487e>] do_writepages+0x1e/0x30
[20613144.850444] [<ffffffff81280265>] __writeback_single_inode+0x45/0x320
[20613144.850444] [<ffffffff81280ab2>] writeback_sb_inodes+0x272/0x600
[20613144.850445] [<ffffffff81280ed2>] __writeback_inodes_wb+0x92/0xc0
[20613144.850445] [<ffffffff81281238>] wb_writeback+0x268/0x300
[20613144.850446] [<ffffffff812819f4>] wb_workfn+0xb4/0x380
[20613144.850447] [<ffffffff810a5dc9>] process_one_work+0x189/0x420
[20613144.850447] [<ffffffff810a60ae>] worker_thread+0x4e/0x4b0
[20613144.850448] [<ffffffff810a6060>] ? process_one_work+0x420/0x420
[20613144.850448] [<ffffffff810ac696>] kthread+0xe6/0x100
[20613144.850449] [<ffffffff810ac5b0>] ? kthread_park+0x60/0x60
[20613144.850450] [<ffffffff81741dd9>] ret_from_fork+0x39/0x50
The thread that holds a reference to this pa has been waiting for I/O to complete:
PID: 15140 TASK: ffff88004d6dc300 CPU: 16 COMMAND: "kworker/u64:1"
[ffffc900273e7518] __schedule at ffffffff8173ca3b
[ffffc900273e75a0] schedule at ffffffff8173cfb6
[ffffc900273e75b8] io_schedule at ffffffff810bb75a
[ffffc900273e75e0] bit_wait_io at ffffffff8173d8d1
[ffffc900273e75f8] __wait_on_bit_lock at ffffffff8173d4e9
[ffffc900273e7638] out_of_line_wait_on_bit_lock at ffffffff8173d742
[ffffc900273e76b0] __lock_buffer at ffffffff81288c32
[ffffc900273e76c8] do_get_write_access at ffffffffa00dd177 [jbd2]
[ffffc900273e7728] jbd2_journal_get_write_access at ffffffffa00dd3a3 [jbd2]
[ffffc900273e7750] __ext4_journal_get_write_access at ffffffffa023b37b [ext4]
[ffffc900273e7788] ext4_mb_mark_diskspace_used at ffffffffa0242a0b [ext4]
[ffffc900273e77f0] ext4_mb_new_blocks at ffffffffa0244100 [ext4]
[ffffc900273e7860] ext4_ext_map_blocks at ffffffffa02389ae [ext4]
[ffffc900273e7950] ext4_map_blocks at ffffffffa0204b52 [ext4]
[ffffc900273e79d0] ext4_writepages at ffffffffa0208675 [ext4]
[ffffc900273e7b30] do_writepages at ffffffff811c487e
[ffffc900273e7b40] __writeback_single_inode at ffffffff81280265
[ffffc900273e7b90] writeback_sb_inodes at ffffffff81280ab2
[ffffc900273e7c90] __writeback_inodes_wb at ffffffff81280ed2
[ffffc900273e7cd8] wb_writeback at ffffffff81281238
[ffffc900273e7d80] wb_workfn at ffffffff812819f4
[ffffc900273e7e18] process_one_work at ffffffff810a5dc9
[ffffc900273e7e60] worker_thread at ffffffff810a60ae
[ffffc900273e7ec0] kthread at ffffffff810ac696
[ffffc900273e7f50] ret_from_fork at ffffffff81741dd9
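The two kworkers are tied together by the pa's reference count.
Roughly (simplified from fs/ext4/mballoc.c, not the verbatim code):

	/* ext4_mb_use_preallocated(): the blocked kworker above took this
	 * reference when it decided to allocate from the pa ... */
	atomic_inc(&pa->pa_count);

	/* ... and it is only dropped via ext4_mb_put_pa() from
	 * ext4_mb_release_context(), i.e. after ext4_mb_mark_diskspace_used()
	 * has returned -- which it never does here, because the journal
	 * buffer lock never becomes available. */

	/* ext4_mb_discard_group_preallocations(): meanwhile the spinning
	 * kworker keeps seeing the pinned pa and retries forever */
	if (atomic_read(&pa->pa_count)) {
		spin_unlock(&pa->pa_lock);
		busy = 1;
		continue;
	}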
On our bare-metal servers we use multiple disks: the Linux kernel runs
on the system disk, while business programs run on several disks
virtualized by the BM hypervisor. The reason the I/O never completes
here is that the process handling I/O in the BM hypervisor has failed.
CPU resources on a cloud server are precious, and a server that has
been running for a long time cannot simply be restarted. This patch
therefore makes a small optimization to keep the CPU from being pinned
at 100%.
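Condensed, the retry path in ext4_mb_discard_group_preallocations()
looks roughly like this after the change (simplified; the diff below
is authoritative):

repeat:
	busy = 0;
	free = 0;
	ext4_lock_group(sb, group);
	/* walk grp->bb_prealloc_list: a pa whose pa_count is still held is
	 * skipped (busy = 1), everything else is collected and freed */
	free_total += free;

	/* if we still need more blocks and some PAs were used, try again */
	if (free_total < needed && busy) {
		ext4_unlock_group(sb, group);
		/* sleep ~10ms (HZ/100) instead of just cond_resched(), so a
		 * pa pinned by a task stuck on I/O no longer turns this loop
		 * into a 100% CPU spin */
		schedule_timeout_uninterruptible(HZ/100);
		goto repeat;
	}
	ext4_unlock_group(sb, group);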
Signed-off-by: Wen Yang <[email protected]>
Cc: "Theodore Ts'o" <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Ritesh Harjani <[email protected]>
Cc: Baoyou Xie <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
fs/ext4/mballoc.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index a02fadf..c73f212 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -351,6 +351,8 @@ static void ext4_mb_generate_from_freelist(struct super_block *sb, void *bitmap,
ext4_group_t group);
static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac);
+static inline void ext4_mb_show_pa(struct super_block *sb);
+
/*
* The algorithm using this percpu seq counter goes below:
* 1. We sample the percpu discard_pa_seq counter before trying for block
@@ -4217,9 +4219,9 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
struct ext4_prealloc_space *pa, *tmp;
struct list_head list;
struct ext4_buddy e4b;
+ int free_total = 0;
+ int busy, free;
int err;
- int busy = 0;
- int free, free_total = 0;
mb_debug(sb, "discard preallocation for group %u\n", group);
if (list_empty(&grp->bb_prealloc_list))
@@ -4247,6 +4249,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
INIT_LIST_HEAD(&list);
repeat:
+ busy = 0;
free = 0;
ext4_lock_group(sb, group);
list_for_each_entry_safe(pa, tmp,
@@ -4255,6 +4258,8 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
if (atomic_read(&pa->pa_count)) {
spin_unlock(&pa->pa_lock);
busy = 1;
+ mb_debug(sb, "used pa while discarding for group %u\n", group);
+ ext4_mb_show_pa(sb);
continue;
}
if (pa->pa_deleted) {
@@ -4299,8 +4304,7 @@ static void ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
/* if we still need more blocks and some PAs were used, try again */
if (free_total < needed && busy) {
ext4_unlock_group(sb, group);
- cond_resched();
- busy = 0;
+ schedule_timeout_uninterruptible(HZ/100);
goto repeat;
}
ext4_unlock_group(sb, group);
--
1.8.3.1
On Sun, Apr 18, 2021 at 06:28:34PM +0800, Wen Yang wrote:
> The kworker has occupied 100% of the CPU for several days:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 68086 root 20 0 0 0 0 R 100.0 0.0 9718:18 kworker/u64:11
>
> ....
>
> The thread that holds a reference to this pa has been waiting for I/O to complete:
> PID: 15140 TASK: ffff88004d6dc300 CPU: 16 COMMAND: "kworker/u64:1"
> [ffffc900273e7518] __schedule at ffffffff8173ca3b
> [ffffc900273e75a0] schedule at ffffffff8173cfb6
> [ffffc900273e75b8] io_schedule at ffffffff810bb75a
> [ffffc900273e75e0] bit_wait_io at ffffffff8173d8d1
> [ffffc900273e75f8] __wait_on_bit_lock at ffffffff8173d4e9
> [ffffc900273e7638] out_of_line_wait_on_bit_lock at ffffffff8173d742
> [ffffc900273e76b0] __lock_buffer at ffffffff81288c32
> [ffffc900273e76c8] do_get_write_access at ffffffffa00dd177 [jbd2]
> [ffffc900273e7728] jbd2_journal_get_write_access at ffffffffa00dd3a3 [jbd2]
> [ffffc900273e7750] __ext4_journal_get_write_access at ffffffffa023b37b [ext4]
> [ffffc900273e7788] ext4_mb_mark_diskspace_used at ffffffffa0242a0b [ext4]
> [ffffc900273e77f0] ext4_mb_new_blocks at ffffffffa0244100 [ext4]
> [ffffc900273e7860] ext4_ext_map_blocks at ffffffffa02389ae [ext4]
> [ffffc900273e7950] ext4_map_blocks at ffffffffa0204b52 [ext4]
> [ffffc900273e79d0] ext4_writepages at ffffffffa0208675 [ext4]
> [ffffc900273e7b30] do_writepages at ffffffff811c487e
> [ffffc900273e7b40] __writeback_single_inode at ffffffff81280265
> [ffffc900273e7b90] writeback_sb_inodes at ffffffff81280ab2
> [ffffc900273e7c90] __writeback_inodes_wb at ffffffff81280ed2
> [ffffc900273e7cd8] wb_writeback at ffffffff81281238
> [ffffc900273e7d80] wb_workfn at ffffffff812819f4
> [ffffc900273e7e18] process_one_work at ffffffff810a5dc9
> [ffffc900273e7e60] worker_thread at ffffffff810a60ae
> [ffffc900273e7ec0] kthread at ffffffff810ac696
> [ffffc900273e7f50] ret_from_fork at ffffffff81741dd9
>
> On our bare-metal servers we use multiple disks: the Linux kernel runs
> on the system disk, while business programs run on several disks
> virtualized by the BM hypervisor. The reason the I/O never completes
> here is that the process handling I/O in the BM hypervisor has failed.
So if the I/O has not returned for several days, such that this thread
has been hanging for that long, it also follows that, since it was
calling do_get_write_access(), a handle was open.  And if a handle is
open, then the current jbd2 transaction would never close --- which
means none of the file system operations executed over the past few
days would ever commit, and they would all be undone on the next
reboot.  Furthermore, sooner or later the journal would run out of
space, at which point the *entire* system would lock up waiting for
the transaction to close.
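To spell out the mechanics, the writeback path in your trace is
essentially doing the following (simplified, error handling elided;
these are the jbd2 calls behind the ext4_journal_* wrappers):

	handle_t *handle;
	int err;

	handle = jbd2_journal_start(journal, nblocks);
				/* joins the running transaction and bumps
				 * its update count */

	err = jbd2_journal_get_write_access(handle, bh);
				/* <-- stuck here on the buffer lock in
				 * your trace */

	jbd2_journal_stop(handle);
				/* never reached, so the update count never
				 * drops to zero and the transaction can
				 * never commit */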
I'm guessing that if the server hadn't come to a full livelock
earlier, it's because there aren't that many metadata operations
happening in the server's steady-state operation.  But in any case,
this particular server was/is(?) doomed, and none of the patches that
you proposed are going to help in the long run.  The correct fix is to
fix the hypervisor, which is the root cause of the problem.
I could imagine some kind of retry counter, where we start sleeping
after some number of retries, and give up after some larger number of
retries (at which point the allocation would fail with ENOSPC). We'd
need to do some testing against our current tests which test how we
handle running close to ENOSPC, and I'm not at all convinced it's
worth the effort in the end. We're trying to (slightly) improve the
case where (a) the file system is running close to full, (b) the
hypervisor is critically flawed and is the real problem, and (c) the
VM is eventually doomed to fail anyway due to a transaction never
closing due to an I/O never getting acknowledged for days(!).
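Just to illustrate the shape of what I mean (completely untested; the
thresholds are arbitrary, the label is illustrative, and the ENOSPC
plumbing is hand-waved):

	int retries = 0;
	/* ... */
repeat:
	/* ... */
	/* if we still need more blocks and some PAs were used, try again */
	if (free_total < needed && busy) {
		ext4_unlock_group(sb, group);
		if (++retries > 1024)
			goto out;	/* give up; the caller will
					 * eventually surface ENOSPC */
		if (retries > 16)
			schedule_timeout_uninterruptible(HZ/100);
		else
			cond_resched();
		busy = 0;
		goto repeat;
	}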
If you really want to fix things in the guest OS, perhaps the
virtio_scsi driver (or whatever I/O driver you are using) should
notice when an I/O request hasn't been acknowledged after minutes or
hours, and do something such as forcing a SCSI reset (which will
result in the file system needing to be unmounted and recovered, but
due to the hypervisor bug, that was an inevitable end result anyway).
Cheers,
- Ted
On 2021/4/19 at 12:06 PM, Theodore Ts'o wrote:
> On Sun, Apr 18, 2021 at 06:28:34PM +0800, Wen Yang wrote:
>> The kworker has occupied 100% of the CPU for several days:
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 68086 root 20 0 0 0 0 R 100.0 0.0 9718:18 kworker/u64:11
>>
>> ....
>>
>> The thread that holds a reference to this pa has been waiting for I/O to complete:
>> PID: 15140 TASK: ffff88004d6dc300 CPU: 16 COMMAND: "kworker/u64:1"
>> [ffffc900273e7518] __schedule at ffffffff8173ca3b
>> [ffffc900273e75a0] schedule at ffffffff8173cfb6
>> [ffffc900273e75b8] io_schedule at ffffffff810bb75a
>> [ffffc900273e75e0] bit_wait_io at ffffffff8173d8d1
>> [ffffc900273e75f8] __wait_on_bit_lock at ffffffff8173d4e9
>> [ffffc900273e7638] out_of_line_wait_on_bit_lock at ffffffff8173d742
>> [ffffc900273e76b0] __lock_buffer at ffffffff81288c32
>> [ffffc900273e76c8] do_get_write_access at ffffffffa00dd177 [jbd2]
>> [ffffc900273e7728] jbd2_journal_get_write_access at ffffffffa00dd3a3 [jbd2]
>> [ffffc900273e7750] __ext4_journal_get_write_access at ffffffffa023b37b [ext4]
>> [ffffc900273e7788] ext4_mb_mark_diskspace_used at ffffffffa0242a0b [ext4]
>> [ffffc900273e77f0] ext4_mb_new_blocks at ffffffffa0244100 [ext4]
>> [ffffc900273e7860] ext4_ext_map_blocks at ffffffffa02389ae [ext4]
>> [ffffc900273e7950] ext4_map_blocks at ffffffffa0204b52 [ext4]
>> [ffffc900273e79d0] ext4_writepages at ffffffffa0208675 [ext4]
>> [ffffc900273e7b30] do_writepages at ffffffff811c487e
>> [ffffc900273e7b40] __writeback_single_inode at ffffffff81280265
>> [ffffc900273e7b90] writeback_sb_inodes at ffffffff81280ab2
>> [ffffc900273e7c90] __writeback_inodes_wb at ffffffff81280ed2
>> [ffffc900273e7cd8] wb_writeback at ffffffff81281238
>> [ffffc900273e7d80] wb_workfn at ffffffff812819f4
>> [ffffc900273e7e18] process_one_work at ffffffff810a5dc9
>> [ffffc900273e7e60] worker_thread at ffffffff810a60ae
>> [ffffc900273e7ec0] kthread at ffffffff810ac696
>> [ffffc900273e7f50] ret_from_fork at ffffffff81741dd9
>>
>> On our bare-metal servers we use multiple disks: the Linux kernel runs
>> on the system disk, while business programs run on several disks
>> virtualized by the BM hypervisor. The reason the I/O never completes
>> here is that the process handling I/O in the BM hypervisor has failed.
>
> So if the I/O has not returned for several days, such that this thread
> has been hanging for that long, it also follows that, since it was
> calling do_get_write_access(), a handle was open.  And if a handle is
> open, then the current jbd2 transaction would never close --- which
> means none of the file system operations executed over the past few
> days would ever commit, and they would all be undone on the next
> reboot.  Furthermore, sooner or later the journal would run out of
> space, at which point the *entire* system would lock up waiting for
> the transaction to close.
>
> I'm guessing that if the server hadn't come to a full livelock
> earlier, it's because there aren't that many metadata operations
> happening in the server's steady-state operation.  But in any case,
> this particular server was/is(?) doomed, and none of the patches that
> you proposed are going to help in the long run.  The correct fix is to
> fix the hypervisor, which is the root cause of the problem.
>
Yes, in the end the whole system was affected, as shown below:
crash> ps | grep UN
281 2 16 ffff881fb011c300 UN 0.0 0 0 [kswapd_0]
398 358 9 ffff880084094300 UN 0.0 30892 2592 systemd-journal
......
2093 358 28 ffff880012d2c300 UN 0.0 241676 15108 syslog-ng
2119 358 0 ffff88005a252180 UN 0.0 124340 3148 crond
......
PID: 281 TASK: ffff881fb011c300 CPU: 16 COMMAND: "kswapd_0"
#0 [ffffc9000d7af7e0] __schedule at ffffffff8173ca3b
#1 [ffffc9000d7af868] schedule at ffffffff8173cfb6
#2 [ffffc9000d7af880] wait_transaction_locked at ffffffffa00db08a [jbd2]
#3 [ffffc9000d7af8d8] add_transaction_credits at ffffffffa00db2c0 [jbd2]
#4 [ffffc9000d7af938] start_this_handle at ffffffffa00db64f [jbd2]
#5 [ffffc9000d7af9c8] jbd2__journal_start at ffffffffa00dbe3e [jbd2]
#6 [ffffc9000d7afa18] __ext4_journal_start_sb at ffffffffa023b0dd [ext4]
#7 [ffffc9000d7afa58] ext4_release_dquot at ffffffffa02202f2 [ext4]
#8 [ffffc9000d7afa78] dqput at ffffffff812b9bef
#9 [ffffc9000d7afaa0] __dquot_drop at ffffffff812b9eaf
#10 [ffffc9000d7afad8] dquot_drop at ffffffff812b9f22
#11 [ffffc9000d7afaf0] ext4_clear_inode at ffffffffa02291f2 [ext4]
#12 [ffffc9000d7afb08] ext4_evict_inode at ffffffffa020a939 [ext4]
#13 [ffffc9000d7afb28] evict at ffffffff8126d05a
#14 [ffffc9000d7afb50] dispose_list at ffffffff8126d16b
#15 [ffffc9000d7afb78] prune_icache_sb at ffffffff8126e2ba
#16 [ffffc9000d7afbb0] super_cache_scan at ffffffff8125320e
#17 [ffffc9000d7afc08] shrink_slab at ffffffff811cab55
#18 [ffffc9000d7afce8] shrink_node at ffffffff811d000e
#19 [ffffc9000d7afd88] balance_pgdat at ffffffff811d0f42
#20 [ffffc9000d7afe58] kswapd at ffffffff811d14f1
#21 [ffffc9000d7afec0] kthread at ffffffff810ac696
#22 [ffffc9000d7aff50] ret_from_fork at ffffffff81741dd9
> I could imagine some kind of retry counter, where we start sleeping
> after some number of retries, and give up after some larger number of
> retries (at which point the allocation would fail with ENOSPC). We'd
> need to do some testing against our current tests which test how we
> handle running close to ENOSPC, and I'm not at all convinced it's
> worth the effort in the end. We're trying to (slightly) improve the
> case where (a) the file system is running close to full, (b) the
> hypervisor is critically flawed and is the real problem, and (c) the
> VM is eventually doomed to fail anyway due to a transaction never
> closing due to an I/O never getting acknowledged for days(!).
>
Great. If you make any progress, we will be happy to test it in our
production environment. We also look forward to working together to
optimize this.
> If you really want to fix things in the guest OS, perhaps the
> virtio_scsi driver (or whatever I/O driver you are using) should
> notice when an I/O request hasn't been acknowledged after minutes or
> hours, and do something such as forcing a SCSI reset (which will
> result in the file system needing to be unmounted and recovered, but
> due to the hypervisor bug, that was an inevitable end result anyway).
>
Yes, but unfortunately that may not be finished in a short time.
We may refer to the QEMU community's documentation here:
https://wiki.qemu.org/ToDo/Block
Add a cancel command to the virtio-blk device so that running requests
can be aborted. This requires changing the VIRTIO spec, extending QEMU's
device emulation, and implementing blk_mq_ops->timeout() in Linux
virtio_blk.ko. This task depends on first implementing real request
cancellation in QEMU.
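For reference, the Linux-side hook would presumably end up looking
something like this (purely illustrative; virtio_blk has no .timeout
handler today, and without the cancel support above it could only warn
or escalate to a full device reset):

	static enum blk_eh_timer_return virtblk_timeout(struct request *req,
							bool reserved)
	{
		struct virtio_blk *vblk = req->q->queuedata;

		dev_warn(&vblk->vdev->dev,
			 "request not completed by host, still waiting\n");

		/* There is currently no way to abort the request, so keep
		 * the timer running (or trigger a reset/recovery path). */
		return BLK_EH_RESET_TIMER;
	}

	/* ... hooked up via .timeout = virtblk_timeout in virtio_mq_ops */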
--
Best wishes,
Wen