2021-12-10 03:30:17

by Dexuan Cui

Subject: Random high CPU utilization in blk-mq with the none scheduler

Hi all,
I found a random high CPU utilization issue with some database benchmark
program running on a 192-CPU virtual machine (VM). Originally the issue
was found with RHEL 8.4 and Ubuntu 20.04, and further tests show that the
issue also reproduces with the latest upstream stable kernel v5.15.7, but
*not* with v5.16-rc1. It looks like someone resolved the issue in v5.16-rc1
recently?

Details:
The database benchmark test starts 1000+ processes, but it is not
CPU-intensive. The processes access 24 iSCSI-backed disks (multipath is
used; no local disk is used). Normally the overall CPU
utilization of the 192 CPUs should be 2~5%, including %user (less than 1%),
%sys (2~5%), and %si (less than 1%). The iSCSI TCP traffic is not heavy and
the packets-per-second (PPS) rate is not high. By default the "none" I/O
scheduler is used in the test.
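For reference, the active scheduler can be inspected and changed per disk
via sysfs; a minimal sketch (the device name "sda" is an example --
substitute the actual iSCSI/multipath disks):

```shell
# Show the available schedulers; the active one is in brackets,
# e.g. "[none] mq-deadline kyber bfq".
cat /sys/block/sda/queue/scheduler

# Select the "none" scheduler at runtime (requires root).
echo none > /sys/block/sda/queue/scheduler
```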

The issue is: every 1~2 minutes, there is a >90% CPU utilization spike
that lasts 5~20 seconds, and the system's performance drops
significantly. When the CPU utilization spike is happening, *each* of
the 192 CPUs' %sys goes up to 60+%, and the %si goes up to 30%!
Sometimes if I'm not lucky, the >90% CPU utilization happens all the time.

This is not a NIC driver issue because the NIC interrupt rate is low due
to NAPI, the PPS rate is also low (only about 250K/s), and the network
traffic is light (typically only 2~5 Gbps; the limit for the VM is
30 Gbps, and the NIC driver itself can easily reach a throughput of
20+ Gbps when the database benchmark is not running).

The NIC driver uses 31 queues by default. When the CPU utilization spike
is happening, if I configure the NIC to use only 1 queue by
"ethtool -L eth0 combined 1", the CPU utilization goes down to normal
immediately, and the network throughput and PPS rate go up; if I use 2
queues, the CPU utilization is also abnormal (it consistently stays
high -- see [1]), though not as high as in the 31-queue case. In the
case of [1], the iSCSI TCP packets are processed on CPU 7 and CPU 30
("cat /proc/irq/57/smp_affinity_list" returns "7", and
"cat /proc/irq/58/smp_affinity_list" returns "30"), to which the NIC's
MSI-X interrupts (IRQ 57 and 58) are bound. If I
check the hottest functions on CPU 7 and CPU 30, e.g.
by "perf top -C 7 --call-graph dwarf" (see [2]), it looks like the code path
blk_done_softirq() -> ... -> blk_mq_sched_restart() -> ...
-> kblockd_mod_delayed_work_on() -> ... -> try_to_grab_pending() is
even hotter than the NIC TX/RX handler and the TCP/IP stack! Since the
other 190 CPUs also have high %sys and %si (see [3]), I check
"perf top -C 0 --call-graph dwarf" (see [4]) and find that the same
blk_done_softirq() -> ... -> try_to_grab_pending() code path is very hot.
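For anyone trying to reproduce the observations above, the commands
involved are roughly the following (the IRQ numbers 57/58 are specific to
this VM, and the /proc/interrupts naming varies by NIC driver):

```shell
# Reduce the NIC to a single combined queue (requires root).
ethtool -L eth0 combined 1

# Find the NIC's MSI-X IRQ numbers (first column) and their CPU affinity.
grep -i mlx5 /proc/interrupts
cat /proc/irq/57/smp_affinity_list   # returned "7" here
cat /proc/irq/58/smp_affinity_list   # returned "30" here

# Profile the hottest kernel code paths on one of those CPUs for 30s.
perf top -C 7 --call-graph dwarf -d 30
```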

My feeling is that the iSCSI hardware queues are experiencing the
transition "not full -> full -> not full -> full -> ..." very frequently,
so the blk-mq softirq handler is called very frequently, trying to submit
new I/O requests. try_to_grab_pending() acquires and releases the pool->lock
spinlock, so I suspect there may be a lock contention issue. If I change
the NIC's second IRQ's affinity from CPU 30 to CPU 7 as well (i.e. both
NIC queues are handled by the same CPU 7) by
"echo 7 > /proc/irq/58/smp_affinity_list", the high CPU utilization issue goes
away immediately (i.e. the %sys and %si numbers on each CPU other than
CPU 7 drop immediately to normal, i.e. %sys = 1.9%, %si = 0.6%);
if I change the second IRQ's affinity back to CPU 30, the high CPU utilization
returns immediately. Any insights into this significant difference between
blk_done_softirq() running on 1 CPU vs. on 2 CPUs? Note that when both NIC
IRQs are handled on the same CPU, the network throughput and PPS rate are
actually better.

When the high CPU utilization issue is happening, if I change the I/O
scheduler from "none" to "mq-deadline", the issue also goes
away immediately (%sys = 2.7%, %si = 1.4%; this is slightly higher than the
case where "none" is used and both NIC IRQs are handled on the same CPU).
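To rule out per-disk variation, the scheduler switch can be applied to
every disk in one go; a sketch, assuming the benchmark disks match the
sd*/dm-* globs:

```shell
# Switch every SCSI disk and dm (multipath) device to mq-deadline.
# Requires root; adjust the globs to the disks the benchmark uses.
for q in /sys/block/sd*/queue/scheduler /sys/block/dm-*/queue/scheduler; do
    [ -w "$q" ] && echo mq-deadline > "$q"
done
```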

It's unclear why the issue doesn't reproduce with v5.16-rc1. The kernel
difference from v5.15.7 (the bad one) is huge (1+ million lines of changes),
and a weird file system hang issue (with a kernel built from a git commit in
between, I/O requests on the iSCSI disks cannot complete at all -- see [5])
prevents me from doing git-bisect. There are about 100 patches related to
blk-mq, and at first glance I have not found anything that explains why the
issue does not reproduce with v5.16-rc1.

Thanks,
-- Dexuan


[1] A snippet of the "top" output (v5.15.7 stable kernel; the NIC uses 2 queues)

top - 01:49:14 up 31 min, 2 users, load average: 1052.88, 999.40, 706.60
Tasks: 2862 total, 207 running, 2654 sleeping, 1 stopped, 0 zombie
%Cpu(s): 0.1 us, 35.2 sy, 0.0 ni, 1.1 id, 49.6 wa, 0.0 hi, 14.0 si, 0.0 st
MiB Mem : 2064232.+total, 2053872.+free, 9031.1 used, 1328.6 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 2049269.+avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
195 root 20 0 0 0 0 R 92.7 0.0 13:37.47 ksoftirqd/30
57 root 20 0 0 0 0 R 91.4 0.0 14:22.18 ksoftirqd/7
11649 root 20 0 18624 3544 1536 R 70.3 0.0 9:02.73 GenerateIO.Linu
11654 root 20 0 18624 3544 1536 R 69.8 0.0 9:05.26 GenerateIO.Linu
11650 root 20 0 18624 3544 1536 R 69.0 0.0 9:00.95 GenerateIO.Linu
11655 root 20 0 18624 3544 1536 R 68.9 0.0 8:39.59 GenerateIO.Linu
11652 root 20 0 18624 3544 1536 R 68.8 0.0 8:37.07 GenerateIO.Linu
11656 root 20 0 18624 3544 1536 R 68.5 0.0 8:40.68 GenerateIO.Linu
11651 root 20 0 18624 3544 1536 R 68.2 0.0 8:38.72 GenerateIO.Linu
11653 root 20 0 18624 3544 1536 S 67.5 0.0 8:39.97 GenerateIO.Linu
869 root 20 0 0 0 0 S 10.2 0.0 1:32.12 ksoftirqd/142
580 root 20 0 0 0 0 R 10.1 0.0 1:28.62 ksoftirqd/94
629 root 20 0 0 0 0 S 10.1 0.0 1:28.80 ksoftirqd/102
436 root 20 0 0 0 0 S 10.0 0.0 1:27.72 ksoftirqd/70
556 root 20 0 0 0 0 S 10.0 0.0 1:29.44 ksoftirqd/90
316 root 20 0 0 0 0 S 9.9 0.0 1:29.45 ksoftirqd/50
592 root 20 0 0 0 0 S 9.9 0.0 1:32.97 ksoftirqd/96
990 root 20 0 0 0 0 R 9.8 0.0 1:29.08 ksoftirqd/162
...

[2] "perf top -C 7 --call-graph dwarf -d 30" (only the first lines of the output are shown here)

Samples: 1M of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 246356121644 lost: 0/47 drop: 0/0
Children Self Shared Object Symbol
- 98.01% 0.03% [kernel] [k] __softirqentry_text_start
- 12.10% __softirqentry_text_start
- 7.10% blk_done_softirq
- 7.22% blk_complete_reqs
- 6.52% dm_softirq_done
- 6.55% dm_done
- 6.52% blk_mq_end_request
- 6.53% blk_mq_free_request
- 6.49% __blk_mq_free_request
- 6.48% blk_mq_sched_restart
- 6.46% blk_mq_run_hw_queue
- 6.46% __blk_mq_delay_run_hw_queue
- 7.20% kblockd_mod_delayed_work_on
- 51.73% mod_delayed_work_on
try_to_grab_pending
- 4.89% net_rx_action
- 4.90% __napi_poll
- 3.28% mlx5e_napi_poll
- 3.03% mlx5e_poll_rx_cq
- 3.85% mlx5e_handle_rx_cqe
- 2.65% napi_gro_receive
- 2.54% dev_gro_receive
- 1.78% napi_gro_complete.constprop.0.isra.0
- 1.80% netif_receive_skb_list_internal
- 1.82% __netif_receive_skb_list_core
- 1.76% ip_list_rcv
- 1.77% ip_sublist_rcv
- 1.67% ip_sublist_rcv_finish
- 1.70% ip_local_deliver
- 1.61% ip_local_deliver_finish
- 1.71% ip_protocol_deliver_rcu
- 1.81% tcp_v4_rcv
- 1.63% tcp_v4_do_rcv
- 1.73% tcp_rcv_established
- 0.98% tcp_data_ready
- 0.97% iscsi_sw_tcp_data_ready
- 0.99% tcp_read_sock
- 2.00% iscsi_sw_tcp_recv
- 2.01% iscsi_tcp_recv_skb
0.99% memcpy_erms
- 1.34% mlx5e_skb_from_cqe_linear
- 0.86% mlx5e_poll_tx_cq
- 1.93% napi_gro_flush
- 1.95% napi_gro_complete.constprop.0.isra.0
- 1.95% netif_receive_skb_list_internal
- 1.98% __netif_receive_skb_list_core
- 1.92% ip_list_rcv
- 1.93% ip_sublist_rcv
- 1.84% ip_sublist_rcv_finish
- 1.87% ip_local_deliver
- 1.78% ip_local_deliver_finish
- 1.89% ip_protocol_deliver_rcu
- 1.97% tcp_v4_rcv
- 1.82% tcp_v4_do_rcv
- 1.93% tcp_rcv_established
- 1.10% tcp_data_ready
- 1.10% iscsi_sw_tcp_data_ready
- 1.12% tcp_read_sock
- 2.59% iscsi_sw_tcp_recv
- 2.64% iscsi_tcp_recv_skb
0.78% iscsi_tcp_segment_done
1.38% memcpy_erms
+ 92.61% 0.01% [kernel] [k] smpboot_thread_fn
+ 92.58% 0.01% [kernel] [k] run_ksoftirqd
+ 56.55% 0.00% [kernel] [k] blk_done_softirq
+ 56.53% 0.02% [kernel] [k] blk_complete_reqs
+ 52.56% 0.00% [kernel] [k] __blk_mq_delay_run_hw_queue
+ 52.55% 0.01% [kernel] [k] blk_mq_run_hw_queue
+ 52.54% 0.00% [kernel] [k] kblockd_mod_delayed_work_on
+ 52.54% 0.87% [kernel] [k] mod_delayed_work_on
+ 52.10% 0.00% [kernel] [k] dm_softirq_done
- 52.10% 0.01% [kernel] [k] dm_done
- 6.54% dm_done
- 6.52% blk_mq_end_request
- 6.53% blk_mq_free_request
- 6.49% __blk_mq_free_request
- 6.48% blk_mq_sched_restart
- 6.46% blk_mq_run_hw_queue
- 6.46% __blk_mq_delay_run_hw_queue
- 7.20% kblockd_mod_delayed_work_on
- 51.73% mod_delayed_work_on
try_to_grab_pending
+ 51.97% 0.06% [kernel] [k] blk_mq_free_request
+ 51.88% 0.00% [kernel] [k] blk_mq_end_request
+ 51.88% 0.04% [kernel] [k] __blk_mq_free_request
+ 51.77% 0.02% [kernel] [k] blk_mq_sched_restart
- 51.67% 51.64% [kernel] [k] try_to_grab_pending
- 6.34% ret_from_fork
- kthread
- 6.06% smpboot_thread_fn
- 6.05% run_ksoftirqd
- 48.66% __softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- 1.53% worker_thread
- 1.45% asm_sysvec_hyperv_stimer0
sysvec_hyperv_stimer0
irq_exit_rcu
__softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
+ 40.07% 0.02% [kernel] [k] net_rx_action
+ 40.04% 0.01% [kernel] [k] __napi_poll
+ 33.42% 0.06% [kernel] [k] netif_receive_skb_list_internal
+ 33.23% 0.05% [kernel] [k] __netif_receive_skb_list_core
+ 32.96% 0.04% [kernel] [k] ip_list_rcv
+ 32.87% 0.05% [kernel] [k] ip_sublist_rcv
+ 31.34% 0.05% [kernel] [k] ip_sublist_rcv_finish
+ 31.27% 0.06% [kernel] [k] ip_local_deliver
+ 30.69% 0.02% [kernel] [k] ip_local_deliver_finish
+ 30.66% 0.04% [kernel] [k] ip_protocol_deliver_rcu
+ 30.57% 0.26% [kernel] [k] tcp_v4_rcv
+ 30.20% 0.03% [kernel] [k] napi_gro_complete.constprop.0.isra.0
+ 29.79% 0.03% [kernel] [k] tcp_v4_do_rcv
+ 29.68% 0.15% [kernel] [k] tcp_rcv_established
+ 25.13% 0.04% [kernel] [k] tcp_data_ready
+ 25.05% 0.06% [kernel] [k] iscsi_sw_tcp_data_ready
+ 24.94% 0.05% [kernel] [k] tcp_read_sock
+ 22.24% 0.02% [kernel] [k] mlx5e_napi_poll
+ 21.04% 0.05% [kernel] [k] iscsi_sw_tcp_recv
+ 20.26% 0.30% [kernel] [k] mlx5e_poll_rx_cq
+ 19.93% 0.18% [kernel] [k] mlx5e_handle_rx_cqe
+ 17.35% 0.76% [kernel] [k] iscsi_tcp_recv_skb
+ 17.31% 0.07% [kernel] [k] napi_gro_receive
+ 16.60% 0.03% [kernel] [k] napi_gro_flush
+ 14.88% 0.32% [kernel] [k] dev_gro_receive
+ 12.23% 0.00% [kernel] [k] ret_from_fork
+ 12.23% 0.00% [kernel] [k] kthread
+ 10.17% 0.04% [kernel] [k] iscsi_complete_pdu
+ 10.06% 0.23% [kernel] [k] __iscsi_complete_pdu
+ 9.33% 0.03% [kernel] [k] iscsi_complete_task
+ 8.95% 0.52% [kernel] [k] __iscsi_put_task
+ 8.41% 0.03% [kernel] [k] blk_mq_complete_reque


[3] The first lines of the "top" output are shown here:

top - 02:09:58 up 52 min, 2 users, load average: 1050.01, 1038.20, 951.19
Tasks: 2862 total, 38 running, 2823 sleeping, 1 stopped, 0 zombie
%Cpu0 : 1.1 us, 37.0 sy, 0.0 ni, 1.3 id, 48.6 wa, 0.0 hi, 12.0 si, 0.0 st
%Cpu1 : 1.6 us, 26.1 sy, 0.0 ni, 7.9 id, 59.7 wa, 0.0 hi, 4.7 si, 0.0 st
%Cpu2 : 3.7 us, 34.4 sy, 0.0 ni, 1.3 id, 51.0 wa, 0.0 hi, 9.6 si, 0.0 st
%Cpu3 : 1.1 us, 24.6 sy, 0.0 ni, 8.0 id, 61.6 wa, 0.0 hi, 4.7 si, 0.0 st
%Cpu4 : 0.8 us, 36.0 sy, 0.0 ni, 0.4 id, 53.7 wa, 0.0 hi, 9.1 si, 0.0 st
%Cpu5 : 0.1 us, 25.9 sy, 0.0 ni, 5.6 id, 63.8 wa, 0.0 hi, 4.6 si, 0.0 st
%Cpu6 : 0.7 us, 25.1 sy, 0.0 ni, 2.8 id, 67.3 wa, 0.0 hi, 4.0 si, 0.0 st
%Cpu7 : 0.0 us, 2.1 sy, 0.0 ni, 0.1 id, 0.0 wa, 0.0 hi, 97.8 si, 0.0 st
%Cpu8 : 0.3 us, 38.2 sy, 0.0 ni, 0.1 id, 52.0 wa, 0.0 hi, 9.3 si, 0.0 st
%Cpu9 : 1.0 us, 28.3 sy, 0.0 ni, 1.4 id, 65.1 wa, 0.0 hi, 4.2 si, 0.0 st
%Cpu10 : 0.9 us, 38.2 sy, 0.0 ni, 0.5 id, 49.7 wa, 0.0 hi, 10.6 si, 0.0 st
%Cpu11 : 0.0 us, 27.0 sy, 0.0 ni, 2.3 id, 66.0 wa, 0.0 hi, 4.7 si, 0.0 st
%Cpu12 : 0.1 us, 37.8 sy, 0.0 ni, 1.2 id, 51.5 wa, 0.0 hi, 9.4 si, 0.0 st
%Cpu13 : 0.4 us, 27.3 sy, 0.0 ni, 2.9 id, 64.4 wa, 0.0 hi, 5.0 si, 0.0 st
%Cpu14 : 0.3 us, 38.2 sy, 0.0 ni, 0.6 id, 52.3 wa, 0.0 hi, 8.6 si, 0.0 st
%Cpu15 : 0.0 us, 26.4 sy, 0.0 ni, 2.9 id, 66.4 wa, 0.0 hi, 4.3 si, 0.0 st
%Cpu16 : 0.0 us, 36.3 sy, 0.0 ni, 0.5 id, 54.6 wa, 0.0 hi, 8.6 si, 0.0 st
%Cpu17 : 0.1 us, 26.8 sy, 0.0 ni, 5.0 id, 64.1 wa, 0.0 hi, 4.0 si, 0.0 st
%Cpu18 : 0.4 us, 38.2 sy, 0.0 ni, 0.8 id, 52.6 wa, 0.0 hi, 8.0 si, 0.0 st
%Cpu19 : 0.3 us, 24.5 sy, 0.0 ni, 6.0 id, 63.8 wa, 0.0 hi, 5.4 si, 0.0 st
%Cpu20 : 0.9 us, 37.0 sy, 0.0 ni, 0.6 id, 53.1 wa, 0.0 hi, 8.4 si, 0.0 st


[4] "perf top -C 0 --call-graph dwarf -d 30" (only the first lines of the output are shown here)

Samples: 63K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 15949750000 lost: 77/77 drop: 0/0
Children Self Shared Object Symbol
- 85.84% 0.01% [kernel] [k] __blk_mq_delay_run_hw_queue
- 85.83% __blk_mq_delay_run_hw_queue
- 69.75% __blk_mq_run_hw_queue
- blk_mq_sched_dispatch_requests
- 68.62% __blk_mq_sched_dispatch_requests
- 42.70% blk_mq_dispatch_rq_list
- 26.92% __blk_mq_delay_run_hw_queue
- kblockd_mod_delayed_work_on
- 26.91% mod_delayed_work_on
- try_to_grab_pending
- 4.03% asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
- __softirqentry_text_start
- blk_done_softirq
blk_complete_reqs
dm_softirq_done
- dm_done
- 3.81% blk_mq_end_request
- blk_mq_free_request
- __blk_mq_free_request
blk_mq_sched_restart
- blk_mq_run_hw_queue
- 3.81% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
try_to_grab_pending
- 15.03% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
- 14.82% try_to_grab_pending
- 2.41% asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
- __softirqentry_text_start
- blk_done_softirq
- blk_complete_reqs
- dm_softirq_done
dm_done
blk_mq_end_request
- blk_mq_free_request
- 2.30% __blk_mq_free_request
- blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
try_to_grab_pending
- 0.72% dm_mq_queue_rq
0.60% map_request
- 25.85% blk_mq_do_dispatch_ctx
- 25.36% blk_mq_dispatch_rq_list
- 13.48% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
- kblockd_mod_delayed_work_on
- 13.48% mod_delayed_work_on
- 13.27% try_to_grab_pending
- 2.21% asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
- __softirqentry_text_start
- blk_done_softirq
- blk_complete_reqs
- dm_softirq_done
- 2.09% dm_done
- 2.09% blk_mq_end_request
- 2.09% blk_mq_free_request
- __blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
try_to_grab_pending
- 11.07% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
- 11.02% try_to_grab_pending
- 1.89% asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
- __softirqentry_text_start
- blk_done_softirq
blk_complete_reqs
dm_softirq_done
- dm_done
- 1.81% blk_mq_end_request
- blk_mq_free_request
- 1.81% __blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
try_to_grab_pending
- 0.77% dm_mq_queue_rq
0.59% map_request
- 1.12% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
1.10% try_to_grab_pending
- 16.09% kblockd_mod_delayed_work_on
- mod_delayed_work_on
try_to_grab_pending
+ 85.80% 0.01% [kernel] [k] blk_mq_run_hw_queue
+ 84.18% 0.00% [kernel] [k] kblockd_mod_delayed_work_on
+ 84.17% 0.95% [kernel] [k] mod_delayed_work_on
- 83.36% 82.52% [kernel] [k] try_to_grab_pending
- 54.58% pread64 (inlined
entry_SYSCALL_64_after_hwframe
do_syscall_64
- __x64_sys_pread64
- 54.57% vfs_read
new_sync_read
xfs_file_read_iter
xfs_file_dio_read
- iomap_dio_rw
- 54.55% __iomap_dio_rw
blk_finish_plug
blk_flush_plug_list
blk_mq_flush_plug_list
- blk_mq_sched_insert_requests
- 53.64% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
__blk_mq_run_hw_queue
- blk_mq_sched_dispatch_requests
- 52.67% __blk_mq_sched_dispatch_requests
- 32.38% blk_mq_dispatch_rq_list
- 19.96% __blk_mq_delay_run_hw_queue
- kblockd_mod_delayed_work_on
- 19.96% mod_delayed_work_on
- try_to_grab_pending
- 2.49% asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
__softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 12.37% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
- 12.36% try_to_grab_pending
- asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
__softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 20.29% blk_mq_do_dispatch_ctx
- 20.02% blk_mq_dispatch_rq_list
- 10.92% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
- kblockd_mod_delayed_work_on
- 10.91% mod_delayed_work_on
- 10.90% try_to_grab_pending
- asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
__softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 9.03% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
- try_to_grab_pending
- asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
__softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 0.97% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 0.91% blk_mq_try_issue_list_directly
blk_mq_request_bypass_insert
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
__blk_mq_run_hw_queue
- blk_mq_sched_dispatch_requests
- 0.91% __blk_mq_sched_dispatch_requests
- 0.83% blk_mq_dispatch_rq_list
- 0.59% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 11.05% syscall
entry_SYSCALL_64_after_hwframe
do_syscall_64
- __x64_sys_io_submit
- 5.77% blk_finish_plug
blk_flush_plug_list
blk_mq_flush_plug_list
blk_mq_sched_insert_requests
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
__blk_mq_run_hw_queue
- blk_mq_sched_dispatch_requests
- 5.73% __blk_mq_sched_dispatch_requests
- 3.73% blk_mq_dispatch_rq_list
- 2.79% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
- try_to_grab_pending
- 0.55% asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
__softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 0.93% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 2.00% blk_mq_do_dispatch_ctx
- 1.99% blk_mq_dispatch_rq_list
- 1.14% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 0.85% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 5.28% io_submit_one
__io_submit_one.constprop.0
aio_write
xfs_file_write_iter
- xfs_file_dio_write_aligned
- 5.26% iomap_dio_rw
- __iomap_dio_rw
- 5.25% blk_finish_plug
blk_flush_plug_list
blk_mq_flush_plug_list
- blk_mq_sched_insert_requests
- 5.07% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
__blk_mq_run_hw_queue
- blk_mq_sched_dispatch_requests
- 5.01% __blk_mq_sched_dispatch_requests
- 3.40% blk_mq_dispatch_rq_list
- 2.63% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
- try_to_grab_pending
- asm_sysvec_call_function_single
sysvec_call_function_single
irq_exit_rcu
__softirqentry_text_start
blk_done_softirq
blk_complete_reqs
dm_softirq_done
dm_done
blk_mq_end_request
blk_mq_free_request
__blk_mq_free_request
blk_mq_sched_restart
blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 0.76% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending
- 1.62% blk_mq_do_dispatch_ctx
- 1.57% blk_mq_dispatch_rq_list
- 0.79% blk_mq_run_hw_queue
__blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
- mod_delayed_work_on
0.79% try_to_grab_pending
- 0.77% __blk_mq_delay_run_hw_queue
kblockd_mod_delayed_work_on
mod_delayed_work_on
try_to_grab_pending

[5] A "dmesg" snippet for a kernel between v5.15 and v5.16-rc1 (iSCSI I/O requests
cannot complete due to some bug. Is this a known issue?):

[ 45.991286] sd 21:0:0:1: [sdbp] Optimal transfer size 1048576 bytes
[ 45.991347] sd 25:0:0:2: [sdbr] Write Protect is off
[ 45.991349] sd 25:0:0:2: [sdbr] Mode Sense: 85 00 10 08
[ 45.991541] sd 23:0:0:1: [sdbq] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.991590] sd 22:0:0:2: Attached scsi generic sg71 type 0
[ 45.991746] sd 25:0:0:2: [sdbr] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.991877] sd 22:0:0:2: [sdbt] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.992077] sd 23:0:0:1: [sdbq] Optimal transfer size 1048576 bytes
[ 45.992093] sd 22:0:0:2: [sdbt] Write Protect is off
[ 45.992094] sd 22:0:0:2: [sdbt] Mode Sense: 85 00 10 08
[ 45.992165] sd 25:0:0:2: [sdbr] Optimal transfer size 1048576 bytes
[ 45.992394] sd 24:0:0:1: [sdbs] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[ 45.992426] sd 21:0:0:2: Attached scsi generic sg72 type 0
[ 45.992533] sd 22:0:0:2: [sdbt] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.992691] sd 24:0:0:1: [sdbs] Write Protect is off
[ 45.992693] sd 21:0:0:2: [sdbu] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.992694] sd 24:0:0:1: [sdbs] Mode Sense: 85 00 10 08
[ 45.992995] sd 22:0:0:2: [sdbt] Optimal transfer size 1048576 bytes
[ 45.993000] sd 21:0:0:2: [sdbu] Write Protect is off
[ 45.993002] sd 21:0:0:2: [sdbu] Mode Sense: 85 00 10 08
[ 45.993367] sd 24:0:0:1: [sdbs] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.993480] sd 23:0:0:2: Attached scsi generic sg73 type 0
[ 45.993603] sd 21:0:0:2: [sdbu] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.993993] sd 24:0:0:1: [sdbs] Optimal transfer size 1048576 bytes
[ 45.994283] sd 21:0:0:2: [sdbu] Optimal transfer size 1048576 bytes
[ 45.994359] sd 23:0:0:2: Power-on or device reset occurred
[ 45.995506] sd 24:0:0:2: Attached scsi generic sg74 type 0
[ 45.995754] sd 24:0:0:2: Power-on or device reset occurred
[ 45.996180] sd 23:0:0:2: [sdbv] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.996399] sd 23:0:0:2: [sdbv] Write Protect is off
[ 45.996402] sd 23:0:0:2: [sdbv] Mode Sense: 85 00 10 08
[ 45.996811] sd 23:0:0:2: [sdbv] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 45.997236] sd 23:0:0:2: [sdbv] Optimal transfer size 1048576 bytes
[ 45.998626] sd 24:0:0:2: [sdbw] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[ 45.998948] sd 24:0:0:2: [sdbw] Write Protect is off
[ 45.998949] sd 24:0:0:2: [sdbw] Mode Sense: 85 00 10 08
[ 45.999562] sd 24:0:0:2: [sdbw] Write cache: disabled, read cache: enabled, supports DPO and FUA
[ 46.000155] sd 24:0:0:2: [sdbw] Optimal transfer size 1048576 bytes
[ 46.115711] sd 25:0:0:2: [sdbr] Attached SCSI disk
[ 46.116112] sd 21:0:0:0: [sdbj] Attached SCSI disk
[ 46.116162] sd 22:0:0:0: [sdbk] Attached SCSI disk
[ 46.119829] sd 23:0:0:1: [sdbq] Attached SCSI disk
[ 46.120044] sd 21:0:0:2: [sdbu] Attached SCSI disk
[ 46.120146] sd 21:0:0:1: [sdbp] Attached SCSI disk
[ 46.120234] sd 22:0:0:1: [sdbo] Attached SCSI disk
[ 46.127808] sd 23:0:0:2: [sdbv] Attached SCSI disk
[ 46.127810] sd 25:0:0:1: [sdbn] Attached SCSI disk
[ 46.127941] sd 23:0:0:0: [sdbl] Attached SCSI disk
[ 46.128571] sd 24:0:0:1: [sdbs] Attached SCSI disk
[ 46.128584] sd 24:0:0:2: [sdbw] Attached SCSI disk
[ 46.128703] sd 24:0:0:0: [sdbm] Attached SCSI disk
[ 46.131549] sd 25:0:0:0: [sdbi] Attached SCSI disk
[ 46.132250] sd 22:0:0:2: [sdbt] Attached SCSI disk
[ 51.584107] device-mapper: multipath queue-length: version 0.2.0 loaded
[ 54.932029] loop6: detected capacity change from 0 to 8
[ 74.779368] hv_balloon: Max. dynamic memory size: 2097152 MB
[ 128.762895] sysrq: Show Blocked State
[ 128.827260] task:systemd-udevd state:D stack: 0 pid: 4125 ppid: 2402 flags:0x00000000
[ 128.827265] Call Trace:
[ 128.827269] __schedule+0xce6/0x13d0
[ 128.827281] ? blk_finish_plug+0x26/0x40
[ 128.827288] ? read_pages+0x1db/0x280
[ 128.827295] schedule+0x4e/0xb0
[ 128.827298] io_schedule+0x3f/0x70
[ 128.827301] wait_on_page_bit_common+0x16e/0x3b0
[ 128.827308] ? filemap_invalidate_unlock_two+0x40/0x40
[ 128.827311] put_and_wait_on_page_locked+0x4f/0x60
[ 128.827313] filemap_get_pages+0x643/0x660
[ 128.827317] filemap_read+0xbb/0x3e0
[ 128.827319] ? refill_obj_stock+0xc8/0x130
[ 128.827324] generic_file_read_iter+0xe5/0x150
[ 128.827327] blkdev_read_iter+0x38/0x70
[ 128.827332] new_sync_read+0x113/0x1a0
[ 128.827341] vfs_read+0xfe/0x1a0
[ 128.827345] ksys_read+0x67/0xe0
[ 128.827347] __x64_sys_read+0x1a/0x20
[ 128.827350] do_syscall_64+0x5c/0xc0
[ 128.827356] ? irqentry_exit_to_user_mode+0x9/0x20
[ 128.827359] ? irqentry_exit+0x19/0x30
[ 128.827361] ? exc_page_fault+0x89/0x160
[ 128.827362] ? asm_exc_page_fault+0x8/0x30
[ 128.827367] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 128.827369] RIP: 0033:0x7fee5a814142
[ 128.827373] RSP: 002b:00007ffee6f253d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 128.827375] RAX: ffffffffffffffda RBX: 00005644e54956a8 RCX: 00007fee5a814142
[ 128.827376] RDX: 0000000000000400 RSI: 00005644e54956b8 RDI: 0000000000000006
[ 128.827378] RBP: 00005644e52edec0 R08: 00005644e5495690 R09: 00007fee5a8ef200
[ 128.827379] R10: 00005644e520c010 R11: 0000000000000246 R12: 0000000000000000
[ 128.827380] R13: 0000000000000400 R14: 00005644e5495690 R15: 00005644e52edf10
[ 128.828282] task:systemd-udevd state:D stack: 0 pid: 5166 ppid: 2402 flags:0x00000000
[ 128.828284] Call Trace:
[ 128.828286] __schedule+0xce6/0x13d0
[ 128.828289] ? blk_finish_plug+0x26/0x40
[ 128.828291] ? read_pages+0x1db/0x280
[ 128.828293] schedule+0x4e/0xb0
[ 128.828296] io_schedule+0x3f/0x70
[ 128.828298] wait_on_page_bit_common+0x16e/0x3b0
[ 128.828301] ? filemap_invalidate_unlock_two+0x40/0x40
[ 128.828304] put_and_wait_on_page_locked+0x4f/0x60
[ 128.828307] filemap_get_pages+0x643/0x660
[ 128.828310] ? __cond_resched+0x19/0x30
[ 128.828313] filemap_read+0xbb/0x3e0
[ 128.828316] ? sed_ioctl+0x7d1/0x8a0
[ 128.828320] ? cp_new_stat+0x152/0x180
[ 128.828324] generic_file_read_iter+0xe5/0x150
[ 128.828326] ? event_triggers_call+0x5d/0xe0
[ 128.828332] blkdev_read_iter+0x38/0x70
[ 128.828334] new_sync_read+0x113/0x1a0
[ 128.828337] vfs_read+0xfe/0x1a0
[ 128.828340] ksys_read+0x67/0xe0
[ 128.828343] __x64_sys_read+0x1a/0x20
[ 128.828345] do_syscall_64+0x5c/0xc0
[ 128.828348] ? irqentry_exit+0x19/0x30
[ 128.828350] ? exc_page_fault+0x89/0x160
[ 128.828351] ? asm_exc_page_fault+0x8/0x30
[ 128.828353] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 128.828355] RIP: 0033:0x7fee5a814142
[ 128.828357] RSP: 002b:00007ffee6f25318 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 128.828358] RAX: ffffffffffffffda RBX: 00005644e529ff68 RCX: 00007fee5a814142
[ 128.828360] RDX: 0000000000000040 RSI: 00005644e529ff78 RDI: 0000000000000006
[ 128.828360] RBP: 00005644e52dd960 R08: 00005644e529ff50 R09: 00005644e5459b90
[ 128.828361] R10: 00005644e520c010 R11: 0000000000000246 R12: 0000001fffff0000
[ 128.828362] R13: 0000000000000040 R14: 00005644e529ff50 R15: 00005644e52dd9b0
[ 128.828430] task:systemd-udevd state:D stack: 0 pid: 5178 ppid: 2402 flags:0x00000000
[ 128.828433] Call Trace:
[ 128.828434] __schedule+0xce6/0x13d0
[ 128.828437] ? blk_finish_plug+0x26/0x40
[ 128.828439] ? read_pages+0x1db/0x280
[ 128.828441] schedule+0x4e/0xb0
[ 128.828444] io_schedule+0x3f/0x70
[ 128.828446] wait_on_page_bit_common+0x16e/0x3b0
[ 128.828449] ? filemap_invalidate_unlock_two+0x40/0x40
[ 128.828452] put_and_wait_on_page_locked+0x4f/0x60
[ 128.828454] filemap_get_pages+0x643/0x660
[ 128.828458] filemap_read+0xbb/0x3e0
[ 128.828460] ? page_counter_try_charge+0x34/0xb0
[ 128.828466] ? page_counter_cancel+0x2c/0x70
[ 128.828468] ? page_counter_uncharge+0x22/0x40
[ 128.828470] generic_file_read_iter+0xe5/0x150
[ 128.828473] blkdev_read_iter+0x38/0x70
[ 128.828475] new_sync_read+0x113/0x1a0
[ 128.828478] vfs_read+0xfe/0x1a0
[ 128.828480] ksys_read+0x67/0xe0
[ 128.828483] __x64_sys_read+0x1a/0x20
[ 128.828485] do_syscall_64+0x5c/0xc0
[ 128.828488] ? __audit_syscall_exit+0x233/0x290
[ 128.828495] ? exit_to_user_mode_prepare+0x3d/0x1c0
[ 128.828503] ? syscall_exit_to_user_mode+0x27/0x50
[ 128.828505] ? do_syscall_64+0x69/0xc0
[ 128.828508] ? do_syscall_64+0x69/0xc0
[ 128.828510] ? do_syscall_64+0x69/0xc0
[ 128.828512] ? syscall_exit_to_user_mode+0x27/0x50
[ 128.828514] ? do_syscall_64+0x69/0xc0
[ 128.828516] ? asm_exc_page_fault+0x8/0x30
[ 128.828519] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 128.828520] RIP: 0033:0x7fee5a814142
[ 128.828522] RSP: 002b:00007ffee6f25318 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 128.828523] RAX: ffffffffffffffda RBX: 00005644e5215ba8 RCX: 00007fee5a814142
[ 128.828524] RDX: 0000000000000040 RSI: 00005644e5215bb8 RDI: 0000000000000006
[ 128.828525] RBP: 00005644e52ab260 R08: 00005644e5215b90 R09: 00005644e52b81e0
[ 128.828526] R10: 00005644e520c010 R11: 0000000000000246 R12: 000004ffffff0000
[ 128.828527] R13: 0000000000000040 R14: 00005644e5215b90 R15: 00005644e52ab2b0



2021-12-11 01:29:08

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Dexuan Cui
> Sent: Thursday, December 9, 2021 7:30 PM
>
> Hi all,
> I found a random high CPU utilization issue with some database benchmark
> program running on a 192-CPU virtual machine (VM). Originally the issue
> was found with RHEL 8.4 and Ubuntu 20.04, and further tests show that the
> issue also reproduces with the latest upstream stable kernel v5.15.7, but
> *not* with v5.16-rc1. It looks like someone resolved the issue in v5.16-rc1
> recently?

I did git-bisect on the linux-block tree's for-5.16/block branch and this patch
resolves the random high CPU utilization issue (I'm not sure how):
dc5fc361d891 ("block: attempt direct issue of plug list")
https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.16/block&id=dc5fc361d891e089dfd9c0a975dc78041036b906

Do you think it would be easy to backport it to earlier versions like 5.10?
It looks like there are a lot of prerequisite patches.

Thanks,
Dexuan

2021-12-11 02:05:14

by Jens Axboe

Subject: Re: Random high CPU utilization in blk-mq with the none scheduler

On 12/10/21 6:29 PM, Dexuan Cui wrote:
>> From: Dexuan Cui
>> Sent: Thursday, December 9, 2021 7:30 PM
>>
>> Hi all,
>> I found a random high CPU utilization issue with some database benchmark
>> program running on a 192-CPU virtual machine (VM). Originally the issue
>> was found with RHEL 8.4 and Ubuntu 20.04, and further tests show that the
>> issue also reproduces with the latest upstream stable kernel v5.15.7, but
>> *not* with v5.16-rc1. It looks like someone resolved the issue in v5.16-rc1
>> recently?
>
> I did git-bisect on the linux-block tree's for-5.16/block branch and this patch
> resolves the random high CPU utilization issue (I'm not sure how):
> dc5fc361d891 ("block: attempt direct issue of plug list")
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.16/block&id=dc5fc361d891e089dfd9c0a975dc78041036b906
>
> Do you think it's easy to backport it to earlier versions like 5.10?
> It looks like there are a lot of prerequisite patches.

It's more likely the real fix is avoiding the repeated plug list scan,
which I guess makes sense. That is this commit:

commit d38a9c04c0d5637a828269dccb9703d42d40d42b
Author: Jens Axboe <[email protected]>
Date: Thu Oct 14 07:24:07 2021 -0600

block: only check previous entry for plug merge attempt

If that's the case, try 5.15.x again and do:

echo 2 > /sys/block/<dev>/queue/nomerges

for each drive you are using in the IO test, and see if that gets
rid of the excess CPU usage.

--
Jens Axboe


2021-12-11 03:10:52

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Jens Axboe <[email protected]>
> Sent: Friday, December 10, 2021 6:05 PM
> ...
> It's more likely the real fix is avoiding the repeated plug list scan,
> which I guess makes sense. That is this commit:
>
> commit d38a9c04c0d5637a828269dccb9703d42d40d42b
> Author: Jens Axboe <[email protected]>
> Date: Thu Oct 14 07:24:07 2021 -0600
>
> block: only check previous entry for plug merge attempt
>
> If that's the case, try 5.15.x again and do:
>
> echo 2 > /sys/block/<dev>/queue/nomerges
>
> for each drive you are using in the IO test, and see if that gets
> rid of the excess CPU usage.
>
> --
> Jens Axboe

Thanks for the reply! Unfortunately this does not work.

I tried the below command:

for i in `ls /sys/block/*/queue/nomerges`; do echo 2 > $i; done

and verified that "nomerges" was changed to "2", but the
excessive CPU usage still reproduces easily.

(BTW, I got my 5.15.7 kernel from here:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.7/)

root@localhost:~# uname -r
5.15.7-051507-generic
root@localhost:~# grep . /sys/block/*/queue/nomerges
/sys/block/dm-0/queue/nomerges:2
/sys/block/dm-1/queue/nomerges:2
/sys/block/dm-2/queue/nomerges:2
/sys/block/dm-3/queue/nomerges:2
/sys/block/loop0/queue/nomerges:2
/sys/block/loop1/queue/nomerges:2
/sys/block/loop2/queue/nomerges:2
/sys/block/loop3/queue/nomerges:2
/sys/block/loop4/queue/nomerges:2
/sys/block/loop5/queue/nomerges:2
/sys/block/loop6/queue/nomerges:2
/sys/block/loop7/queue/nomerges:2
/sys/block/sda/queue/nomerges:2
/sys/block/sdaa/queue/nomerges:2
/sys/block/sdab/queue/nomerges:2
/sys/block/sdac/queue/nomerges:2
/sys/block/sdad/queue/nomerges:2
/sys/block/sdae/queue/nomerges:2
/sys/block/sdaf/queue/nomerges:2
/sys/block/sdag/queue/nomerges:2
/sys/block/sdah/queue/nomerges:2
/sys/block/sdai/queue/nomerges:2
/sys/block/sdaj/queue/nomerges:2
/sys/block/sdak/queue/nomerges:2
/sys/block/sdal/queue/nomerges:2
/sys/block/sdam/queue/nomerges:2
/sys/block/sdan/queue/nomerges:2
/sys/block/sdao/queue/nomerges:2
/sys/block/sdap/queue/nomerges:2
/sys/block/sdaq/queue/nomerges:2
/sys/block/sdar/queue/nomerges:2
/sys/block/sdas/queue/nomerges:2
/sys/block/sdat/queue/nomerges:2
/sys/block/sdau/queue/nomerges:2
/sys/block/sdav/queue/nomerges:2
/sys/block/sdaw/queue/nomerges:2
/sys/block/sdax/queue/nomerges:2
/sys/block/sday/queue/nomerges:2
/sys/block/sdaz/queue/nomerges:2
/sys/block/sdb/queue/nomerges:2
/sys/block/sdba/queue/nomerges:2
/sys/block/sdbb/queue/nomerges:2
/sys/block/sdbc/queue/nomerges:2
/sys/block/sdbd/queue/nomerges:2
/sys/block/sdbe/queue/nomerges:2
/sys/block/sdbf/queue/nomerges:2
/sys/block/sdbg/queue/nomerges:2
/sys/block/sdbh/queue/nomerges:2
/sys/block/sdbi/queue/nomerges:2
/sys/block/sdbj/queue/nomerges:2
/sys/block/sdbk/queue/nomerges:2
/sys/block/sdbl/queue/nomerges:2
/sys/block/sdbm/queue/nomerges:2
/sys/block/sdbn/queue/nomerges:2
/sys/block/sdbo/queue/nomerges:2
/sys/block/sdbp/queue/nomerges:2
/sys/block/sdbq/queue/nomerges:2
/sys/block/sdbr/queue/nomerges:2
/sys/block/sdbs/queue/nomerges:2
/sys/block/sdbt/queue/nomerges:2
/sys/block/sdbu/queue/nomerges:2
/sys/block/sdbv/queue/nomerges:2
/sys/block/sdbw/queue/nomerges:2
/sys/block/sdc/queue/nomerges:2
/sys/block/sdd/queue/nomerges:2
/sys/block/sde/queue/nomerges:2
/sys/block/sdf/queue/nomerges:2
/sys/block/sdg/queue/nomerges:2
/sys/block/sdh/queue/nomerges:2
/sys/block/sdi/queue/nomerges:2
/sys/block/sdj/queue/nomerges:2
/sys/block/sdk/queue/nomerges:2
/sys/block/sdl/queue/nomerges:2
/sys/block/sdm/queue/nomerges:2
/sys/block/sdn/queue/nomerges:2
/sys/block/sdo/queue/nomerges:2
/sys/block/sdp/queue/nomerges:2
/sys/block/sdq/queue/nomerges:2
/sys/block/sdr/queue/nomerges:2
/sys/block/sds/queue/nomerges:2
/sys/block/sdt/queue/nomerges:2
/sys/block/sdu/queue/nomerges:2
/sys/block/sdv/queue/nomerges:2
/sys/block/sdw/queue/nomerges:2
/sys/block/sdx/queue/nomerges:2
/sys/block/sdy/queue/nomerges:2
/sys/block/sdz/queue/nomerges:2

2021-12-11 03:15:45

by Jens Axboe

Subject: Re: Random high CPU utilization in blk-mq with the none scheduler

On 12/10/21 8:10 PM, Dexuan Cui wrote:
>> From: Jens Axboe <[email protected]>
>> Sent: Friday, December 10, 2021 6:05 PM
>> ...
>> It's more likely the real fix is avoiding the repeated plug list scan,
>> which I guess makes sense. That is this commit:
>>
>> commit d38a9c04c0d5637a828269dccb9703d42d40d42b
>> Author: Jens Axboe <[email protected]>
>> Date: Thu Oct 14 07:24:07 2021 -0600
>>
>> block: only check previous entry for plug merge attempt
>>
>> If that's the case, try 5.15.x again and do:
>>
>> echo 2 > /sys/block/<dev>/queue/nomerges
>>
>> for each drive you are using in the IO test, and see if that gets
>> rid of the excess CPU usage.
>>
>> --
>> Jens Axboe
>
> Thanks for the reply! Unfortunately this does not work.
>
> I tried the below command:
>
> for i in `ls /sys/block/*/queue/nomerges`; do echo 2 > $i; done
>
> and verified that "nomerges" was changed to "2", but the
> excessive CPU usage still reproduces easily.

Just out of curiosity, can you do:

# perf record -a -g -- sleep 3

when you see the excessive CPU usage, then attach the output of

# perf report -g

to a reply?

How confident are you in your bisect result?

--
Jens Axboe


2021-12-11 03:44:47

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Jens Axboe <[email protected]>
>
> Just out of curiosity, can you do:
>
> # perf record -a -g -- sleep 3
>
> when you see the excessive CPU usage, then attach the output of
>
> # perf report -g
>
> to a reply?

I ran the commands against the 5.15.7 kernel and generated a
771-MB file. After I ran 'bzip2 -9', the file size is still 22 MB, which
I guess is too big to be shared via the list. I'll try to put it somewhere
and send you a link.

> How confident are you in your bisect result?
>
> --
> Jens Axboe

I'm pretty confident:
1) I can't reproduce the issue with v5.16-rc4 even if I run the test
10 times. Typically the issue repros every time.

2) If I revert the commit

dc5fc361d891 ("block: attempt direct issue of plug list")

and the related patches

ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
b22809092c70 ("block: replace always false argument with 'false'")

from v5.16-rc4, I'm able to repro the issue immediately.

Thanks,
Dexuan

2021-12-11 07:09:58

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Dexuan Cui
> Sent: Friday, December 10, 2021 7:45 PM
>
> > From: Jens Axboe <[email protected]>
> >
> > Just out of curiosity, can you do:
> >
> > # perf record -a -g -- sleep 3
> >
> > when you see the excessive CPU usage, then attach the output of
> >
> > # perf report -g
> >
> > to a reply?

I realized you only asked for the output of "perf report -g", which
is much smaller. Please see the attachment for it.
try_to_grab_pending() is the hottest function, e.g. see line 2479.

It's generated using v5.16-rc4 with the 3 commits reverted:
b22809092c70 ("block: replace always false argument with 'false'")
ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
dc5fc361d891 ("block: attempt direct issue of plug list")

> > How confident are you in your bisect result?
> >
> > Jens Axboe

I did some tests again and I'm pretty confident that the commit
b22809092c70 resolves the excessive CPU usage.

Thanks,
Dexuan


Attachments:
perf.data.output.log.bz2 (126.80 kB)
perf.data.output.log.bz2

2021-12-11 14:21:36

by Jens Axboe

Subject: Re: Random high CPU utilization in blk-mq with the none scheduler

On 12/11/21 12:09 AM, Dexuan Cui wrote:
>> From: Dexuan Cui
>> Sent: Friday, December 10, 2021 7:45 PM
>>
>>> From: Jens Axboe <[email protected]>
>>>
>>> Just out of curiosity, can you do:
>>>
>>> # perf record -a -g -- sleep 3
>>>
>>> when you see the excessive CPU usage, then attach the output of
>>>
>>> # perf report -g
>>>
>>> to a reply?
>
> I realized you only asked for the output of "perf report -g", which
> is much smaller. Please see the attachment for it.
> try_to_grab_pending() is the hottest function, e.g. see line 2479.

Sorry, can you do:

# perf report -g --no-children

instead?

--
Jens Axboe


2021-12-11 18:55:10

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Jens Axboe <[email protected]>
> Sent: Saturday, December 11, 2021 6:21 AM
>
> Sorry, can you do:
>
> # perf report -g --no-children
>
> instead?

Attached.


Attachments:
perf-report-g-no-children.log.bz2 (41.45 kB)
perf-report-g-no-children.log.bz2

2021-12-13 03:23:32

by Ming Lei

Subject: Re: Random high CPU utilization in blk-mq with the none scheduler

Hi Dexuan,

On Sat, Dec 11, 2021 at 03:10:43AM +0000, Dexuan Cui wrote:
> > From: Jens Axboe <[email protected]>
> > Sent: Friday, December 10, 2021 6:05 PM
> > ...
> > It's more likely the real fix is avoiding the repeated plug list scan,
> > which I guess makes sense. That is this commit:
> >
> > commit d38a9c04c0d5637a828269dccb9703d42d40d42b
> > Author: Jens Axboe <[email protected]>
> > Date: Thu Oct 14 07:24:07 2021 -0600
> >
> > block: only check previous entry for plug merge attempt
> >
> > If that's the case, try 5.15.x again and do:
> >
> > echo 2 > /sys/block/<dev>/queue/nomerges
> >
> > for each drive you are using in the IO test, and see if that gets
> > rid of the excess CPU usage.
> >
> > --
> > Jens Axboe
>
> Thanks for the reply! Unfortunately this does not work.
>
> I tried the below command:
>
> for i in `ls /sys/block/*/queue/nomerges`; do echo 2 > $i; done
>
> and verified that "nomerges" was changed to "2", but the
> excessive CPU usage still reproduces easily.

Can you provide the following blk-mq debugfs log?

(cd /sys/kernel/debug/block/dm-N && find . -type f -exec grep -aH . {} \;)

(cd /sys/kernel/debug/block/sdN && find . -type f -exec grep -aH . {} \;)

And it is enough to collect the log from just one dm-mpath device & one
underlying iscsi disk, so we can understand the basic blk-mq settings, such as
nr_hw_queues, queue depths, ...



Thanks,
Ming


2021-12-13 07:38:32

by Ming Lei

Subject: Re: Random high CPU utilization in blk-mq with the none scheduler

On Mon, Dec 13, 2021 at 04:20:49AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <[email protected]>
> > ...
> > Can you provide the following blk-mq debugfs log?
> >
> > (cd /sys/kernel/debug/block/dm-N && find . -type f -exec grep -aH . {} \;)
> >
> > (cd /sys/kernel/debug/block/sdN && find . -type f -exec grep -aH . {} \;)
> >
> > And it is enough to just collect log from one dm-mpath & one underlying iscsi
> > disk,
> > so we can understand basic blk-mq setting, such as nr_hw_queues, queue
> > depths, ...
> >
> > Thanks,
> > Ming
>
> Attached. I collected the logs for all the dm-* and sd* devices against
> v5.16-rc4 with the 3 commits reverted:
> b22809092c70 ("block: replace always false argument with 'false'")
> ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
> dc5fc361d891 ("block: attempt direct issue of plug list")
>
> v5.16-rc4 does not reproduce the issue, so I'm pretty sure b22809092c70 is the
> patch that resolves the excessive CPU usage.

From the log:

1) dm-mpath:
- queue depth: 2048
- busy: 848, and 62 of them are in the sw queue, so queue runs are
triggered often
- nr_hw_queues: 1
- dm-2 is in use, and dm-1/dm-3 are idle
- dm-2's dispatch busy is 8, which should be the reason why excessive CPU
usage is observed when flushing the plug list without commit dc5fc361d891, in
which hctx->dispatch_busy is simply bypassed

2) iscsi
- dispatch_busy is 0
- nr_hw_queues: 1
- queue depth: 113
- busy=~33, active_queues is 3, so each LUN/iscsi host is saturated
- 23 active LUNs, 23 * 33 = 759 in-flight commands
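(For readers following along: the per-queue figures above come straight from
blk-mq's debugfs entries. The snippet below is one way to pull out the
interesting ones; it assumes CONFIG_BLK_DEBUG_FS is enabled and debugfs is
mounted at /sys/kernel/debug, and uses the device names from this report.)

```shell
# Count hardware queues: blk-mq exposes one hctx<N> directory per hw queue.
ls -d /sys/kernel/debug/block/dm-2/hctx* | wc -l

# Per-hctx dispatch-busy estimate (a moving average in the range 0-16).
cat /sys/kernel/debug/block/dm-2/hctx0/dispatch_busy

# Tag-space details for an underlying iscsi disk: nr_tags is the queue
# depth, and the busy count reflects in-flight tags.
grep -aH . /sys/kernel/debug/block/sdc/hctx0/tags
```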

The high CPU utilization may be caused by:

1) the big queue depth of dm-mpath; the situation may improve a lot if it
is reduced to 1024 or 800. The max number of in-flight commands the iscsi
hosts allow can be worked out; if dm's queue depth is much larger than that
number, the extra commands still have to be dispatched, and the queue gets
re-run immediately, which causes the high CPU utilization.

2) the single hw queue, so contention is high; that should be avoided
on a big machine, and nvme-tcp might be better than iscsi here

3) iscsi I/O latency is a bit high

Even though CPU utilization is reduced by commit dc5fc361d891, I guess I/O
performance still can't be good with v5.16-rc.
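(For reference, a sketch of how the dm-mpath queue depth can be lowered for a
test run. dm_mq_queue_depth is the dm_mod module parameter used later in this
thread; the modprobe.d file name below is just an arbitrary choice.)

```shell
# If dm_mod is a loadable module, set the depth at module load time:
echo "options dm_mod dm_mq_queue_depth=400" > /etc/modprobe.d/dm-mq-depth.conf

# If dm_mod is built into the kernel, pass it on the kernel command line
# instead:
#   dm_mod.dm_mq_queue_depth=400

# After a reboot (or module reload), verify the active value:
cat /sys/module/dm_mod/parameters/dm_mq_queue_depth
```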


Thanks,
Ming


2021-12-13 04:21:06

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Ming Lei <[email protected]>
> ...
> Can you provide the following blk-mq debugfs log?
>
> (cd /sys/kernel/debug/block/dm-N && find . -type f -exec grep -aH . {} \;)
>
> (cd /sys/kernel/debug/block/sdN && find . -type f -exec grep -aH . {} \;)
>
> And it is enough to just collect log from one dm-mpath & one underlying iscsi
> disk,
> so we can understand basic blk-mq setting, such as nr_hw_queues, queue
> depths, ...
>
> Thanks,
> Ming

Attached. I collected the logs for all the dm-* and sd* devices against
v5.16-rc4 with the 3 commits reverted:
b22809092c70 ("block: replace always false argument with 'false'")
ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
dc5fc361d891 ("block: attempt direct issue of plug list")

v5.16-rc4 does not reproduce the issue, so I'm pretty sure b22809092c70 is the
patch that resolves the excessive CPU usage.


Attachments:
dm-sd.log.zip (594.72 kB)
dm-sd.log.zip

2021-12-13 18:43:08

by Jens Axboe

Subject: Re: Random high CPU utilization in blk-mq with the none scheduler

On 12/11/21 11:54 AM, Dexuan Cui wrote:
>> From: Jens Axboe <[email protected]>
>> Sent: Saturday, December 11, 2021 6:21 AM
>>
>> Sorry, can you do:
>>
>> # perf report -g --no-children
>>
>> instead?
>
> Attached.

I wonder if this will help; a SCSI artifact is making us hit the
mod delayed work path all the time. There might be a race in there,
but it should be fine for testing.


diff --git a/block/blk-core.c b/block/blk-core.c
index 4d8f5fe91588..cb2f4d604bad 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1625,7 +1625,9 @@ EXPORT_SYMBOL(kblockd_schedule_work);
int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
unsigned long delay)
{
- return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
+ if (!work_pending(&dwork->work))
+ return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
+ return true;
}
EXPORT_SYMBOL(kblockd_mod_delayed_work_on);


--
Jens Axboe


2021-12-14 00:31:29

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Ming Lei <[email protected]>
> Sent: Sunday, December 12, 2021 11:38 PM

Ming, thanks so much for the detailed analysis!

> From the log:
>
> 1) dm-mpath:
> - queue depth: 2048
> - busy: 848, and 62 of them are in sw queue, so run queue is often
> caused
> - nr_hw_queues: 1
> - dm-2 is in use, and dm-1/dm-3 is idle
> - dm-2's dispatch busy is 8, that should be the reason why excessive CPU
> usage is observed when flushing plug list without commit dc5fc361d891 in
> which hctx->dispatch_busy is just bypassed
>
> 2) iscsi
> - dispatch_busy is 0
> - nr_hw_queues: 1
> - queue depth: 113
> - busy=~33, active_queues is 3, so each LUN/iscsi host is saturated
> - 23 active LUNs, 23 * 33 = 759 in-flight commands
>
> The high CPU utilization may be caused by:
>
> 1) big queue depth of dm mpath, the situation may be improved much if it
> is reduced to 1024 or 800. The max allowed inflight commands from iscsi
> hosts can be figured out, if dm's queue depth is much more than this number,
> the extra commands need to dispatch, and run queue can be scheduled
> immediately, so high CPU utilization is caused.

I think you're correct:
with dm_mod.dm_mq_queue_depth=256, the max CPU utilization is 8%.
with dm_mod.dm_mq_queue_depth=400, the max CPU utilization is 12%.
with dm_mod.dm_mq_queue_depth=800, the max CPU utilization is 88%.

The performance with queue_depth=800 is poor.
The performance with queue_depth=400 is good.
The performance with queue_depth=256 is also good, and there is only a
small drop compared with the 400 case.

> 2) single hw queue, so contention should be big, which should be avoided
> in big machine, nvme-tcp might be better than iscsi here
>
> 3) iscsi io latency is a bit big
>
> Even CPU utilization is reduced by commit dc5fc361d891, io performance
> can't be good too with v5.16-rc, I guess.
>
> Thanks,
> Ming

Actually the I/O performance of v5.16-rc4 (which includes commit dc5fc361d891)
is good -- it's about the same as v5.16-rc4 + reverting dc5fc361d891 +
dm_mod.dm_mq_queue_depth=400 (or 256).

Thanks,
Dexuan


2021-12-14 00:43:37

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Jens Axboe <[email protected]>
> Sent: Monday, December 13, 2021 10:43 AM
>
> I wonder if this will help, SCSI artifact making us hit the
> mod delayed work path all the time.

Yes, I think this is what's happening here.

> Might be a race in there, but should be fine for testing.
>
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4d8f5fe91588..cb2f4d604bad 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1625,7 +1625,9 @@ EXPORT_SYMBOL(kblockd_schedule_work);
> int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
> unsigned long delay)
> {
> - return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
> + if (!work_pending(&dwork->work))
> + return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
> + return true;
> }
> EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
> --
> Jens Axboe

Thanks very much! This patch works!

In the test, "v5.16-rc4 + reverting dc5fc361d891
("block: attempt direct issue of plug list") + this patch" shows a good I/O
performance that's about the same as v5.16-rc4.

I suppose "v5.16-rc4 + this patch" would show even slightly better
performance.

It would be great if you can commit this small patch so that it can be
backported to old stable kernels.

Thanks,
Dexuan

2021-12-14 00:54:06

by Ming Lei

Subject: Re: Random high CPU utilization in blk-mq with the none scheduler

On Tue, Dec 14, 2021 at 12:31:23AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <[email protected]>
> > Sent: Sunday, December 12, 2021 11:38 PM
>
> Ming, thanks so much for the detailed analysis!
>
> > From the log:
> >
> > 1) dm-mpath:
> > - queue depth: 2048
> > - busy: 848, and 62 of them are in sw queue, so run queue is often
> > caused
> > - nr_hw_queues: 1
> > - dm-2 is in use, and dm-1/dm-3 is idle
> > - dm-2's dispatch busy is 8, that should be the reason why excessive CPU
> > usage is observed when flushing plug list without commit dc5fc361d891 in
> > which hctx->dispatch_busy is just bypassed
> >
> > 2) iscsi
> > - dispatch_busy is 0
> > - nr_hw_queues: 1
> > - queue depth: 113
> > - busy=~33, active_queues is 3, so each LUN/iscsi host is saturated
> > - 23 active LUNs, 23 * 33 = 759 in-flight commands
> >
> > The high CPU utilization may be caused by:
> >
> > 1) big queue depth of dm mpath, the situation may be improved much if it
> > is reduced to 1024 or 800. The max allowed inflight commands from iscsi
> > hosts can be figured out, if dm's queue depth is much more than this number,
> > the extra commands need to dispatch, and run queue can be scheduled
> > immediately, so high CPU utilization is caused.
>
> I think you're correct:
> with dm_mod.dm_mq_queue_depth=256, the max CPU utilization is 8%.
> with dm_mod.dm_mq_queue_depth=400, the max CPU utilization is 12%.
> with dm_mod.dm_mq_queue_depth=800, the max CPU utilization is 88%.
>
> The performance with queue_depth=800 is poor.
> The performance with queue_depth=400 is good.
> The performance with queue_depth=256 is also good, and there is only a
> small drop compared with the 400 case.

That should be the reason why the issue isn't triggered in the case of a
real I/O scheduler.

So far blk-mq doesn't provide a way to adjust the tags queue depth
dynamically.

But I don't understand the reason for the default dm_mq_queue_depth (2048):
in this situation, each LUN can only queue 113/3 requests at most, and 3 LUNs
are attached to a single iscsi host.

Mike, can you share why the default dm_mq_queue_depth is so big? It also
seems not to consider the underlying queue's depth. What is the biggest dm
rq queue depth needed to saturate all the underlying paths?

>
> > 2) single hw queue, so contention should be big, which should be avoided
> > in big machine, nvme-tcp might be better than iscsi here
> >
> > 3) iscsi io latency is a bit big
> >
> > Even CPU utilization is reduced by commit dc5fc361d891, io performance
> > can't be good too with v5.16-rc, I guess.
> >
> > Thanks,
> > Ming
>
> Actually the I/O performance of v5.16-rc4 (commit dc5fc361d891 is included)
> is good -- it's about the same as the case where v5.16-rc4 + reverting
> dc5fc361d891 + dm_mod.dm_mq_queue_depth=400 (or 256).

The single hw queue may be the root cause of your issue: there is only a
single run_work, which can be touched by almost all of the ~200 CPUs, so
cache ping-pong could be very serious.

Jens's patch may improve it more or less, please test it.

Thanks,
Ming


2021-12-14 03:09:39

by Dexuan Cui

Subject: RE: Random high CPU utilization in blk-mq with the none scheduler

> From: Ming Lei <[email protected]>
> Sent: Monday, December 13, 2021 4:53 PM
> ...
> The single hw queue may be the root cause of your issue, and there
> is only single run_work, which can be touched by all CPUs(~200) almost, so
> cache
> ping-pong could be very serious.
>
> Jens patch may improve it more or less, please test it.
>
> Thanks,
> Ming

I replied to Jens's email with the test result before your email. :-)

Jens's patch does help in the absence of dc5fc361d891
("block: attempt direct issue of plug list").

BTW, the throughput in the case of dm_mod.dm_mq_queue_depth=256
is slightly better than the case of dm_mod.dm_mq_queue_depth=2048 with
Jens's patch or dc5fc361d891.