From: Jiwei Sun <[email protected]>
A soft lockup is triggered when running an fio test on a 448-core
server, producing a warning like the following:
[ 519.743511] watchdog: BUG: soft lockup - CPU#305 stuck for 144s! [fio:11226]
[ 519.744834] CPU: 305 PID: 11226 Comm: fio Kdump: loaded Tainted: G L 6.3.0 #48
[ 519.744850] Hardware name: Lenovo ThinkSystem SR645 V3 MB,Genoa,DDR5,Oahu,1U/SB27B31174, BIOS KAE111F-2.10 04/17/2023
[ 519.744867] RIP: 0010:__do_softirq+0x93/0x2d3
[ 519.744901] Code: c3 c0 5a ff c7 44 24 08 0a 00 00 00 48 c7 c7 45 6b c5 a0 e8 1f ac fe ff 65 66 c7 05 dd 76 a8 5f 00 00 fb 49 c7 c2 c0 60 20 a1 <eb> 70 48 98 48 c7 c7 f8 fc c5 a0 4d 8d 3c c2 4c 89 fd 48 81 ed c0
[ 519.744916] RSP: 0018:ff6e423a8b584fa0 EFLAGS: 00000202
[ 519.744944] RAX: 0000000000000131 RBX: ff3143f856cd3f80 RCX: 0000000000000000
[ 519.744959] RDX: 000000562c5e4652 RSI: ffffffffa0c56b45 RDI: ffffffffa0c35ab3
[ 519.744974] RBP: 0000000000000000 R08: 0000000000000b01 R09: 0000000000031600
[ 519.744988] R10: ffffffffa12060c0 R11: 00000054ea7837c0 R12: 0000000000000080
[ 519.745003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 519.745017] FS: 00007fa41bb9e000(0000) GS:ff3143f7af840000(0000) knlGS:0000000000000000
[ 519.745032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 519.745046] CR2: 00007fa3fc05b000 CR3: 00000008d6f56004 CR4: 0000000000771ee0
[ 519.745060] PKRU: 55555554
[ 519.745075] Call Trace:
[ 519.745095] <IRQ>
[ 519.745160] ? ftrace_regs_caller_end+0x61/0x61
[ 519.745185] irq_exit_rcu+0xcd/0x100
[ 519.745217] sysvec_apic_timer_interrupt+0xa2/0xd0
[ 519.745251] </IRQ>
[ 519.745265] <TASK>
[ 519.745292] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 519.745318] RIP: 0010:_raw_spin_unlock_irqrestore+0x1d/0x40
[ 519.745339] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa e8 77 9b 0b 20 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 01 fb bf 01 00 00 00 <e8> be 65 58 ff 65 8b 05 9f 81 a8 5f 85 c0 74 05 c3 cc cc cc cc 0f
[ 519.745354] RSP: 0018:ff6e423b244279f0 EFLAGS: 00000206
[ 519.745381] RAX: 000000098b346200 RBX: ff3143f0f0fccc00 RCX: 0000000000000286
[ 519.745396] RDX: 0000000000000700 RSI: 0000000000000286 RDI: 0000000000000001
[ 519.745410] RBP: ff3143f835e5c600 R08: 000000058bb02000 R09: 0000000000000000
[ 519.745424] R10: 00000107d08ab6ea R11: 0000000000000800 R12: 0000000000000820
[ 519.745438] R13: ff3143f835e5c610 R14: ff6e423b24427a50 R15: ff3143f90b346200
[ 519.745576] dma_pool_alloc+0x184/0x200
[ 519.745647] nvme_prep_rq.part.58+0x416/0x840 [nvme]
[ 519.745760] nvme_queue_rq+0x7b/0x210 [nvme]
[ 519.745821] __blk_mq_try_issue_directly+0x153/0x1c0
[ 519.745906] blk_mq_try_issue_directly+0x15/0x50
[ 519.745935] blk_mq_submit_bio+0x4c4/0x510
[ 519.746047] submit_bio_noacct_nocheck+0x331/0x370
[ 519.746135] blkdev_direct_IO.part.22+0x138/0x5b0
[ 519.746165] ? __fsnotify_parent+0x119/0x350
[ 519.746274] blkdev_read_iter+0xe3/0x170
[ 519.746325] aio_read+0xf6/0x1b0
[ 519.746423] ? 0xffffffffc066309b
[ 519.746567] ? io_submit_one+0x18c/0xd60
[ 519.746587] ? aio_read+0x5/0x1b0
[ 519.746608] io_submit_one+0x18c/0xd60
[ 519.746672] ? __x64_sys_io_submit+0x4/0x180
[ 519.746794] ? __x64_sys_io_submit+0x93/0x180
[ 519.746815] __x64_sys_io_submit+0x93/0x180
[ 519.746838] ? __pfx___x64_sys_io_submit+0x10/0x10
[ 519.746923] do_syscall_64+0x3b/0x90
[ 519.746958] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 519.746983] RIP: 0033:0x7fa41b83ee5d
[ 519.747009] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[ 519.747024] RSP: 002b:00007fff698e8cc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[ 519.747053] RAX: ffffffffffffffda RBX: 00007fa41bb9df70 RCX: 00007fa41b83ee5d
[ 519.747067] RDX: 000055a53098ba68 RSI: 0000000000000001 RDI: 00007fa3fc05b000
[ 519.747082] RBP: 00007fa3fc05b000 R08: 0000000000000000 R09: 0000000000000048
[ 519.747095] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 519.747110] R13: 0000000000000000 R14: 000055a53098ba68 R15: 000055a530989440
[ 519.747245] </TASK>
We have also collected the corresponding ftrace log:
515.948558 | 305) | nvme_irq [nvme]() {
+--1503 lines: 515.948560 | 305) | blk_mq_complete_request_remote() {---
515.948995 | 305) | nvme_pci_complete_batch [nvme]() {
515.948996 | 305) | nvme_unmap_data [nvme]() {
+-- 3 lines: 515.948996 | 305) | dma_unmap_sg_attrs() {---
515.948997 | 305) | dma_pool_free() {
515.948997 | 305) | _raw_spin_lock_irqsave() {
515.948997 | 305) 0.190 us | preempt_count_add();
515.948999 | 305) * 14893.70 us | native_queued_spin_lock_slowpath();
515.963893 | 305) * 14895.59 us | }
515.963929 | 305) | _raw_spin_unlock_irqrestore() {
515.963930 | 305) 0.160 us | preempt_count_sub();
515.963930 | 305) 1.011 us | }
515.963931 | 305) * 14933.71 us | }
+-- 9 lines: 515.963931 | 305) | mempool_free() {---
515.963933 | 305) * 14937.45 us | }
515.963933 | 305) 0.160 us | nvme_complete_batch_req [nvme_core]();
515.963934 | 305) | nvme_unmap_data [nvme]() {
+-- 3 lines: 515.963934 | 305) | dma_unmap_sg_attrs() {---
515.963935 | 305) | dma_pool_free() {
515.963935 | 305) | _raw_spin_lock_irqsave() {
515.963936 | 305) 0.170 us | preempt_count_add();
515.963936 | 305) * 19865.82 us | native_queued_spin_lock_slowpath();
515.983803 | 305) * 19867.21 us | }
515.983833 | 305) | _raw_spin_unlock_irqrestore() {
515.983833 | 305) 0.200 us | preempt_count_sub();
515.983834 | 305) 1.312 us | }
515.983834 | 305) * 19898.71 us | }
+--- 11 lines: 515.983834 | 305) | mempool_free() {---
515.983837 | 305) * 19902.73 us | }
515.983837 | 305) 0.170 us | nvme_complete_batch_req [nvme_core]();
+--40055 lines: 515.983837 | 305) | nvme_unmap_data [nvme]() {---
520.816867 | 305) $ 4867871 us | }
520.816867 | 305) $ 4868309 us | }
520.816868 | 305) $ 4868310 us | }
According to the above two logs, nvme_irq() took too much time, about
4.8 seconds in this case, and the main bottleneck is contention on the
spin lock pool->lock.
To avoid the lockup, we can enable NVMe IRQ threading by adding
nvme.use_threaded_interrupts=1 to the kernel command line together with
preempt=full, but on a kernel built with preempt=off,
nvme.use_threaded_interrupts=1 does not help. So, add cond_resched() to
the loop in nvme_complete_batch().
Reviewed-by: Adrian Huang <[email protected]>
Signed-off-by: Jiwei Sun <[email protected]>
---
drivers/nvme/host/nvme.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bf46f122e9e1..e247cb664e9e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -725,6 +725,7 @@ static __always_inline void nvme_complete_batch(struct io_comp_batch *iob,
rq_list_for_each(&iob->req_list, req) {
fn(req);
nvme_complete_batch_req(req);
+ cond_resched();
}
blk_mq_end_request_batch(iob);
}
--
2.27.0
On Tue, May 02, 2023 at 08:54:12PM +0800, [email protected] wrote:
> From: Jiwei Sun <[email protected]>
>
> A soft lockup is triggered when running an fio test on a 448-core
> server, producing a warning like the following:
nvme_complete_batch can be called from irq context, so I don't think
a cond_resched is a good idea here.
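For reference, the call path that reaches nvme_complete_batch() looks
roughly like this (a simplified sketch loosely following
drivers/nvme/host/pci.c, error handling omitted); the whole batch is
completed from the hard interrupt handler, where rescheduling is not
allowed:

/*
 * Simplified sketch, not the exact driver source: completions are
 * reaped and finished straight from hard IRQ context.
 */
static irqreturn_t nvme_irq(int irq, void *data)
{
	struct nvme_queue *nvmeq = data;
	DEFINE_IO_COMP_BATCH(iob);		/* batch of completed requests */

	if (nvme_poll_cq(nvmeq, &iob)) {	/* reap CQ entries in IRQ context */
		if (!rq_list_empty(iob.req_list))
			/* -> nvme_complete_batch() -> the loop patched above */
			nvme_pci_complete_batch(&iob);
		return IRQ_HANDLED;
	}
	return IRQ_NONE;
}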
On Tue, May 02, 2023 at 08:54:12PM +0800, [email protected] wrote:
> According to the above two logs, nvme_irq() took too much time, about
> 4.8 seconds in this case, and the main bottleneck is contention on the
> spin lock pool->lock.
I have a fix that makes pool->lock critical section much shorter and constant
time. It was staged in mm tree for months, but mysteriously it's missing...
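The rough idea (a sketch only; the field and helper names below are
illustrative, not necessarily the merged code) is to link the free blocks
of all pages into a single list, so that dma_pool_alloc() and
dma_pool_free() become constant-time list operations under pool->lock
instead of walking the pool's page list:

/*
 * Illustrative sketch: free blocks linked across all pages of a pool.
 * Assumes struct dma_pool gains a 'next_block' free-list head.
 */
struct dma_block {
	struct dma_block *next_block;	/* next free block in the pool */
	dma_addr_t dma;			/* bus address of this block */
};

/* Both helpers run with pool->lock held and are O(1). */
static struct dma_block *pool_block_pop(struct dma_pool *pool)
{
	struct dma_block *block = pool->next_block;

	if (block)
		pool->next_block = block->next_block;
	return block;
}

static void pool_block_push(struct dma_pool *pool, struct dma_block *block)
{
	block->next_block = pool->next_block;
	pool->next_block = block;
}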
On Wed, May 03, 2023 at 10:28:53AM -0600, Keith Busch wrote:
> On Tue, May 02, 2023 at 08:54:12PM +0800, [email protected] wrote:
> > According to the above two logs, nvme_irq() took too much time, about
> > 4.8 seconds in this case, and the main bottleneck is contention on the
> > spin lock pool->lock.
>
> I have a fix that makes pool->lock critical section much shorter and constant
> time. It was staged in mm tree for months, but mysteriously it's missing...
Well this is certainly odd: the commit I'm referring to is in Linus' log, but
the file it modified is not updated. What kind of 'git' madness is this??
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4de12a032fa6d0670aa0bb43a2bf9f812680d0f
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/mm/dmapool.c
On Wed, May 03, 2023 at 05:42:22PM -0600, Keith Busch wrote:
> On Wed, May 03, 2023 at 10:28:53AM -0600, Keith Busch wrote:
> > On Tue, May 02, 2023 at 08:54:12PM +0800, [email protected] wrote:
> > > According to the above two logs, nvme_irq() took too much time, about
> > > 4.8 seconds in this case, and the main bottleneck is contention on the
> > > spin lock pool->lock.
> >
> > I have a fix that makes pool->lock critical section much shorter and constant
> > time. It was staged in mm tree for months, but mysteriously it's missing...
>
> Well this is certainly odd: the commit I'm referring to is in Linus' log, but
> the file it modified is not updated. What kind of 'git' madness is this??
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4de12a032fa6d0670aa0bb43a2bf9f812680d0f
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/mm/dmapool.c
>
Hmm, this is very interesting indeed.
It looks like:
a4de12a032fa ("dmapool: link blocks across pages") matches
https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
so everything good.
However, the following commit:
2d55c16c0c54 ("dmapool: create/destroy cleanup")
does not match what was on the list:
https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
It kind of looks like
2d55c16c0c54 ("dmapool: create/destroy cleanup") is actually
a revert of a4de12a032fa ("dmapool: link blocks across pages"),
plus some additional changes. Scary.
Additionally, it seems like the fix:
https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
was never merged.
Something is fishy here...
(Perhaps the maintainer did a git revert instead of applying the fix...
and accidentally squashed the revert with a proper commit...
But even that does not make sense, as there simply seems to be too many
lines changed in 2d55c16c0c54 ("dmapool: create/destroy cleanup") for that
to be the case...)
Kind regards,
Niklas
On Thu, May 04, 2023 at 11:39:53AM +0200, Niklas Cassel wrote:
> On Wed, May 03, 2023 at 05:42:22PM -0600, Keith Busch wrote:
> > On Wed, May 03, 2023 at 10:28:53AM -0600, Keith Busch wrote:
> > > On Tue, May 02, 2023 at 08:54:12PM +0800, [email protected] wrote:
> > > > According to the above two logs, nvme_irq() took too much time, about
> > > > 4.8 seconds in this case, and the main bottleneck is contention on the
> > > > spin lock pool->lock.
> > >
> > > I have a fix that makes pool->lock critical section much shorter and constant
> > > time. It was staged in mm tree for months, but mysteriously it's missing...
> >
> > Well this is certainly odd: the commit I'm referring to is in Linus' log, but
> > the file it modified is not updated. What kind of 'git' madness is this??
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a4de12a032fa6d0670aa0bb43a2bf9f812680d0f
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/mm/dmapool.c
> >
>
> Hmm, this is very interesting indeed.
>
> It looks like:
> a4de12a032fa ("dmapool: link blocks across pages") matches
> https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
> so everything good.
>
> However, the following commit:
> 2d55c16c0c54 ("dmapool: create/destroy cleanup")
> does not match what was on the list:
> https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
>
>
>
> It kind of looks like
> 2d55c16c0c54 ("dmapool: create/destroy cleanup") is actually
> a revert of a4de12a032fa ("dmapool: link blocks across pages"),
> plus some additional changes. Scary.
>
>
>
> Additionally, it seems like the fix:
> https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
> was never merged.
>
>
> Something is fishy here...
>
> (Perhaps the maintainer did a git revert instead of applying the fix...
> and accidentally squashed the revert with a proper commit...
> But even that does not make sense, as there simply seems to be too many
> lines changed in 2d55c16c0c54 ("dmapool: create/destroy cleanup") for that
> to be the case...)
It seems like the additional lines in
2d55c16c0c54 ("dmapool: create/destroy cleanup") is explained by the
maintainer (probably accidentally) reverting additional patches in that
very same commit.
It seems like that commit reverted all changes to mm/dmapool.c since v6.3:
$ git log --oneline 2d55c16c0c54325bf15286cfa6ba6c268036b9e4 --not v6.3 mm/dmapool.c
2d55c16c0c54 dmapool: create/destroy cleanup
a4de12a032fa dmapool: link blocks across pages
9d062a8a4c6d dmapool: don't memset on free twice
887aef615818 dmapool: simplify freeing
2591b516533b dmapool: consolidate page initialization
36d1a28921a4 dmapool: rearrange page alloc failure handling
52e7d5653979 dmapool: move debug code to own functions
19f504584038 dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
347e4e44c0a9 dmapool: cleanup integer types
65216545436b dmapool: use sysfs_emit() instead of scnprintf()
7f796d141c07 dmapool: remove checks for dev == NULL
$ git diff v6.3 2d55c16c0c54325bf15286cfa6ba6c268036b9e4 mm/dmapool.c
<nothing>
Should probably tell Andrew so that he has time to send out a new PR
before the merge window closes. (I added him in To: .)
He should probably pick up this fix too:
https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Kind regards,
Niklas
On Thu, May 04, 2023 at 10:12:46AM +0000, Niklas Cassel wrote:
> On Thu, May 04, 2023 at 11:39:53AM +0200, Niklas Cassel wrote:
> >
> > Something is fishy here...
> >
> > (Perhaps the maintainer did a git revert instead of applying the fix...
> > and accidentally squashed the revert with a proper commit...
> > But even that does not make sense, as there simply seems to be too many
> > lines changed in 2d55c16c0c54 ("dmapool: create/destroy cleanup") for that
> > to be the case...)
>
> It seems like the additional lines in
> 2d55c16c0c54 ("dmapool: create/destroy cleanup") is explained by the
> maintainer (probably accidentally) reverting additional patches in that
> very same commit.
>
> It seems like that commit reverted all changes to mm/dmapool.c since v6.3:
Spot on, thanks for confirming! Somehow a series revert was squashed into
the final patch. :(
We reached the same conclusion pretty late yesterday, so I'll wait for
Andrew to suggest how to proceed. I think we have to rebase the whole series
with the correct final patch and resend the pull.
> $ git log --oneline 2d55c16c0c54325bf15286cfa6ba6c268036b9e4 --not v6.3 mm/dmapool.c
> 2d55c16c0c54 dmapool: create/destroy cleanup
> a4de12a032fa dmapool: link blocks across pages
> 9d062a8a4c6d dmapool: don't memset on free twice
> 887aef615818 dmapool: simplify freeing
> 2591b516533b dmapool: consolidate page initialization
> 36d1a28921a4 dmapool: rearrange page alloc failure handling
> 52e7d5653979 dmapool: move debug code to own functions
> 19f504584038 dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
> 347e4e44c0a9 dmapool: cleanup integer types
> 65216545436b dmapool: use sysfs_emit() instead of scnprintf()
> 7f796d141c07 dmapool: remove checks for dev == NULL
>
>
> $ git diff v6.3 2d55c16c0c54325bf15286cfa6ba6c268036b9e4 mm/dmapool.c
> <nothing>
>
>
> Should probably tell Andrew so that he has time to send out a new PR
> before the merge window closes. (I added him in To: .)
>
> He should probably pick up this fix too:
> https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Andrew actually squashed that fix into the original patch.
On Thu, May 04, 2023 at 09:06:20AM -0600, Keith Busch wrote:
> On Thu, May 04, 2023 at 10:12:46AM +0000, Niklas Cassel wrote:
> > On Thu, May 04, 2023 at 11:39:53AM +0200, Niklas Cassel wrote:
> > >
> > > Something is fishy here...
> > >
> > > (Perhaps the maintainer did a git revert instead of applying the fix...
> > > and accidentally squashed the revert with a proper commit...
> > > But even that does not make sense, as there simply seems to be too many
> > > lines changed in 2d55c16c0c54 ("dmapool: create/destroy cleanup") for that
> > > to be the case...)
> >
> > It seems like the additional lines in
> > 2d55c16c0c54 ("dmapool: create/destroy cleanup") are explained by the
> > maintainer (probably accidentally) reverting additional patches in that
> > very same commit.
> >
> > It seems like that commit reverted all changes to mm/dmapool.c since v6.3:
>
> Spot on, thanks for confirming! Somehow a series revert was squashed into
> the final patch. :(
>
> We reached the same conclusion pretty late yesterday, so I'll wait for
> Andrew to suggest how to proceed. I think we have to rebase the whole series
> with the correct final patch and resend the pull.
>
> > $ git log --oneline 2d55c16c0c54325bf15286cfa6ba6c268036b9e4 --not v6.3 mm/dmapool.c
> > 2d55c16c0c54 dmapool: create/destroy cleanup
> > a4de12a032fa dmapool: link blocks across pages
> > 9d062a8a4c6d dmapool: don't memset on free twice
> > 887aef615818 dmapool: simplify freeing
> > 2591b516533b dmapool: consolidate page initialization
> > 36d1a28921a4 dmapool: rearrange page alloc failure handling
> > 52e7d5653979 dmapool: move debug code to own functions
> > 19f504584038 dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
> > 347e4e44c0a9 dmapool: cleanup integer types
> > 65216545436b dmapool: use sysfs_emit() instead of scnprintf()
> > 7f796d141c07 dmapool: remove checks for dev == NULL
Hello Keith, Andrew,
Note that the first four patches were not part of Keith's series,
but they were reverted as well.
> Andrew actually squashed that fix into the original patch.
OK, then it is strange that a revert was done at all.
Kind regards,
Niklas
On Tue, May 02, 2023 at 08:54:12PM +0800, [email protected] wrote:
> From: Jiwei Sun <[email protected]>
>
> A soft lockup is triggered when running an fio test on a 448-core
> server, producing a warning like the following:
...
> According to the above two logs, nvme_irq() took too much time, about
> 4.8 seconds in this case, and the main bottleneck is contention on the
> spin lock pool->lock.
The most recent 6.4-rc has included a significant changeset to the pool
allocator that may show a considerable difference in pool->lock timing.
It would be interesting to hear if it changes your observation with your
448-core setup. Would you be able to re-run your experiments that
produced the soft lockup with this kernel on that machine?
Hi Keith,
On 2023/5/16 04:40, Keith Busch wrote:
> On Tue, May 02, 2023 at 08:54:12PM +0800, [email protected] wrote:
>> From: Jiwei Sun <[email protected]>
>>
>> A soft lockup is triggered when running an fio test on a 448-core
>> server, producing a warning like the following:
> ...
>
>> According to the above two logs, nvme_irq() took too much time, about
>> 4.8 seconds in this case, and the main bottleneck is contention on the
>> spin lock pool->lock.
> The most recent 6.4-rc has included a significant changeset to the pool
> allocator that may show a considerable difference in pool->lock timing.
> It would be interesting to hear if it changes your observation with your
> 448-core setup. Would you be able to re-run your experiments that
> produced the soft lockup with this kernel on that machine?
We have run some tests with the latest kernel, and the issue can no longer
be reproduced. We have also analyzed the ftrace log of nvme_irq() and did
NOT find any contention on the spin lock pool->lock; every dma_pool_free()
call completed within 2us.
287) | dma_pool_free() {
287) 0.150 us | _raw_spin_lock_irqsave();
287) 0.421 us | _raw_spin_unlock_irqrestore();
287) 1.472 us | }
+-- 63 lines: 287) | mempool_free() {-----------
435) | dma_pool_free() {
435) 0.170 us | _raw_spin_lock_irqsave();
435) 0.210 us | _raw_spin_unlock_irqrestore();
435) 1.172 us | }
+--145 lines: 435) | mempool_free() {---------
317) | dma_pool_free() {
317) 0.160 us | _raw_spin_lock_irqsave();
317) 0.401 us | _raw_spin_unlock_irqrestore();
317) 1.252 us | }
Based on the test results and an analysis of the code, your patch has
fixed this performance issue.
By the way, another task-hung issue was triggered during the test. We are
still analyzing it, but that is another story; we can discuss it in a
separate thread.
Thanks,
Regards,
Jiwei