2024-04-30 06:17:22

by David Wang

Subject: [Regression] 6.9.0: WARNING: workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]

Hi,
I got the following kernel WARNING when my 2-core KVM guest (6.9.0-rc6) was under high CPU load.

[Mon Apr 29 21:36:04 2024] ------------[ cut here ]------------
[Mon Apr 29 21:36:04 2024] workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]
[Mon Apr 29 21:36:04 2024] WARNING: CPU: 1 PID: 792 at kernel/workqueue.c:3728 check_flush_dependency+0xfd/0x120
[Mon Apr 29 21:36:04 2024] Modules linked in: xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) nft_compat(E) nf_tables(E) br_netfilter(E) bridge(E) stp(E) llc(E) ip_set(E) nfnetlink(E) ip_vs_sh(E) ip_vs_wrr(E) ip_vs_rr(E) ip_vs(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) intel_rapl_msr(E) intel_rapl_common(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_intel_dspcfg(E) sha512_ssse3(E) snd_hda_codec(E) sha512_generic(E) sha256_ssse3(E) overlay(E) sha1_ssse3(E) snd_hda_core(E) snd_hwdep(E) aesni_intel(E) snd_pcm(E) crypto_simd(E) pcspkr(E) cryptd(E) joydev(E) qxl(E) snd_timer(E) drm_ttm_helper(E) ttm(E) evdev(E) snd(E) iTCO_wdt(E) serio_raw(E) sg(E) virtio_balloon(E) virtio_console(E) iTCO_vendor_support(E) soundcore(E) qemu_fw_cfg(E) drm_kms_helper(E) button(E) binfmt_misc(E) fuse(E) drm(E) configfs(E) virtio_rng(E) rng_core(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E)
[Mon Apr 29 21:36:04 2024] hid_generic(E) usbhid(E) hid(E) sr_mod(E) cdrom(E) ahci(E) libahci(E) virtio_net(E) net_failover(E) failover(E) virtio_blk(E) libata(E) xhci_pci(E) crc32_pclmul(E) crc32c_intel(E) scsi_mod(E) scsi_common(E) lpc_ich(E) i2c_i801(E) xhci_hcd(E) psmouse(E) i2c_smbus(E) virtio_pci(E) usbcore(E) virtio_pci_legacy_dev(E) virtio_pci_modern_dev(E) usb_common(E) virtio(E) mfd_core(E) virtio_ring(E)
[Mon Apr 29 21:36:04 2024] CPU: 1 PID: 792 Comm: kworker/u13:4 Tainted: G E 6.9.0-rc6-linan-5 #197
[Mon Apr 29 21:36:04 2024] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[Mon Apr 29 21:36:04 2024] Workqueue: ttm ttm_bo_delayed_delete [ttm]
[Mon Apr 29 21:36:04 2024] RIP: 0010:check_flush_dependency+0xfd/0x120
[Mon Apr 29 21:36:04 2024] Code: 8b 45 18 48 8d b2 c0 00 00 00 49 89 e8 48 8d 8b c0 00 00 00 48 c7 c7 68 30 a4 a7 c6 05 9b 12 6e 01 01 48 89 c2 e8 53 b9 fd ff <0f> 0b e9 1e ff ff ff 80 3d 86 12 6e 01 00 75 93 e9 4a ff ff ff 66
[Mon Apr 29 21:36:04 2024] RSP: 0018:ffff9d31805abce8 EFLAGS: 00010086
[Mon Apr 29 21:36:04 2024] RAX: 0000000000000000 RBX: ffff8c8c4004ee00 RCX: 0000000000000000
[Mon Apr 29 21:36:04 2024] RDX: 0000000000000003 RSI: 0000000000000027 RDI: 00000000ffffffff
[Mon Apr 29 21:36:04 2024] RBP: ffffffffc0b53570 R08: 0000000000000000 R09: 0000000000000003
[Mon Apr 29 21:36:04 2024] R10: ffff9d31805abb80 R11: ffffffffa7cc1108 R12: ffff8c8c42eb8000
[Mon Apr 29 21:36:04 2024] R13: ffff8c8c48077900 R14: ffff8c8cbbd30b80 R15: 0000000000000001
[Mon Apr 29 21:36:04 2024] FS: 0000000000000000(0000) GS:ffff8c8cbbd00000(0000) knlGS:0000000000000000
[Mon Apr 29 21:36:04 2024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Mon Apr 29 21:36:04 2024] CR2: 00007ffd38bb3ff8 CR3: 000000010217a000 CR4: 0000000000350ef0
[Mon Apr 29 21:36:04 2024] Call Trace:
[Mon Apr 29 21:36:04 2024] <TASK>
[Mon Apr 29 21:36:04 2024] ? __warn+0x7c/0x120
[Mon Apr 29 21:36:04 2024] ? check_flush_dependency+0xfd/0x120
[Mon Apr 29 21:36:04 2024] ? report_bug+0x18d/0x1c0
[Mon Apr 29 21:36:04 2024] ? srso_return_thunk+0x5/0x5f
[Mon Apr 29 21:36:04 2024] ? handle_bug+0x3c/0x80
[Mon Apr 29 21:36:04 2024] ? exc_invalid_op+0x13/0x60
[Mon Apr 29 21:36:04 2024] ? asm_exc_invalid_op+0x16/0x20
[Mon Apr 29 21:36:04 2024] ? __pfx_qxl_gc_work+0x10/0x10 [qxl]
[Mon Apr 29 21:36:04 2024] ? check_flush_dependency+0xfd/0x120
[Mon Apr 29 21:36:04 2024] ? check_flush_dependency+0xfd/0x120
[Mon Apr 29 21:36:04 2024] __flush_work.isra.0+0xc0/0x270
[Mon Apr 29 21:36:04 2024] ? srso_return_thunk+0x5/0x5f
[Mon Apr 29 21:36:04 2024] ? srso_return_thunk+0x5/0x5f
[Mon Apr 29 21:36:04 2024] ? __queue_work.part.0+0x18b/0x3d0
[Mon Apr 29 21:36:04 2024] ? srso_return_thunk+0x5/0x5f
[Mon Apr 29 21:36:04 2024] qxl_queue_garbage_collect+0x7f/0x90 [qxl]
[Mon Apr 29 21:36:04 2024] qxl_fence_wait+0x9c/0x180 [qxl]
[Mon Apr 29 21:36:04 2024] dma_fence_wait_timeout+0x61/0x130
[Mon Apr 29 21:36:04 2024] dma_resv_wait_timeout+0x6d/0xd0
[Mon Apr 29 21:36:04 2024] ttm_bo_delayed_delete+0x26/0x80 [ttm]
[Mon Apr 29 21:36:04 2024] process_one_work+0x18c/0x3b0
[Mon Apr 29 21:36:04 2024] worker_thread+0x273/0x390
[Mon Apr 29 21:36:04 2024] ? __pfx_worker_thread+0x10/0x10
[Mon Apr 29 21:36:04 2024] kthread+0xdd/0x110
[Mon Apr 29 21:36:04 2024] ? __pfx_kthread+0x10/0x10
[Mon Apr 29 21:36:04 2024] ret_from_fork+0x30/0x50
[Mon Apr 29 21:36:04 2024] ? __pfx_kthread+0x10/0x10
[Mon Apr 29 21:36:04 2024] ret_from_fork_asm+0x1a/0x30
[Mon Apr 29 21:36:04 2024] </TASK>
[Mon Apr 29 21:36:04 2024] ---[ end trace 0000000000000000 ]---

I found the exact same warning message mentioned in
https://lore.kernel.org/lkml/[email protected]/T/#m8c2ecc83ebba8717b1290ec28d4dc15f2fa595d5
I also confirmed that the warning is caused by 07ed11afb68d94eadd4ffc082b97c2331307c5ea; reverting that commit makes the warning go away.


It seems that under heavy load, qxl_queue_garbage_collect is called from
a WQ_MEM_RECLAIM worker and flushes qxl_gc_work, which runs on a
!WQ_MEM_RECLAIM workqueue. This triggers the kernel WARNING in
check_flush_dependency.
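
For illustration, the rule behind this warning can be modeled in plain userspace C. This is only a simplified sketch and not the kernel's check_flush_dependency: struct model_wq, MODEL_WQ_MEM_RECLAIM and model_flush_would_warn are hypothetical names made up for this example, and other conditions the kernel checks are left out.

```c
/* Userspace model of the flush-dependency rule (NOT kernel code). */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit; the real flag is defined by the kernel. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0)

struct model_wq {
	const char *name;
	unsigned int flags;
};

/*
 * Returns true when a worker running on `current_wq` would trigger the
 * warning by flushing work that lives on `target`. Pass NULL for
 * `current_wq` when the caller is not a workqueue worker.
 */
bool model_flush_would_warn(const struct model_wq *current_wq,
			    const struct model_wq *target)
{
	assert(target != NULL);

	/* Waiting on a reclaim-safe workqueue is always acceptable. */
	if (target->flags & MODEL_WQ_MEM_RECLAIM)
		return false;

	/* Caller is not a worker: this rule does not apply in the model. */
	if (current_wq == NULL)
		return false;

	/* A WQ_MEM_RECLAIM worker is waiting on !WQ_MEM_RECLAIM work. */
	return (current_wq->flags & MODEL_WQ_MEM_RECLAIM) != 0;
}
```

With "ttm" modeled as a queue carrying MODEL_WQ_MEM_RECLAIM and "events" as a queue with no flags, model_flush_would_warn(&ttm, &events) is true, which corresponds to the trace above; the reverse direction and a flush between two "events" workers are both false.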

I tried the following change, which sets the flush flag to false.
The warning is gone, but I am not sure whether there are other side effects,
especially with regard to the issue mentioned in
https://lore.kernel.org/lkml/[email protected]/T/#m988ffad2000c794dcfdab7e60b03db93d8726391

Signed-off-by: David Wang <[email protected]>
---
drivers/gpu/drm/qxl/qxl_release.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/qxl/qxl_release.c b/drivers/gpu/drm/qxl/qxl_release.c
index 9febc8b73f09..f372085c5aad 100644
--- a/drivers/gpu/drm/qxl/qxl_release.c
+++ b/drivers/gpu/drm/qxl/qxl_release.c
@@ -76,7 +76,7 @@ static long qxl_fence_wait(struct dma_fence *fence, bool intr,
 	qxl_io_notify_oom(qdev);
 
 	for (count = 0; count < 11; count++) {
-		if (!qxl_queue_garbage_collect(qdev, true))
+		if (!qxl_queue_garbage_collect(qdev, false))
 			break;
 
 		if (dma_fence_is_signaled(fence))
--
2.39.2
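
To make the effect of the flush argument in the patch above concrete, here is a small userspace model. It is a sketch under assumptions, not the qxl driver source: the call trace suggests that qxl_queue_garbage_collect queues the gc work and, when flush is true, waits for it via __flush_work, and the model mirrors only that. struct model_qdev and model_queue_garbage_collect are hypothetical names, and the counters stand in for the kernel's schedule_work() and flush_work() calls.

```c
/* Userspace model of the flush flag (NOT the qxl driver source). */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_qdev {
	bool ring_idle;  /* stands in for "nothing left to garbage collect" */
	int gc_queued;   /* times gc work was queued (models schedule_work) */
	int gc_flushed;  /* times the caller waited (models flush_work) */
};

/* Returns true if gc work was queued, false if there was nothing to do. */
bool model_queue_garbage_collect(struct model_qdev *qdev, bool flush)
{
	assert(qdev != NULL);

	if (qdev->ring_idle)
		return false;

	qdev->gc_queued++;
	/* Only the flushing path reaches the flush-dependency check. */
	if (flush)
		qdev->gc_flushed++;
	return true;
}
```

In this model, calling with flush set to false still queues the work but never waits for it, so the path that produced the warning is not taken. The trade-off is that the caller's retry loop no longer waits for collection to finish before checking the fence again, which may be the kind of side effect the question above is about.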



David



2024-05-06 14:31:00

by David Wang

Subject: Re: [Regression] 6.9.0: WARNING: workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]

The kernel warning still shows up in 6.9.0-rc7.

(I think 4 high load processes on a 2-Core VM could easily trigger the kernel warning.)

Thanks
David


2024-05-07 05:04:38

by Thorsten Leemhuis

Subject: Re: [Regression] 6.9.0: WARNING: workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]



On 06.05.24 16:30, David Wang wrote:
>> On 30.04.24 08:13, David Wang wrote:

>> And confirmed that the warning is caused by
>> 07ed11afb68d94eadd4ffc082b97c2331307c5ea and reverting it can fix.
>
> The kernel warning still shows up in 6.9.0-rc7.
> (I think 4 high load processes on a 2-Core VM could easily trigger the kernel warning.)

Thx for the report. Linus just reverted the commit 07ed11afb68 you
mentioned in your initial mail (I put that quote in again, see above):

3628e0383dd349 ("Reapply "drm/qxl: simplify qxl_fence_wait"")
https://git.kernel.org/torvalds/c/3628e0383dd349f02f882e612ab6184e4bb3dc10

So this hopefully should be history now.

Ciao, Thorsten

2024-05-08 12:35:22

by Anders Blomdell

Subject: Re: [Regression] 6.9.0: WARNING: workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]



On 2024-05-07 07:04, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>
> On 06.05.24 16:30, David Wang wrote:
>>> On 30.04.24 08:13, David Wang wrote:
>
>>> And confirmed that the warning is caused by
>>> 07ed11afb68d94eadd4ffc082b97c2331307c5ea and reverting it can fix.
>>
>> The kernel warning still shows up in 6.9.0-rc7.
>> (I think 4 high load processes on a 2-Core VM could easily trigger the kernel warning.)
>
> Thx for the report. Linus just reverted the commit 07ed11afb68 you
> mentioned in your initial mail (I put that quote in again, see above):
>
> 3628e0383dd349 ("Reapply "drm/qxl: simplify qxl_fence_wait"")
> https://git.kernel.org/torvalds/c/3628e0383dd349f02f882e612ab6184e4bb3dc10
>
> So this hopefully should be history now.
>
> Ciao, Thorsten
>
Since this affects the 6.8 series (6.8.7 and onwards), I made a CC to [email protected]

/Anders

2024-05-08 12:51:28

by Thorsten Leemhuis

Subject: Re: [Regression] 6.9.0: WARNING: workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete [ttm] is flushing !WQ_MEM_RECLAIM events:qxl_gc_work [qxl]

On 08.05.24 14:35, Anders Blomdell wrote:
> On 2024-05-07 07:04, Linux regression tracking (Thorsten Leemhuis) wrote:
>> On 06.05.24 16:30, David Wang wrote:
>>>> On 30.04.24 08:13, David Wang wrote:
>>
>>>> And confirmed that the warning is caused by
>>>> 07ed11afb68d94eadd4ffc082b97c2331307c5ea and reverting it can fix.
>>>
>>> The kernel warning still shows up in 6.9.0-rc7.
>>> (I think 4 high load processes on a 2-Core VM could easily trigger
>>> the kernel warning.)
>>
>> Thx for the report. Linus just reverted the commit 07ed11afb68 you
>> mentioned in your initial mail (I put that quote in again, see above):
>>
>> 3628e0383dd349 ("Reapply "drm/qxl: simplify qxl_fence_wait"")
>> https://git.kernel.org/torvalds/c/3628e0383dd349f02f882e612ab6184e4bb3dc10
>>
>> So this hopefully should be history now.
>>
> Since this affects the 6.8 series (6.8.7 and onwards), I made a CC to
> [email protected]

Ohh, good idea, I thought Linus had added a stable tag, but that is not
the case. Adding Greg as well and making things explicit:

@Greg: you might want to add 3628e0383dd349 ("Reapply "drm/qxl: simplify
qxl_fence_wait"") to all branches that received 07ed11afb68d94 ("Revert
"drm/qxl: simplify qxl_fence_wait"") (which afaics went into v6.8.7,
v6.6.28, v6.1.87, and v5.15.156).

Ciao, Thorsten