I went to run my tests on my VMs and the tests hung on boot up.
Unfortunately, the most I ever got out was:
[ 93.607888] Testing event system initcall: OK
[ 93.667730] Running tests on all trace events:
[ 93.669757] Testing all events: OK
[ 95.631064] ------------[ cut here ]------------
Timed out after 60 seconds
I ran a bisect and it came up with:
# first bad commit: [07ed11afb68d94eadd4ffc082b97c2331307c5ea] Revert "drm/qxl: simplify qxl_fence_wait"
I checked out 07ed11afb68d94eadd4ffc082b97c2331307c5ea~1 and it booted
fine. Added back that commit, it failed to boot. I did this twice, and got
the same results.
But the last time I ran it, it did trigger this:
------------[ cut here ]------------
======================================================
WARNING: possible circular locking dependency detected
6.9.0-rc1-test-00021-g07ed11afb68d #5 Not tainted
------------------------------------------------------
kworker/u24:3/119 is trying to acquire lock:
ffffffff95aa4600 (console_owner){....}-{0:0}, at: console_flush_all+0x1f5/0x530
but task is already holding lock:
ffff93c4bbd37218 (&pool->lock){-.-.}-{2:2}, at: __flush_work+0xc1/0x440
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&pool->lock){-.-.}-{2:2}:
_raw_spin_lock+0x33/0x40
__queue_work+0xd6/0x610
queue_work_on+0x8a/0x90
soft_cursor+0x1a0/0x230
bit_cursor+0x386/0x5f0
hide_cursor+0x27/0xb0
vt_console_print+0x474/0x490
console_flush_all+0x22e/0x530
console_unlock+0x56/0x160
vprintk_emit+0x160/0x390
dev_printk_emit+0xa5/0xd0
_dev_info+0x79/0xa0
__drm_fb_helper_initial_config_and_unlock+0x3a9/0x5f0
drm_fbdev_generic_client_hotplug+0x69/0xc0
drm_client_register+0x7b/0xc0
qxl_pci_probe+0x107/0x1a0
local_pci_probe+0x45/0xa0
pci_device_probe+0xc7/0x240
really_probe+0xd6/0x390
__driver_probe_device+0x78/0x150
driver_probe_device+0x1f/0x90
__driver_attach+0xd6/0x1d0
bus_for_each_dev+0x8f/0xe0
bus_add_driver+0x119/0x220
driver_register+0x59/0x100
do_one_initcall+0x76/0x3c0
kernel_init_freeable+0x3a5/0x5b0
kernel_init+0x1a/0x1c0
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1a/0x30
-> #0 (console_owner){....}-{0:0}:
__lock_acquire+0x13e7/0x2180
lock_acquire+0xd9/0x300
console_flush_all+0x212/0x530
console_unlock+0x56/0x160
vprintk_emit+0x160/0x390
_printk+0x64/0x80
__warn_printk+0x8e/0x180
check_flush_dependency+0xfd/0x120
__flush_work+0xfa/0x440
qxl_queue_garbage_collect+0x83/0x90
qxl_fence_wait+0xa4/0x1a0
dma_fence_wait_timeout+0x98/0x1e0
dma_resv_wait_timeout+0x7f/0xe0
ttm_bo_delayed_delete+0x2b/0x90
process_one_work+0x228/0x740
worker_thread+0x1dc/0x3c0
kthread+0xf2/0x120
ret_from_fork+0x34/0x50
ret_from_fork_asm+0x1a/0x30
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&pool->lock);
                               lock(console_owner);
                               lock(&pool->lock);
  lock(console_owner);
*** DEADLOCK ***
6 locks held by kworker/u24:3/119:
#0: ffff93c440245948 ((wq_completion)ttm){+.+.}-{0:0}, at: process_one_work+0x43a/0x740
#1: ffffa01380d83e60 ((work_completion)(&bo->delayed_delete)){+.+.}-{0:0}, at: process_one_work+0x1e2/0x740
#2: ffffffff95b17880 (rcu_read_lock){....}-{1:2}, at: __flush_work+0x86/0x440
#3: ffff93c4bbd37218 (&pool->lock){-.-.}-{2:2}, at: __flush_work+0xc1/0x440
#4: ffffffff95b149c0 (console_lock){+.+.}-{0:0}, at: _printk+0x64/0x80
#5: ffffffff95b14a10 (console_srcu){....}-{0:0}, at: console_flush_all+0x7b/0x530
stack backtrace:
CPU: 2 PID: 119 Comm: kworker/u24:3 Not tainted 6.9.0-rc1-test-00021-g07ed11afb68d #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: ttm ttm_bo_delayed_delete
Call Trace:
<TASK>
dump_stack_lvl+0x77/0xb0
check_noncircular+0x148/0x160
__lock_acquire+0x13e7/0x2180
lock_acquire+0xd9/0x300
? console_flush_all+0x1f5/0x530
? lock_release+0x147/0x2c0
? console_flush_all+0x1f5/0x530
console_flush_all+0x212/0x530
? console_flush_all+0x1f5/0x530
console_unlock+0x56/0x160
vprintk_emit+0x160/0x390
_printk+0x64/0x80
? __pfx_ttm_bo_delayed_delete+0x10/0x10
? __pfx_qxl_gc_work+0x10/0x10
__warn_printk+0x8e/0x180
? __pfx_ttm_bo_delayed_delete+0x10/0x10
? __pfx_qxl_gc_work+0x10/0x10
? __pfx_qxl_gc_work+0x10/0x10
check_flush_dependency+0xfd/0x120
__flush_work+0xfa/0x440
qxl_queue_garbage_collect+0x83/0x90
qxl_fence_wait+0xa4/0x1a0
dma_fence_wait_timeout+0x98/0x1e0
dma_resv_wait_timeout+0x7f/0xe0
ttm_bo_delayed_delete+0x2b/0x90
process_one_work+0x228/0x740
worker_thread+0x1dc/0x3c0
? __pfx_worker_thread+0x10/0x10
kthread+0xf2/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
workqueue: WQ_MEM_RECLAIM ttm:ttm_bo_delayed_delete is flushing !WQ_MEM_RECLAIM events:qxl_gc_work
WARNING: CPU: 2 PID: 119 at kernel/workqueue.c:3728 check_flush_dependency+0xfd/0x120
Modules linked in:
CPU: 2 PID: 119 Comm: kworker/u24:3 Not tainted 6.9.0-rc1-test-00021-g07ed11afb68d #5
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Workqueue: ttm ttm_bo_delayed_delete
RIP: 0010:check_flush_dependency+0xfd/0x120
Code: 8b 45 18 48 8d b2 70 01 00 00 49 89 e8 48 8d 8b 70 01 00 00 48 c7 c7 60 46 7b 95 c6 05 48 67 d2 01 01 48 89 c2 e8 63 40 fd ff <0f> 0b e9 1e ff ff ff 80 3d 33 67 d2 01 00 75 93 e9 4a ff ff ff 66
RSP: 0000:ffffa01380d83c28 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff93c44004ee00 RCX: 0000000000000000
RDX: 0000000080000003 RSI: 00000000ffffefff RDI: 0000000000000001
RBP: ffffffff9497b100 R08: 0000000000000000 R09: 0000000000000003
R10: ffffa01380d83ab8 R11: ffffffff95b14828 R12: ffff93c443980000
R13: ffff93c440fbe300 R14: 0000000000000001 R15: ffff93c44004ee00
FS: 0000000000000000(0000) GS:ffff93c4bbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000007c864001 CR4: 0000000000170ef0
Call Trace:
<TASK>
? __warn+0x8c/0x180
? check_flush_dependency+0xfd/0x120
? report_bug+0x191/0x1c0
? prb_read_valid+0x1b/0x30
? handle_bug+0x3c/0x80
? exc_invalid_op+0x17/0x70
? asm_exc_invalid_op+0x1a/0x20
? __pfx_qxl_gc_work+0x10/0x10
? check_flush_dependency+0xfd/0x120
? check_flush_dependency+0xfd/0x120
__flush_work+0xfa/0x440
qxl_queue_garbage_collect+0x83/0x90
qxl_fence_wait+0xa4/0x1a0
dma_fence_wait_timeout+0x98/0x1e0
dma_resv_wait_timeout+0x7f/0xe0
ttm_bo_delayed_delete+0x2b/0x90
process_one_work+0x228/0x740
worker_thread+0x1dc/0x3c0
? __pfx_worker_thread+0x10/0x10
kthread+0xf2/0x120
? __pfx_kthread+0x10/0x10
ret_from_fork+0x34/0x50
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
irq event stamp: 58
hardirqs last enabled at (57): [<ffffffff93fede30>] queue_work_on+0x60/0x90
hardirqs last disabled at (58): [<ffffffff94ea7f66>] _raw_spin_lock_irq+0x56/0x60
softirqs last enabled at (0): [<ffffffff93fbae27>] copy_process+0xc07/0x2c60
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace 0000000000000000 ]---
So there's an issue with dma_fence and a workqueue.
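To make the rule behind that workqueue warning concrete, here is a reduced
sketch of the check that check_flush_dependency() in kernel/workqueue.c
performs. This is not the kernel's actual implementation and the function
name below is made up: the idea is that a WQ_MEM_RECLAIM workqueue has a
rescuer so it can make forward progress during memory reclaim, and so it
must never wait on ("flush") work from a workqueue that lacks
WQ_MEM_RECLAIM.

#include <linux/bug.h>
#include <linux/workqueue.h>

/*
 * Illustrative sketch only -- "flush_dependency_rule" is not a kernel
 * function.  The real check compares the flags of the workqueue the
 * current worker belongs to against the flags of the workqueue whose
 * work item is being flushed.
 */
static void flush_dependency_rule(unsigned int current_wq_flags,
                                  const char *current_wq_name,
                                  unsigned int target_wq_flags,
                                  const char *target_wq_name)
{
        WARN_ONCE((current_wq_flags & WQ_MEM_RECLAIM) &&
                  !(target_wq_flags & WQ_MEM_RECLAIM),
                  "workqueue: WQ_MEM_RECLAIM %s is flushing !WQ_MEM_RECLAIM %s",
                  current_wq_name, target_wq_name);
}

In this trace the ttm workqueue has WQ_MEM_RECLAIM set while the qxl gc
work sits on the ordinary system events workqueue, which is why the
warning fires, and it is the WARN's printk (issued with the worker pool
lock held) that then runs into the console_owner ordering above.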
-- Steve
On Thu, 2 May 2024 08:16:41 -0400
Steven Rostedt <[email protected]> wrote:
> I went to run my tests on my VMs and the tests hung on boot up.
> Unfortunately, the most I ever got out was:
>
> [ 93.607888] Testing event system initcall: OK
> [ 93.667730] Running tests on all trace events:
> [ 93.669757] Testing all events: OK
> [ 95.631064] ------------[ cut here ]------------
> Timed out after 60 seconds
>
> I ran a bisect and it came up with:
>
> # first bad commit: [07ed11afb68d94eadd4ffc082b97c2331307c5ea] Revert "drm/qxl: simplify qxl_fence_wait"
>
> I checked out 07ed11afb68d94eadd4ffc082b97c2331307c5ea~1 and it booted
> fine. Added back that commit, it failed to boot. I did this twice, and got
> the same results.
>
Note, I reverted the above revert from v6.9-rc6 and that lets my VM boot to
completion again.
Attached is the config I used.
-- Steve
Did anyone see this?
-- Steve
On Thu, 2 May 2024 08:16:41 -0400
Steven Rostedt <[email protected]> wrote:
> I went to run my tests on my VMs and the tests hung on boot up.
> Unfortunately, the most I ever got out was:
>
> [ 93.607888] Testing event system initcall: OK
> [ 93.667730] Running tests on all trace events:
> [ 93.669757] Testing all events: OK
> [ 95.631064] ------------[ cut here ]------------
> Timed out after 60 seconds
>
> I ran a bisect and it came up with:
>
> # first bad commit: [07ed11afb68d94eadd4ffc082b97c2331307c5ea] Revert "drm/qxl: simplify qxl_fence_wait"
>
> I checked out 07ed11afb68d94eadd4ffc082b97c2331307c5ea~1 and it booted
> fine. Added back that commit, it failed to boot. I did this twice, and got
> the same results.
>
> But the last time I ran it, it did trigger this:
>
> [ lockdep report snipped ]
>
> So there's an issue with dma_fence and a workqueue.
>
> -- Steve
>
Hi Steven,
It looks like qxl is not well maintained. Please send the revert, we'll
merge it.
Maxime
On Sat, May 04, 2024 at 04:39:57AM GMT, Steven Rostedt wrote:
>
> Did anyone see this?
>
> -- Steve
>
>
> On Thu, 2 May 2024 08:16:41 -0400
> Steven Rostedt <[email protected]> wrote:
>
> > I went to run my tests on my VMs and the tests hung on boot up.
> > Unfortunately, the most I ever got out was:
> >
> > [ 93.607888] Testing event system initcall: OK
> > [ 93.667730] Running tests on all trace events:
> > [ 93.669757] Testing all events: OK
> > [ 95.631064] ------------[ cut here ]------------
> > Timed out after 60 seconds
> >
> > I ran a bisect and it came up with:
> >
> > # first bad commit: [07ed11afb68d94eadd4ffc082b97c2331307c5ea] Revert "drm/qxl: simplify qxl_fence_wait"
> >
> > I checked out 07ed11afb68d94eadd4ffc082b97c2331307c5ea~1 and it booted
> > fine. Added back that commit, it failed to boot. I did this twice, and got
> > the same results.
> >
> > But the last time I ran it, it did trigger this:
> >
> > [ lockdep report snipped ]
> >
> > So there's an issue with dma_fence and a workqueue.
> >
> > -- Steve
> >
>
On Mon, 6 May 2024 at 05:46, Maxime Ripard <[email protected]> wrote:
>
> It looks like qxl is not well maintained. Please send the revert, we'll
> merge it.
I'll just do the revert and we don't have to do the round-trip
overhead (since we're already in rc7 and I hope to just do final next
weekend).
Linus
On Tue, May 7, 2024 at 6:29 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, 6 May 2024 at 05:46, Maxime Ripard <[email protected]> wrote:
> >
> > It looks like qxl is not well maintained. Please send the revert, we'll
> > merge it.
>
> I'll just do the revert and we don't have to do the round-trip
> overhead (since we're already in rc7 and I hope to just do final next
> weekend).
I expect this will reintroduce the other problems that caused this
change in the first place, but I think this should at least bring us
back to regression equilibrium. I can't recommend anyone use qxl hw
over virtio-gpu hw in their VMs, since virtio-gpu is actually hw
designed for virt.
Dave.
On Tue, 7 May 2024, David Airlie wrote:
> I expect this will reintroduce the other problems that caused this
> change in the first place, but I think this should at least bring us
> back to regression equilibrium. I can't recommend anyone use qxl hw
> over virtio-gpu hw in their VMs, since virtio-gpu is actually hw
> designed for virt.
I would love to switch to virtio. It works ok for local virtual machines,
but I have many users who are using Linux desktops hosted on a powerful
server, where it is more difficult. With spice and qxl, scrolling in a web
browser is smooth; with spice and virtio it seems like larger parts are
getting fully redrawn (and resent over the network?). Now that the new
GNOME is going to come with RDP support I'm also considering switching to
that. Any tips would be appreciated.
Here's a quick (1min42s) demo video that I recorded. spice+qxl on the
left, spice+virtio on the right. The VMs were created from the exact same
disk image template.
https://lindi.iki.fi/lindi/qxl/spice_qxl_vs_virtio.webm
-Timo
On Tue, 7 May 2024 15:54:21 +1000
David Airlie <[email protected]> wrote:
> I expect this will reintroduce the other problems that caused this
> change in the first place, but I think this should at least bring us
> back to regression equilibrium. I can't recommend anyone use qxl hw
> over virtio-gpu hw in their VMs, since virtio-gpu is actually hw
> designed for virt.
I agree that it will likely cause the issues that this was addressing
to come back, but you can't have a fix that introduces a deadlock.
Perhaps the deadlock didn't exist before, and the printk updates
changed the way it works to introduce it now?
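For reference, this is the shape of the problem as I read the splat, as a
rough sketch and not the actual qxl code (the struct and function names
below are made up): the restored fence-wait path flushes the gc work,
which lives on the ordinary system workqueue, while the wait itself runs
from TTM's WQ_MEM_RECLAIM delayed-delete worker. The resulting WARN printk
runs with the worker pool lock held and then needs console_owner, the
reverse of the ordering recorded earlier when the fbdev probe printed
under the console and queued cursor work.

#include <linux/container_of.h>
#include <linux/dma-fence.h>
#include <linux/workqueue.h>

/* Hypothetical wrapper fence; names are illustrative, not the driver's. */
struct sketch_fence {
        struct dma_fence base;
        struct work_struct *gc_work;    /* queued on system_wq (!WQ_MEM_RECLAIM) */
};

static signed long sketch_fence_wait(struct dma_fence *fence, bool intr,
                                     signed long timeout)
{
        struct sketch_fence *f = container_of(fence, struct sketch_fence, base);

        /*
         * Flushing a system_wq work item is fine on its own, but when
         * this wait is reached from ttm_bo_delayed_delete() -- a
         * WQ_MEM_RECLAIM workqueue -- check_flush_dependency() WARNs,
         * and that printk runs under the worker pool lock, closing the
         * pool->lock <-> console_owner cycle lockdep reported.
         */
        flush_work(f->gc_work);

        return dma_fence_default_wait(fence, intr, timeout);
}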
-- Steve
On Tue, May 07, 2024 at 09:38:56AM GMT, Timo Lindfors wrote:
> On Tue, 7 May 2024, David Airlie wrote:
> > I expect this will reintroduce the other problems that caused this
> > change in the first place, but I think this should at least bring us
> > back to regression equilibrium. I can't recommend anyone use qxl hw
> > over virtio-gpu hw in their VMs, since virtio-gpu is actually hw
> > designed for virt.
Well, qxl was designed for virt too, but almost two decades ago, around 2d
acceleration. And with a lot of complexity, trying to move 2d-accel
rendering from the VM guest to the spice client. That complexity still
bites us today, as this issue shows.
With desktop rendering moving to 3d acceleration, all that 2d accel stuff
became mostly useless, so you have the downsides of the complexity
without any upsides.
There have been plans to add 3d-acceleration support to qxl, but they
never took off, and with virtio-gpu having since taken over that role the
idea is dead now.
I second the recommendation to avoid qxl.
> I would love to switch to virtio. It works ok for local virtual machines but
> I have many users who are using Linux desktops hosted on a powerful server
> where it is more difficult. With spice and qxl scrolling in a web browser is
> smooth, with spice and virtio it seems like larger parts are getting fully
> redrawn (and resent over network?).
Hmm, I'd expect the behavior to be quite similar for stdvga, qxl and
virtio-gpu when running a wayland desktop remotely.
When running on X11 without 3d acceleration, it could be that scroll
ops are actually sent as 2d-accel blits on qxl, which would explain the
difference you are seeing.
> Now that new gnome is going to come with RDP support I'm also
> considering switching to that. Any tips would be appreciated.
Worth trying: use vnc. The qemu vnc server will skip updates when the
network pipeline is full. That throttles the frame rate, but also
reduces the display update latency.
In general the trend seems to be to move to in-guest remote desktop
solutions. Do NOT make VMs a special case with a special solution;
instead handle VMs like bare metal. Commercial solutions have existed for
a while, and some of them do hardware-assisted video encoding to send
the display as a video stream to the remote end. gnome+rdp is a new
player here, clearly worth a try.
take care,
Gerd
On Tue, 7 May 2024, Gerd Hoffmann wrote:
> Worth trying: use vnc. The qemu vnc server will skip updates when the
> network pipeline is full. That throttles the frame rate, but also
> reduces the display update latency.
With Debian 12 it seems that the virtio + vnc + firefox combination
results in a black firefox window. Adding MOZ_ENABLE_WAYLAND=1 makes
the browser visible but I still see the constant redraw behavior when scrolling.
The qxl + spice + firefox combination scrolls smoothly even if I force
firefox to use wayland. I suppose I could write a more synthetic test if
it would be useful? With systemtap I can trace what happens during a single
scroll event.
> In general the trend seems to be to move to in-guest remote desktop
> solutions. Do NOT make VMs a special case with a special solution,
> instead handle VMs like bare metal. Commercial solutions exist for a
> while, some of them are doing hardware-assisted video encoding to send
> the display as video stream to the remote end. gnome+rdp is a new
> player here, clearly worth a try.
Yes, this makes sense. I'm definitely not trying to stick to a particular
technology, but rather to find the best FOSS solution for the use case
of hosting Linux desktops on a server.
-Timo
On Tue, May 07, 2024 at 06:46:41PM GMT, Timo Lindfors wrote:
> The qxl + spice + firefox combination scrolls smoothly even if I force
> firefox to use wayland.
Scrolling via 2d-accel bitblits works only when running the xorg
xserver with the qxl driver; xwayland wouldn't do that. So that
doesn't explain the performance difference.
> I suppose I could write a more synthetic test if it would be useful?
Well, I don't work on qemu graphics any more, so I can't sink much time
into this, and my memories of the details are starting to fade ...
> With systemtap I can trace what happens during a single scroll event.
I had a brief look at the driver source code. One possibly important
difference is damage tracking.
qxl seems to send an update for each damaged rectangle to the host
(see qxl_draw_dirty_fb).
virtio-gpu instead calls drm_atomic_helper_damage_merged() to get a single
rectangle covering all regions that need updating and works with that.
That could very well explain the performance difference. If you want to
dig deeper I'd suggest starting there.
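In case it helps, here is a rough sketch of the two patterns being
contrasted, written against the generic DRM damage helpers. The
surrounding driver plumbing is omitted and sketch_plane_update() is a
made-up name, so treat this as an illustration rather than either
driver's actual code.

#include <drm/drm_damage_helper.h>
#include <drm/drm_plane.h>
#include <drm/drm_rect.h>

static void sketch_plane_update(struct drm_plane_state *old_state,
                                struct drm_plane_state *new_state)
{
        struct drm_atomic_helper_damage_iter iter;
        struct drm_rect clip, merged;

        /* Per-rectangle style: walk every damaged region and flush each. */
        drm_atomic_helper_damage_iter_init(&iter, old_state, new_state);
        drm_atomic_for_each_plane_damage(&iter, &clip) {
                /* send this clip rectangle to the host, one per region */
        }

        /* Merged style: collapse all damage into a single bounding rect. */
        if (drm_atomic_helper_damage_merged(old_state, new_state, &merged)) {
                /* send one rectangle covering everything that changed */
        }
}

Sending one rectangle per damaged region keeps scroll blits small, while
the merged rectangle can grow to cover most of the framebuffer, which
would match the full-redraw behavior you are seeing.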
HTH & take care,
Gerd
On 2024-05-07 11:03, Steven Rostedt wrote:
> On Tue, 7 May 2024 15:54:21 +1000
> David Airlie <[email protected]> wrote:
>
>> I expect this will reintroduce the other problems that caused this
>> change in the first place, but I think this should at least bring us
>> back to regression equilibrium. I can't recommend anyone use qxl hw
>> over virtio-gpu hw in their VMs, since virtio-gpu is actually hw
>> designed for virt.
>
> I agree that it will likely cause the issues that this was addressing
> to come back, but you can't have a fix that introduces a deadlock.
>
> Perhaps the deadlock didn't exist before, and the printk updates
> changed the way it works to introduce it now?
It deadlocks on 6.8.8 under qemu/kvm; reverting the patch fixes the problem there as well (6.8.7 hangs too, but I have not tested the revert on that version).
/Anders