LinuxLists.cc - Yet another RX Vega hang with another kernel panic signature. WARNING: inconsistent lock state

2019-01-31 03:37:33

Subject: Yet another RX Vega hang with another kernel panic signature. WARNING: inconsistent lock state

Hi folks.
Yet another kernel panic occurred while the GPU was hung again:

[ 1469.906798] ================================
[ 1469.906799] WARNING: inconsistent lock state
[ 1469.906801] 5.0.0-0.rc4.git2.2.fc30.x86_64 #1 Tainted: G C
[ 1469.906802] --------------------------------
[ 1469.906804] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[ 1469.906806] kworker/12:3/681 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 1469.906807] 00000000d591b82b (&(&adev->vm_manager.pasid_lock)->rlock){?...}, at: amdgpu_vm_get_task_info+0x23/0x80 [amdgpu]
[ 1469.906851] {IN-HARDIRQ-W} state was registered at:
[ 1469.906855] _raw_spin_lock+0x31/0x80
[ 1469.906893] amdgpu_vm_get_task_info+0x23/0x80 [amdgpu]
[ 1469.906936] gmc_v9_0_process_interrupt+0x198/0x2b0 [amdgpu]
[ 1469.906978] amdgpu_irq_dispatch+0x90/0x1f0 [amdgpu]
[ 1469.907018] amdgpu_irq_callback+0x4a/0x70 [amdgpu]
[ 1469.907061] amdgpu_ih_process+0x89/0x100 [amdgpu]
[ 1469.907103] amdgpu_irq_handler+0x22/0x50 [amdgpu]
[ 1469.907106] __handle_irq_event_percpu+0x3f/0x290
[ 1469.907108] handle_irq_event_percpu+0x31/0x80
[ 1469.907109] handle_irq_event+0x34/0x51
[ 1469.907111] handle_edge_irq+0x7c/0x1a0
[ 1469.907114] handle_irq+0xbf/0x100
[ 1469.907116] do_IRQ+0x61/0x120
[ 1469.907118] ret_from_intr+0x0/0x22
[ 1469.907121] cpuidle_enter_state+0xbf/0x470
[ 1469.907123] do_idle+0x1ec/0x280
[ 1469.907125] cpu_startup_entry+0x19/0x20
[ 1469.907127] start_secondary+0x1b3/0x200
[ 1469.907129] secondary_startup_64+0xa4/0xb0
[ 1469.907131] irq event stamp: 5546749
[ 1469.907133] hardirqs last enabled at (5546749): [<ffffffff9719112a>] ktime_get+0xfa/0x130
[ 1469.907135] hardirqs last disabled at (5546748): [<ffffffff9719105b>] ktime_get+0x2b/0x130
[ 1469.907137] softirqs last enabled at (5498318): [<ffffffff97e0035f>] __do_softirq+0x35f/0x46a
[ 1469.907140] softirqs last disabled at (5497393): [<ffffffff970ee119>] irq_exit+0x119/0x120
[ 1469.907141]
other info that might help us debug this:
[ 1469.907142] Possible unsafe locking scenario:

[ 1469.907143] CPU0
[ 1469.907144] ----
[ 1469.907144] lock(&(&adev->vm_manager.pasid_lock)->rlock);
[ 1469.907146] <Interrupt>
[ 1469.907147] lock(&(&adev->vm_manager.pasid_lock)->rlock);
[ 1469.907148]
*** DEADLOCK ***

[ 1469.907150] 2 locks held by kworker/12:3/681:
[ 1469.907152] #0: 00000000953235a7 ((wq_completion)"events"){+.+.}, at: process_one_work+0x1e9/0x5d0
[ 1469.907157] #1: 0000000071a3d218 ((work_completion)(&(&sched->work_tdr)->work)){+.+.}, at: process_one_work+0x1e9/0x5d0
[ 1469.907160]
stack backtrace:
[ 1469.907163] CPU: 12 PID: 681 Comm: kworker/12:3 Tainted: G C 5.0.0-0.rc4.git2.2.fc30.x86_64 #1
[ 1469.907165] Hardware name: System manufacturer System Product Name/ROG STRIX X470-I GAMING, BIOS 1103 11/16/2018
[ 1469.907169] Workqueue: events drm_sched_job_timedout [gpu_sched]
[ 1469.907171] Call Trace:
[ 1469.907176] dump_stack+0x85/0xc0
[ 1469.907180] print_usage_bug.cold+0x1ae/0x1e8
[ 1469.907183] ? print_shortest_lock_dependencies+0x40/0x40
[ 1469.907185] mark_lock+0x50a/0x600
[ 1469.907186] ? print_shortest_lock_dependencies+0x40/0x40
[ 1469.907189] __lock_acquire+0x544/0x1660
[ 1469.907191] ? mark_held_locks+0x57/0x80
[ 1469.907193] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1469.907195] ? lockdep_hardirqs_on+0xed/0x180
[ 1469.907197] ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1469.907200] ? retint_kernel+0x10/0x10
[ 1469.907202] lock_acquire+0xa2/0x1b0
[ 1469.907242] ? amdgpu_vm_get_task_info+0x23/0x80 [amdgpu]
[ 1469.907245] _raw_spin_lock+0x31/0x80
[ 1469.907283] ? amdgpu_vm_get_task_info+0x23/0x80 [amdgpu]
[ 1469.907323] amdgpu_vm_get_task_info+0x23/0x80 [amdgpu]
[ 1469.907324] ------------[ cut here ]------------

My kernel commit is: 62967898789d

--
Best Regards,
Mike Gavrilov.

Attachments:

dmesg.txt (159.52 kB)

2019-01-31 14:52:41

by Yang, Philip

Subject: Re: Yet another RX Vega hang with another kernel panic signature. WARNING: inconsistent lock state

I found the same issue while debugging; I will submit a patch to fix this shortly.

Philip

On 2019-01-30 10:35 p.m., Mikhail Gavrilov wrote:
> [...]

2019-02-12 12:06:05

by Mikhail Gavrilov

Subject: Re: Yet another RX Vega hang with another kernel panic signature. WARNING: inconsistent lock state

On Thu, 31 Jan 2019 at 19:22, Yang, Philip wrote:
>
> I found the same issue while debugging; I will submit a patch to fix this shortly.
>
> Philip
>

Philip, I tested the 5.0.0-rc6 kernel and the inconsistent lock state
warning no longer appears.
But GPU reset is still not working, and some threads eventually get
stuck again.

Best Regards,
Mike Gavrilov.

Attachments:

dmesg.txt (161.52 kB)