2022-07-19 00:10:07

by Mikhail Gavrilov

Subject: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

Hi guys, I am continuing to test 5.19-rc7 and found a bug.
Running the "clinfo" command causes "BUG: kernel NULL pointer dereference,
address: 0000000000000008" in the amdgpu driver.

Here is the trace:
[ 1320.203332] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 1320.203338] #PF: supervisor read access in kernel mode
[ 1320.203340] #PF: error_code(0x0000) - not-present page
[ 1320.203341] PGD 0 P4D 0
[ 1320.203344] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 1320.203346] CPU: 5 PID: 1226 Comm: kworker/5:2 Tainted: G W L -------- --- 5.19.0-0.rc7.53.fc37.x86_64+debug #1
[ 1320.203348] Hardware name: System manufacturer System Product Name/ROG STRIX X570-I GAMING, BIOS 4403 04/27/2022
[ 1320.203350] Workqueue: events delayed_fput
[ 1320.203354] RIP: 0010:dma_resv_add_fence+0x5a/0x2d0
[ 1320.203358] Code: 85 c0 0f 84 43 02 00 00 8d 50 01 09 c2 0f 88 47 02 00 00 8b 15 73 10 99 01 49 8d 45 70 48 89 44 24 10 85 d2 0f 85 05 02 00 00 <49> 8b 44 24 08 48 3d 80 93 53 97 0f 84 06 01 00 00 48 3d 20 93 53
[ 1320.203360] RSP: 0018:ffffaf4cc1adfc68 EFLAGS: 00010246
[ 1320.203362] RAX: ffff976660408208 RBX: ffff975f545f2000 RCX: 0000000000000000
[ 1320.203363] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff976660408198
[ 1320.203364] RBP: ffff976806f6e800 R08: 0000000000000000 R09: 0000000000000000
[ 1320.203366] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 1320.203367] R13: ffff976660408198 R14: ffff975f545f2000 R15: ffff976660408198
[ 1320.203368] FS: 0000000000000000(0000) GS:ffff976de1200000(0000) knlGS:0000000000000000
[ 1320.203370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1320.203371] CR2: 0000000000000008 CR3: 00000007fb31c000 CR4: 0000000000350ee0
[ 1320.203372] Call Trace:
[ 1320.203374] <TASK>
[ 1320.203378] amdgpu_amdkfd_gpuvm_destroy_cb+0x5d/0x1e0 [amdgpu]
[ 1320.203516] amdgpu_vm_fini+0x2f/0x4e0 [amdgpu]
[ 1320.203625] ? mutex_destroy+0x21/0x50
[ 1320.203629] amdgpu_driver_postclose_kms+0x1da/0x2b0 [amdgpu]
[ 1320.203734] drm_file_free.part.0+0x20d/0x260
[ 1320.203738] drm_release+0x6a/0x120
[ 1320.203741] __fput+0xab/0x270
[ 1320.203743] delayed_fput+0x1f/0x30
[ 1320.203745] process_one_work+0x2a0/0x600
[ 1320.203749] worker_thread+0x4f/0x3a0
[ 1320.203751] ? process_one_work+0x600/0x600
[ 1320.203753] kthread+0xf5/0x120
[ 1320.203755] ? kthread_complete_and_exit+0x20/0x20
[ 1320.203758] ret_from_fork+0x22/0x30
[ 1320.203764] </TASK>
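
As an aid to reading the trace above, here is a minimal Python sketch that cross-checks the faulting instruction against the register dump. It is not part of the original report. The helper names (`decode_mov_load`, `oops_reg_name`, `analyze`) are made up for illustration, and the decoder deliberately handles only the single encoding found at the marked byte (`REX.W 8B /r` with an 8-bit displacement), so it is not a general disassembler. On these bytes it decodes a load through R12 at displacement 8; with R12 equal to zero that gives address 0x8, which matches CR2. Which structure field sits at offset 8 depends on the kernel build and is not derived here.

```python
import re

# Lines copied from the oops above (the wrapped "Code:" line is rejoined).
OOPS = """\
[ 1320.203332] BUG: kernel NULL pointer dereference, address: 0000000000000008
[ 1320.203354] RIP: 0010:dma_resv_add_fence+0x5a/0x2d0
[ 1320.203358] Code: 85 c0 0f 84 43 02 00 00 8d 50 01 09 c2 0f 88 47 02 00 00 8b 15 73 10 99 01 49 8d 45 70 48 89 44 24 10 85 d2 0f 85 05 02 00 00 <49> 8b 44 24 08 48 3d 80 93 53 97 0f 84 06 01 00 00 48 3d 20 93 53
[ 1320.203362] RAX: ffff976660408208 RBX: ffff975f545f2000 RCX: 0000000000000000
[ 1320.203366] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[ 1320.203371] CR2: 0000000000000008 CR3: 00000007fb31c000 CR4: 0000000000350ee0
"""

# x86-64 general-purpose registers in encoding order.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_mov_load(b):
    """Decode only `REX.W 8B /r` with mod=01 (disp8), no index register.

    Returns (destination register, base register, signed displacement).
    """
    rex, opcode, modrm = b[0], b[1], b[2]
    if (rex & 0xF8) != 0x48 or opcode != 0x8B or (modrm >> 6) != 0b01:
        raise ValueError("encoding not handled by this sketch")
    dest = ((modrm >> 3) & 7) | ((rex & 0x4) << 1)  # REX.R extends the reg field
    rm = modrm & 7
    if rm == 4:  # r/m == 100b means a SIB byte follows
        sib = b[3]
        if ((sib >> 3) & 7) != 4 or (rex & 0x2):
            raise ValueError("indexed addressing not handled by this sketch")
        base, disp = (sib & 7) | ((rex & 0x1) << 3), b[4]  # REX.B extends base
    else:
        base, disp = rm | ((rex & 0x1) << 3), b[3]
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REG64[dest], REG64[base], disp


def oops_reg_name(reg):
    """Map e.g. 'r12' -> 'R12' and 'r8' -> 'R08', as printed in the oops."""
    return reg.upper() if len(reg) == 3 else "R0" + reg[1]


def analyze(text):
    # Collect every "NAME: <16 hex digits>" pair from the register dump.
    regs = {name: int(value, 16) for name, value in
            re.findall(r"\b(CR\d|R[A-Z0-9]{2}):\s+([0-9a-f]{16})\b", text)}
    # The byte wrapped in <..> is the first byte of the faulting instruction.
    code = re.search(r"Code: (.+)", text).group(1).split()
    start = next(i for i, tok in enumerate(code) if tok.startswith("<"))
    insn_bytes = [int(tok.strip("<>"), 16) for tok in code[start:start + 5]]
    dest, base, disp = decode_mov_load(insn_bytes)
    return {
        "insn": (dest, base, disp),
        "effective_address": regs[oops_reg_name(base)] + disp,
        "cr2": regs["CR2"],
    }


info = analyze(OOPS)
print("mov %s, [%s%+d]" % info["insn"])
print("effective address:", hex(info["effective_address"]))
print("matches CR2:", info["effective_address"] == info["cr2"])
```

For a full disassembly of the "Code:" line, the kernel tree ships `scripts/decodecode`, which is the more usual tool for this job.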

Full kernel log is here:
https://pastebin.com/EeKh2LEr

About an hour later, after many "BUG: workqueue lockup" messages, the GPU
hung completely.

I will be glad to test patches that fix this bug.

--
Best Regards,
Mike Gavrilov.


2022-07-19 02:42:07

by Chen, Guchun

Subject: RE: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

Patch https://patchwork.freedesktop.org/series/106024/ should fix this.

Regards,
Guchun


2022-07-19 08:47:10

by Mike Lothian

Subject: Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

I was told that this patch replaces the patch you mentioned:
https://patchwork.freedesktop.org/series/106078/
It is the one that will hopefully land in Linus's tree.

On Tue, 19 Jul 2022 at 03:33, Chen, Guchun <[email protected]> wrote:
>
> Patch https://patchwork.freedesktop.org/series/106024/ should fix this.
>
> Regards,
> Guchun

2022-07-19 12:25:29

by Mikhail Gavrilov

Subject: Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

On Tue, Jul 19, 2022 at 1:40 PM Mike Lothian <[email protected]> wrote:
>
> I was told that this patch replaces the patch you mentioned
> https://patchwork.freedesktop.org/series/106078/ and it is the one
> that'll hopefully land in Linus's tree
>

Great, I can confirm that both patches fix the issue.
As I understand it, the second patch [1] is the more correct one and
should be merged into 5.19 soon, right?

Since we are talking about clinfo, I have a question.
Has anyone else encountered a problem where, on configurations with two
GPUs, clinfo hangs in a loop and fully occupies one CPU core?
In my case, one GPU is integrated in the Renoir processor, and the other
is a discrete AMD Radeon 6800M. The BIOS has no option to turn off the
integrated GPU, so there is no way to test this configuration with each
GPU separately. There are no errors in the kernel log, so it is most
likely a user-space issue, but I am not sure about it.

The clinfo backtrace is here [2].

[1] https://patchwork.freedesktop.org/series/106078/
[2] https://pastebin.com/wv5iGibi

--
Best Regards,
Mike Gavrilov.

2022-07-19 17:20:16

by Mikhail Gavrilov

Subject: Re: Command "clinfo" causes BUG: kernel NULL pointer dereference, address: 0000000000000008 on driver amdgpu

On Tue, Jul 19, 2022 at 4:26 PM Mikhail Gavrilov
<[email protected]> wrote:
> In the kernel log there is no error so it is most likely a user space issue, but I am not
> sure about it.

However, I am confused by these messages in the kernel log:
[ 1962.000909] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1962.000912] amdgpu: Failed to evict process queues
[ 1962.000918] amdgpu: Failed to quiesce KFD
[ 1966.010395] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[ 1966.010406] amdgpu: Resetting wave fronts (cpsch) on dev 00000000b40e7982


--
Best Regards,
Mike Gavrilov.