2023-04-11 17:42:41

by Mikhail Gavrilov

[permalink] [raw]
Subject: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

Hi,
KASAN keeps finding problems in the drm_sched_job_cleanup code on 6.3-rc6.
I have not received any feedback in the thread
https://lore.kernel.org/lkml/CABXGCsMVUB2RA4D+k5CnA0_2521TOX++D4NmOukKi4X2-Q_RfQ@mail.gmail.com/
so I decided to start a separate thread. Since the problems are
different, the symptoms are also different.

Reproduction scenario.
After launching one of the listed games:
- Cyberpunk 2077
- Forza Horizon 4
- Forza Horizon 5
- Sackboy: A Big Adventure

At first, after some time (possibly after several attempts), a bug
report from KASAN appears:
==================================================================
BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
Read of size 4 at addr 0000000000000078 by task ForzaHorizon4.e/31587

CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: G W L
------- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
Call Trace:
<TASK>
dump_stack_lvl+0x72/0xc0
kasan_report+0xa4/0xe0
? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
kasan_check_range+0x104/0x1b0
drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
? slab_free_freelist_hook+0x11e/0x1d0
? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
amdgpu_job_free+0x40/0x1b0 [amdgpu]
amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
? __kmem_cache_free+0xbc/0x2e0
? mark_lock+0x101/0x16e0
? __lock_acquire+0xe54/0x59f0
? kasan_save_stack+0x3f/0x50
? __pfx_lock_release+0x10/0x10
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
drm_ioctl_kernel+0x1f8/0x3d0
? __pfx_drm_ioctl_kernel+0x10/0x10
drm_ioctl+0x4c1/0xaa0
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
? __pfx_drm_ioctl+0x10/0x10
? _raw_spin_unlock_irqrestore+0x62/0x80
? lockdep_hardirqs_on+0x7d/0x100
? _raw_spin_unlock_irqrestore+0x4b/0x80
amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
__x64_sys_ioctl+0x12d/0x1a0
do_syscall_64+0x5c/0x90
? do_syscall_64+0x68/0x90
? lockdep_hardirqs_on+0x7d/0x100
? do_syscall_64+0x68/0x90
? do_syscall_64+0x68/0x90
? lockdep_hardirqs_on+0x7d/0x100
? do_syscall_64+0x68/0x90
? asm_exc_page_fault+0x22/0x30
? lockdep_hardirqs_on+0x7d/0x100
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fb8a270881d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00000000467ad060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000467ad358 RCX: 00007fb8a270881d
RDX: 00000000467ad140 RSI: 00000000c0186444 RDI: 000000000000005a
RBP: 00000000467ad0b0 R08: 00007fb7f00d3eb0 R09: 00000000467ad100
R10: 00007fb88c68fb20 R11: 0000000000000246 R12: 00000000467ad140
R13: 00000000c0186444 R14: 000000000000005a R15: 00007fb7f00d3e50
</TASK>
==================================================================

Eventually the games listed above stop working; they get stuck after
a kernel warning:
general protection fault, probably for non-canonical address
0xdffffc000000000f: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f]
CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: G B W L
------- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched]
Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8
ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6
14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a
RSP: 0018:ffffc9003676f5a8 EFLAGS: 00010216
RAX: dffffc0000000000 RBX: ffff88816f81f020 RCX: 0000000000000001
RDX: 000000000000000f RSI: 0000000000000008 RDI: ffffffff9053e5e0
RBP: ffff88816f81f000 R08: 0000000000000001 R09: ffffffff9053e5e7
R10: fffffbfff20a7cbc R11: 6e696c6261736944 R12: 0000000000000078
R13: 1ffff92006cedeb5 R14: 0000000000000000 R15: ffffc9003676f870
FS: 000000004680f6c0(0000) GS:ffff888fa5c00000(0000) knlGS:0000000029910000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb854d6f010 CR3: 000000017b2d6000 CR4: 0000000000350ee0
Call Trace:
<TASK>
? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
? slab_free_freelist_hook+0x11e/0x1d0
? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
amdgpu_job_free+0x40/0x1b0 [amdgpu]
amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
? __kmem_cache_free+0xbc/0x2e0
? mark_lock+0x101/0x16e0
? __lock_acquire+0xe54/0x59f0
? kasan_save_stack+0x3f/0x50
? __pfx_lock_release+0x10/0x10
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
drm_ioctl_kernel+0x1f8/0x3d0
? __pfx_drm_ioctl_kernel+0x10/0x10
drm_ioctl+0x4c1/0xaa0
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
? __pfx_drm_ioctl+0x10/0x10
? _raw_spin_unlock_irqrestore+0x62/0x80
? lockdep_hardirqs_on+0x7d/0x100
? _raw_spin_unlock_irqrestore+0x4b/0x80
amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
__x64_sys_ioctl+0x12d/0x1a0
do_syscall_64+0x5c/0x90
? do_syscall_64+0x68/0x90
? lockdep_hardirqs_on+0x7d/0x100
? do_syscall_64+0x68/0x90
? do_syscall_64+0x68/0x90
? lockdep_hardirqs_on+0x7d/0x100
? do_syscall_64+0x68/0x90
? asm_exc_page_fault+0x22/0x30
? lockdep_hardirqs_on+0x7d/0x100
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fb8a270881d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00000000467ad060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000467ad358 RCX: 00007fb8a270881d
RDX: 00000000467ad140 RSI: 00000000c0186444 RDI: 000000000000005a
RBP: 00000000467ad0b0 R08: 00007fb7f00d3eb0 R09: 00000000467ad100
R10: 00007fb88c68fb20 R11: 0000000000000246 R12: 00000000467ad140
R13: 00000000c0186444 R14: 000000000000005a R15: 00007fb7f00d3e50
</TASK>
Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer
nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
nf_defrag_ipv6 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc
mt76x2u mt76x2_common mt76x02_usb iwlmvm snd_hda_codec_realtek
mt76_usb intel_rapl_msr snd_hda_codec_generic snd_hda_codec_hdmi
intel_rapl_common mt76x02_lib mt76 snd_hda_intel edac_mce_amd
snd_intel_dspcfg cpi snd_usb_audio snd_hda_codec mac80211 kvm_amd
snd_usbmidi_lib btusb snd_hda_core snd_rawmidi snd_hwdep mc btrtl kvm
btbcm btintel snd_seq libarc4 iwlwifi btmtk snd_seq_device vfat
eeepc_wmi fat bluetooth asus_ec_sensors snd_pcm asus_wmi irqbypass
ledtrig_audio _keymap snd_timer xpad platform_profile wmi_bmof
ff_memless rapl joydev pcspkr snd k10temp i2c_piix4 soundcore rfkill
acpi_cpufreq loop zram amdgpu drm_ttm_helper ttm video iommu_v2
drm_buddy gpu_sched drm_display_helper crct10dif_pclmul ucsi_ccg
crc32_pclmul crc32c_intel typec_ucsi polyval_clmulni polyval_generic
typec ghash_clmulni_intel cec ccp sha512_ssse3 sp5100_tco igb nvme
nvme_core dca i2c_algo_bit nvme_common wmi ip6_tables ip_tables
---[ end trace 0000000000000000 ]---
RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched]
Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8
ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6
14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a
RSP: 0018:ffffc9003676f5a8 EFLAGS: 00010216
RAX: dffffc0000000000 RBX: ffff88816f81f020 RCX: 0000000000000001
RDX: 000000000000000f RSI: 0000000000000008 RDI: ffffffff9053e5e0
RBP: ffff88816f81f000 R08: 0000000000000001 R09: ffffffff9053e5e7
R10: fffffbfff20a7cbc R11: 6e696c6261736944 R12: 0000000000000078
R13: 1ffff92006cedeb5 R14: 0000000000000000 R15: ffffc9003676f870
FS: 000000004680f6c0(0000) GS:ffff888fa5c00000(0000) knlGS:0000000029910000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb854d6f010 CR3: 000000017b2d6000 CR4: 0000000000350ee0

Demonstration:
https://youtu.be/ysRc4TXuBQI

I would be happy to help test patches that fix this.

I have attached the full kernel logs here.

--
Best Regards,
Mike Gavrilov.


Attachments:
BUG-KASAN-null-ptr-deref-in-drm_sched_job_cleanup+0x96.tar.xz (38.58 kB)
BUG-KASAN-null-ptr-deref-in-drm_sched_job_cleanup+0x96-2.tar.xz (35.43 kB)
BUG-KASAN-null-ptr-deref-in-drm_sched_job_cleanup+0x96-3.tar.xz (36.36 kB)

2023-04-14 15:13:51

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Tue, Apr 11, 2023 at 10:40 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> Hi,
> KASAN continues to find problems in the drm_sched_job_cleanup code at 6.3rc6.
> I not got any feedback in the thread
> https://lore.kernel.org/lkml/CABXGCsMVUB2RA4D+k5CnA0_2521TOX++D4NmOukKi4X2-Q_RfQ@mail.gmail.com/
> Therefore, I decided to start a separate thread. Since the problems
> are different, the symptoms are also different.
>
> Reproduction scenario.
> After launching one of the listed games:
> - Cyberpunk 2077
> - Forza Horizon 4
> - Forza Horizon 5
> - Sackboy: A Big Adventure
>
> Firstly after some time (may be after several attempts) appears bug
> message from KASAN:
> ==================================================================
> BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
> Read of size 4 at addr 0000000000000078 by task ForzaHorizon4.e/31587
>
> CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: G W L
> ------- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 4601 02/02/2023
> Call Trace:
> <TASK>
> dump_stack_lvl+0x72/0xc0
> kasan_report+0xa4/0xe0
> ? drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
> kasan_check_range+0x104/0x1b0
> drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
> ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
> ? slab_free_freelist_hook+0x11e/0x1d0
> ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
> amdgpu_job_free+0x40/0x1b0 [amdgpu]
> amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
> ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
> amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> ? __kmem_cache_free+0xbc/0x2e0
> ? mark_lock+0x101/0x16e0
> ? __lock_acquire+0xe54/0x59f0
> ? kasan_save_stack+0x3f/0x50
> ? __pfx_lock_release+0x10/0x10
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> drm_ioctl_kernel+0x1f8/0x3d0
> ? __pfx_drm_ioctl_kernel+0x10/0x10
> drm_ioctl+0x4c1/0xaa0
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> ? __pfx_drm_ioctl+0x10/0x10
> ? _raw_spin_unlock_irqrestore+0x62/0x80
> ? lockdep_hardirqs_on+0x7d/0x100
> ? _raw_spin_unlock_irqrestore+0x4b/0x80
> amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
> __x64_sys_ioctl+0x12d/0x1a0
> do_syscall_64+0x5c/0x90
> ? do_syscall_64+0x68/0x90
> ? lockdep_hardirqs_on+0x7d/0x100
> ? do_syscall_64+0x68/0x90
> ? do_syscall_64+0x68/0x90
> ? lockdep_hardirqs_on+0x7d/0x100
> ? do_syscall_64+0x68/0x90
> ? asm_exc_page_fault+0x22/0x30
> ? lockdep_hardirqs_on+0x7d/0x100
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fb8a270881d
> Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
> 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
> 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> RSP: 002b:00000000467ad060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00000000467ad358 RCX: 00007fb8a270881d
> RDX: 00000000467ad140 RSI: 00000000c0186444 RDI: 000000000000005a
> RBP: 00000000467ad0b0 R08: 00007fb7f00d3eb0 R09: 00000000467ad100
> R10: 00007fb88c68fb20 R11: 0000000000000246 R12: 00000000467ad140
> R13: 00000000c0186444 R14: 000000000000005a R15: 00007fb7f00d3e50
> </TASK>
> ==================================================================
>
> Finally it ends up with the games listed above stopping working they
> stuck after a kernel warning:
> general protection fault, probably for non-canonical address
> 0xdffffc000000000f: 0000 [#1] PREEMPT SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000078-0x000000000000007f]
> CPU: 15 PID: 31587 Comm: ForzaHorizon4.e Tainted: G B W L
> ------- --- 6.3.0-0.rc6.49.fc39.x86_64+debug #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 4601 02/02/2023
> RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched]
> Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8
> ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6
> 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a
> RSP: 0018:ffffc9003676f5a8 EFLAGS: 00010216
> RAX: dffffc0000000000 RBX: ffff88816f81f020 RCX: 0000000000000001
> RDX: 000000000000000f RSI: 0000000000000008 RDI: ffffffff9053e5e0
> RBP: ffff88816f81f000 R08: 0000000000000001 R09: ffffffff9053e5e7
> R10: fffffbfff20a7cbc R11: 6e696c6261736944 R12: 0000000000000078
> R13: 1ffff92006cedeb5 R14: 0000000000000000 R15: ffffc9003676f870
> FS: 000000004680f6c0(0000) GS:ffff888fa5c00000(0000) knlGS:0000000029910000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fb854d6f010 CR3: 000000017b2d6000 CR4: 0000000000350ee0
> Call Trace:
> <TASK>
> ? __pfx_drm_sched_job_cleanup+0x10/0x10 [gpu_sched]
> ? slab_free_freelist_hook+0x11e/0x1d0
> ? amdgpu_cs_parser_fini+0x363/0x5a0 [amdgpu]
> amdgpu_job_free+0x40/0x1b0 [amdgpu]
> amdgpu_cs_parser_fini+0x3c9/0x5a0 [amdgpu]
> ? __pfx_amdgpu_cs_parser_fini+0x10/0x10 [amdgpu]
> amdgpu_cs_ioctl+0x3d9/0x5630 [amdgpu]
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> ? __kmem_cache_free+0xbc/0x2e0
> ? mark_lock+0x101/0x16e0
> ? __lock_acquire+0xe54/0x59f0
> ? kasan_save_stack+0x3f/0x50
> ? __pfx_lock_release+0x10/0x10
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> drm_ioctl_kernel+0x1f8/0x3d0
> ? __pfx_drm_ioctl_kernel+0x10/0x10
> drm_ioctl+0x4c1/0xaa0
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> ? __pfx_drm_ioctl+0x10/0x10
> ? _raw_spin_unlock_irqrestore+0x62/0x80
> ? lockdep_hardirqs_on+0x7d/0x100
> ? _raw_spin_unlock_irqrestore+0x4b/0x80
> amdgpu_drm_ioctl+0xce/0x1b0 [amdgpu]
> __x64_sys_ioctl+0x12d/0x1a0
> do_syscall_64+0x5c/0x90
> ? do_syscall_64+0x68/0x90
> ? lockdep_hardirqs_on+0x7d/0x100
> ? do_syscall_64+0x68/0x90
> ? do_syscall_64+0x68/0x90
> ? lockdep_hardirqs_on+0x7d/0x100
> ? do_syscall_64+0x68/0x90
> ? asm_exc_page_fault+0x22/0x30
> ? lockdep_hardirqs_on+0x7d/0x100
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fb8a270881d
> Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
> 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
> 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> RSP: 002b:00000000467ad060 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00000000467ad358 RCX: 00007fb8a270881d
> RDX: 00000000467ad140 RSI: 00000000c0186444 RDI: 000000000000005a
> RBP: 00000000467ad0b0 R08: 00007fb7f00d3eb0 R09: 00000000467ad100
> R10: 00007fb88c68fb20 R11: 0000000000000246 R12: 00000000467ad140
> R13: 00000000c0186444 R14: 000000000000005a R15: 00007fb7f00d3e50
> </TASK>
> Modules linked in: uinput rfcomm snd_seq_dummy snd_hrtimer
> nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
> nf_defrag_ipv6 ip_set nf_tables nfnetlink qrtr bnep sunrpc binfmt_misc
> mt76x2u mt76x2_common mt76x02_usb iwlmvm snd_hda_codec_realtek
> mt76_usb intel_rapl_msr snd_hda_codec_generic snd_hda_codec_hdmi
> intel_rapl_common mt76x02_lib mt76 snd_hda_intel edac_mce_amd
> snd_intel_dspcfg cpi snd_usb_audio snd_hda_codec mac80211 kvm_amd
> snd_usbmidi_lib btusb snd_hda_core snd_rawmidi snd_hwdep mc btrtl kvm
> btbcm btintel snd_seq libarc4 iwlwifi btmtk snd_seq_device vfat
> eeepc_wmi fat bluetooth asus_ec_sensors snd_pcm asus_wmi irqbypass
> ledtrig_audio _keymap snd_timer xpad platform_profile wmi_bmof
> ff_memless rapl joydev pcspkr snd k10temp i2c_piix4 soundcore rfkill
> acpi_cpufreq loop zram amdgpu drm_ttm_helper ttm video iommu_v2
> drm_buddy gpu_sched drm_display_helper crct10dif_pclmul ucsi_ccg
> crc32_pclmul crc32c_intel typec_ucsi polyval_clmulni polyval_generic
> typec ghash_clmulni_intel cec ccp sha512_ssse3 sp5100_tco igb nvme
> nvme_core dca i2c_algo_bit nvme_common wmi ip6_tables ip_tables
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:drm_sched_job_cleanup+0xa7/0x290 [gpu_sched]
> Code: d6 01 00 00 4c 8b 75 20 be 04 00 00 00 4d 8d 66 78 4c 89 e7 e8
> ba 4d 4e c9 4c 89 e2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6
> 14 02 4c 89 e0 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 8a
> RSP: 0018:ffffc9003676f5a8 EFLAGS: 00010216
> RAX: dffffc0000000000 RBX: ffff88816f81f020 RCX: 0000000000000001
> RDX: 000000000000000f RSI: 0000000000000008 RDI: ffffffff9053e5e0
> RBP: ffff88816f81f000 R08: 0000000000000001 R09: ffffffff9053e5e7
> R10: fffffbfff20a7cbc R11: 6e696c6261736944 R12: 0000000000000078
> R13: 1ffff92006cedeb5 R14: 0000000000000000 R15: ffffc9003676f870
> FS: 000000004680f6c0(0000) GS:ffff888fa5c00000(0000) knlGS:0000000029910000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fb854d6f010 CR3: 000000017b2d6000 CR4: 0000000000350ee0
>
> Demonstration:
> https://youtu.be/ysRc4TXuBQI
>
> I would be happy to join in testing patches that would fix this.
>
> I attached a full kernel log here.
>

I think this is what the problem that KASAN found looks like when the
kernel is built without KASAN:

BUG: kernel NULL pointer dereference, address: 0000000000000078
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD f975b1067 P4D f975b1067 PUD e3bdba067 PMD f94134067 PTE 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 31 PID: 40791 Comm: ForzaHorizon4.e Tainted: G L
------- --- 6.3.0-0.rc6.20230413gitde4664485abb.52.fc39.x86_64 #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
RIP: 0010:drm_sched_job_cleanup+0x2a/0x130 [gpu_sched]
Code: 0f 1f 44 00 00 55 53 48 89 fb 48 83 ec 10 48 8b 7f 20 65 48 8b
04 25 28 00 00 00 48 89 44 24 08 31 c0 48 c7 04 24 00 00 00 00 <8b> 47
78 85 c0 0f 84 c2 00 00 00 48 83 ff c0 74 1f 48 8d 57 78 b8
RSP: 0018:ffffa69d5d33fa10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a617d87c000 RCX: 00000000b93d601f
RDX: 00000000b93d401f RSI: ad0811cd15498925 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff8a55c4986018 R09: 0000000080080000
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000ffffffff
R13: 0000000000000018 R14: 0000000000000000 R15: ffffa69d5d33faf8
FS: 0000000048b6f6c0(0000) GS:ffff8a64aa9c0000(0000) knlGS:000000003bc40000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000078 CR3: 000000015a164000 CR4: 0000000000350ee0
Call Trace:
<TASK>
amdgpu_job_free+0x15/0xc0 [amdgpu]
amdgpu_cs_parser_fini+0x137/0x1a0 [amdgpu]
amdgpu_cs_ioctl+0x176/0x2140 [amdgpu]
? kmem_cache_alloc+0xf1/0x310
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
drm_ioctl_kernel+0xc9/0x170
drm_ioctl+0x269/0x4a0
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
amdgpu_drm_ioctl+0x4a/0x80 [amdgpu]
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x5c/0x90
? __x64_sys_ioctl+0xa8/0xd0
? syscall_exit_to_user_mode+0x17/0x40
? do_syscall_64+0x68/0x90
? exc_page_fault+0x78/0x180
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fe76290881d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:0000000048b6c220 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000048b6c418 RCX: 00007fe76290881d
RDX: 0000000048b6c300 RSI: 00000000c0186444 RDI: 0000000000000059
RBP: 0000000048b6c270 R08: 00007fe6a80bedc0 R09: 0000000048b6c2c0
R10: 00007fe74c678770 R11: 0000000000000246 R12: 0000000048b6c300
R13: 00000000c0186444 R14: 0000000000000059 R15: 0000000000000001
</TASK>
Modules linked in: overlay tun uinput rfcomm snd_seq_dummy snd_hrtimer
nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nft_reject nf_reject_ipv6 nft_ct nft_chain_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr bnep
sunrpc binfmt_misc snd_hda_codec_realtek snd_hda_codec_generic
intel_rapl_msr snd_hda_codec_hdmi intel_rapl_common mt76x2u rapl
mt76x2_common snd_hda_intel mt76x02_usb iwlmvm snd_hda_codec
mt76x02_lib snd_usb_audio mt76_usb snd_hda_core kvm_amd mt76
snd_intel_dspcfg snd_intel_sdw_acpi vfat fat snd_hwdep mac80211
eeepc_wmi snd_usbmidi_lib asus_wmi kvm snd_rawmidi btusb snd_seq btrtl
snd_seq_device snd_pcm btbcm btintel ledtrig_audio iwlwifi irqbypass
snd_timer libarc4 btmtk sparse_keymap asus_ec_sensors bluetooth snd
edac_mce_amd platform_profile cfg80211 wmi_bmof pcspkr soundcore mc
i2c_piix4 k10temp rfkill joydev acpi_cpufreq loop zram amdgpu
drm_ttm_helper ttm iommu_v2 drm_buddy
gpu_sched crc32_pclmul drm_display_helper nvme ghash_clmulni_intel
ucsi_ccg polyval_clmulni igb typec_ucsi polyval_generic cec ccp
nvme_core sha512_ssse3 typec crct10dif_pclmul video crc32c_intel
sp5100_tco i2c_algo_bit dca nvme_common wmi ip6_tables ip_tables fuse
CR2: 0000000000000078
---[ end trace 0000000000000000 ]---
RIP: 0010:drm_sched_job_cleanup+0x2a/0x130 [gpu_sched]
Code: 0f 1f 44 00 00 55 53 48 89 fb 48 83 ec 10 48 8b 7f 20 65 48 8b
04 25 28 00 00 00 48 89 44 24 08 31 c0 48 c7 04 24 00 00 00 00 <8b> 47
78 85 c0 0f 84 c2 00 00 00 48 83 ff c0 74 1f 48 8d 57 78 b8
RSP: 0018:ffffa69d5d33fa10 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8a617d87c000 RCX: 00000000b93d601f
RDX: 00000000b93d401f RSI: ad0811cd15498925 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff8a55c4986018 R09: 0000000080080000
R10: 0000000000000001 R11: 0000000000000000 R12: 00000000ffffffff
R13: 0000000000000018 R14: 0000000000000000 R15: ffffa69d5d33faf8
FS: 0000000048b6f6c0(0000) GS:ffff8a64aa9c0000(0000) knlGS:000000003bc40000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000078 CR3: 000000015a164000 CR4: 0000000000350ee0
note: ForzaHorizon4.e[40791] exited with irqs disabled
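
For what it's worth, the two flavours of the report seem to point at the
same address. If I read the generic KASAN shadow mapping on x86-64
correctly (shadow = (addr >> 3) + 0xdffffc0000000000 with the default
KASAN_SHADOW_OFFSET), then:

    0x78 >> 3 = 0xf
    0xf + 0xdffffc0000000000 = 0xdffffc000000000f

which is exactly the "non-canonical address 0xdffffc000000000f" from the
KASAN build (and its null-ptr-deref range [0x78-0x7f]), while the
non-KASAN build faults directly with CR2: 0000000000000078. So in both
cases something dereferences a NULL pointer plus 0x78.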

To reproduce it, you need to spend more time running Cyberpunk 2077,
Forza Horizon 4, and Forza Horizon 5 in turn.

--
Best Regards,
Mike Gavrilov.


Attachments:
BUG-kernel-NULL-pointer-dereference-address-0000000000000078.tar.xz (24.47 kB)

2023-04-19 07:17:54

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

Christian?

❯ /usr/src/kernels/6.3.0-0.rc7.56.fc39.x86_64/scripts/faddr2line
/lib/debug/lib/modules/6.3.0-0.rc7.56.fc39.x86_64/kernel/drivers/gpu/drm/scheduler/gpu-sched.ko.debug
drm_sched_job_cleanup+0x9a
drm_sched_job_cleanup+0x9a/0x130:
drm_sched_job_cleanup at
/usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c:808
(discriminator 3)

❯ cat -s -n /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c
| head -818 | tail -20
799 /* drm_sched_job_arm() has been called */
800 dma_fence_put(&job->s_fence->finished);
801 } else {
802 /* aborted job before committing to run it */
803 drm_sched_fence_free(job->s_fence);
804 }
805
806 job->s_fence = NULL;
807
808 xa_for_each(&job->dependencies, index, fence) {
809 dma_fence_put(fence);
810 }
811 xa_destroy(&job->dependencies);
812
813 }
814 EXPORT_SYMBOL(drm_sched_job_cleanup);
815
816 /**
817 * drm_sched_ready - is the scheduler ready
818 *
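
For context, the function starts a few lines above the listing; in the
same tree it looks roughly like this (quoting from memory, so treat it
as a sketch rather than the exact source):

    void drm_sched_job_cleanup(struct drm_sched_job *job)
    {
            struct dma_fence *fence;
            unsigned long index;

            if (kref_read(&job->s_fence->finished.refcount)) {
                    /* drm_sched_job_arm() has been called */
                    dma_fence_put(&job->s_fence->finished);
            } else {
                    /* aborted job before committing to run it */
                    drm_sched_fence_free(job->s_fence);
            }

            job->s_fence = NULL;
            ...
    }

If job->s_fence were already NULL when we get here (for example because
the cleanup already ran once and set it to NULL at line 806), the
kref_read() would be a 4-byte load through a NULL s_fence, and if my
offset math is right (finished is the second dma_fence inside
drm_sched_fence, with refcount at 0x38 inside it, so 0x40 + 0x38 = 0x78)
that would match the "Read of size 4 at addr 0000000000000078" exactly.
That is only a guess on my side, of course.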

> git blame drivers/gpu/drm/scheduler/sched_main.c -L 800,819
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 800)             dma_fence_put(&job->s_fence->finished);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 801)     } else {
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 802)             /* aborted job before committing to run it */
d4c16733e7960 drivers/gpu/drm/scheduler/sched_main.c (Boris Brezillon 2021-09-03 14:05:54 +0200 803)           drm_sched_fence_free(job->s_fence);
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 804)     }
dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-17 10:49:16 +0200 805)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 806)     job->s_fence = NULL;
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 807)
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 808)     xa_for_each(&job->dependencies, index, fence) {
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 809)             dma_fence_put(fence);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 810)     }
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 811)     xa_destroy(&job->dependencies);
ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel Vetter 2021-08-05 12:46:49 +0200 812)
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 813) }
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 814) EXPORT_SYMBOL(drm_sched_job_cleanup);
26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat Masetty 2018-10-29 15:02:28 +0530 815)
e688b728228b9 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c (Christian König 2015-08-20 17:01:01 +0200 816) /**
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 817)  * drm_sched_ready - is the scheduler ready
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 818)  *
2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan Deshmukh 2018-05-29 11:23:07 +0530 819)  * @sched: scheduler instance

Daniel, since Christian looks a little busy, can you help? git blame
says that you are the author of the code which KASAN mentions in its
report.
The issue is reproducible on all the AMD hardware available to me:
6800M, 6900XT, 7900XTX.

--
Best Regards,
Mike Gavrilov.

2023-04-19 08:27:39

by Christian König

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On 19.04.23 09:00, Mikhail Gavrilov wrote:
> Christian?

I'm already looking into this, but can't figure out why we run into
problems here.

What happens is that a CS is aborted without sending the job to the
scheduler and in this case the cleanup function doesn't seem to work.

Christian.

>
> ❯ /usr/src/kernels/6.3.0-0.rc7.56.fc39.x86_64/scripts/faddr2line
> /lib/debug/lib/modules/6.3.0-0.rc7.56.fc39.x86_64/kernel/drivers/gpu/drm/scheduler/gpu-sched.ko.debug
> drm_sched_job_cleanup+0x9a
> drm_sched_job_cleanup+0x9a/0x130:
> drm_sched_job_cleanup at
> /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c:808
> (discriminator 3)
>
> ❯ cat -s -n /usr/src/debug/kernel-6.3-rc7/linux-6.3.0-0.rc7.56.fc39.x86_64/drivers/gpu/drm/scheduler/sched_main.c
> | head -818 | tail -20
> 799 /* drm_sched_job_arm() has been called */
> 800 dma_fence_put(&job->s_fence->finished);
> 801 } else {
> 802 /* aborted job before committing to run it */
> 803 drm_sched_fence_free(job->s_fence);
> 804 }
> 805
> 806 job->s_fence = NULL;
> 807
> 808 xa_for_each(&job->dependencies, index, fence) {
> 809 dma_fence_put(fence);
> 810 }
> 811 xa_destroy(&job->dependencies);
> 812
> 813 }
> 814 EXPORT_SYMBOL(drm_sched_job_cleanup);
> 815
> 816 /**
> 817 * drm_sched_ready - is the scheduler ready
> 818 *
>
>> git blame drivers/gpu/drm/scheduler/sched_main.c -L 800,819
> dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-17 10:49:16 +0200 800)
> dma_fence_put(&job->s_fence->finished);
> dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-17 10:49:16 +0200 801) } else {
> dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-17 10:49:16 +0200 802) /* aborted job
> before committing to run it */
> d4c16733e7960 drivers/gpu/drm/scheduler/sched_main.c (Boris
> Brezillon 2021-09-03 14:05:54 +0200 803)
> drm_sched_fence_free(job->s_fence);
> dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-17 10:49:16 +0200 804) }
> dbe48d030b285 drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-17 10:49:16 +0200 805)
> 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
> Masetty 2018-10-29 15:02:28 +0530 806) job->s_fence = NULL;
> ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-05 12:46:49 +0200 807)
> ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-05 12:46:49 +0200 808)
> xa_for_each(&job->dependencies, index, fence) {
> ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-05 12:46:49 +0200 809)
> dma_fence_put(fence);
> ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-05 12:46:49 +0200 810) }
> ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-05 12:46:49 +0200 811)
> xa_destroy(&job->dependencies);
> ebd5f74255b9f drivers/gpu/drm/scheduler/sched_main.c (Daniel
> Vetter 2021-08-05 12:46:49 +0200 812)
> 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
> Masetty 2018-10-29 15:02:28 +0530 813) }
> 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
> Masetty 2018-10-29 15:02:28 +0530 814)
> EXPORT_SYMBOL(drm_sched_job_cleanup);
> 26efecf955889 drivers/gpu/drm/scheduler/sched_main.c (Sharat
> Masetty 2018-10-29 15:02:28 +0530 815)
> e688b728228b9 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c (Christian
> König 2015-08-20 17:01:01 +0200 816) /**
> 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
> Deshmukh 2018-05-29 11:23:07 +0530 817) * drm_sched_ready - is the
> scheduler ready
> 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
> Deshmukh 2018-05-29 11:23:07 +0530 818) *
> 2d33948e4e00b drivers/gpu/drm/scheduler/gpu_scheduler.c (Nayan
> Deshmukh 2018-05-29 11:23:07 +0530 819) * @sched: scheduler instance
>
> Daniel, because Christian, looks a little busy. Can you help? The git
> blame says that you are the author of code which KASAN mentions in its
> report.
> The issue is reproducible on all available AMD hardware: 6800M, 6900XT, 7900XTX.
>

2023-04-19 13:21:46

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Wed, Apr 19, 2023 at 1:12 PM Christian König
<[email protected]> wrote:
>
> I'm already looking into this, but can't figure out why we run into
> problems here.
>
> What happens is that a CS is aborted without sending the job to the
> scheduler and in this case the cleanup function doesn't seem to work.
>
> Christian.

I can easily reproduce it on any AMD GPU hardware.
You can add more debug logging, and I will come back with new logs that
explain what is going on.
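
For example, a quick (untested) instrumentation sketch like this at the
top of drm_sched_job_cleanup() in drivers/gpu/drm/scheduler/sched_main.c
would tell us whether the job arrives there with its s_fence already gone:

    /* debugging only: catch a cleanup call on a job without an s_fence */
    if (WARN(!job->s_fence, "drm_sched_job_cleanup: s_fence is NULL, job=%p\n", job))
            return;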
Thanks.

--
Best Regards,
Mike Gavrilov.

2023-04-19 13:25:00

by Christian König

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On 19.04.23 15:13, Mikhail Gavrilov wrote:
> On Wed, Apr 19, 2023 at 1:12 PM Christian König
> <[email protected]> wrote:
>> I'm already looking into this, but can't figure out why we run into
>> problems here.
>>
>> What happens is that a CS is aborted without sending the job to the
>> scheduler and in this case the cleanup function doesn't seem to work.
>>
>> Christian.
> I can easily reproduce it on any AMD GPU hardware.

Well, that's exactly the problem: I can't reproduce it.

Have you applied any local change or config which could explain that?

Christian.

> You can add more logs to debug and I return with new logs which explains this.
> Thanks.
>

2023-04-19 19:25:52

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Wed, Apr 19, 2023 at 6:15 PM Christian König
<[email protected]> wrote:
>
> Well exactly that's the problem, I can't reproduce it.
>
> Have you applied any local change or config which could explain that?
>

I did not apply any local changes.
I just pulled all the changes from drm-fixes and used the attached config.
After that I was lucky enough to reproduce the issue within 3 minutes.
I recorded this on video on the first take: https://youtu.be/wqD_t8AFU3s
I am able to reproduce it on various hardware:
Ryzen 3950X + 7900XTX
Ryzen 7950X + 6900XT
Ryzen 5900HX + 6800M

--
Best Regards,
Mike Gavrilov.


Attachments:
.config (251.79 kB)

2023-04-20 10:06:46

by Christian König

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On 19.04.23 21:17, Mikhail Gavrilov wrote:
> On Wed, Apr 19, 2023 at 6:15 PM Christian König
> <[email protected]> wrote:
>> Well exactly that's the problem, I can't reproduce it.
>>
>> Have you applied any local change or config which could explain that?
>>
> I did not apply any local changes.
> I just pulled all changes from drm-fixes and used the attached config.
> After that I am lucky to reproduce the issue after 3 minutes.
> I recorded this on video with first take: https://youtu.be/wqD_t8AFU3s
> I able to reproduce it on various hardware:
> Ryzen 3950X + 7900XTX
> Ryzen 7950X + 6900XT
> Ryzen 5900HX + 6800M

Could you try drm-misc-next as well?

Going to give drm-fixes another round of testing.

Thanks,
Christian.

2023-04-20 10:47:22

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Thu, Apr 20, 2023 at 2:59 PM Christian König
<[email protected]> wrote:
>
> Could you try drm-misc-next as well?
>
> Going to give drm-fixes another round of testing.
>
> Thanks,
> Christian.

Important: don't give up.
https://youtu.be/25zhHBGIHJ8 [40 min]
https://youtu.be/utnDR26eYBY [50 min]
https://youtu.be/DJQ_tiimW6g [12 min]
https://youtu.be/Y6AH1oJKivA [6 min]
Yes, the issue is definitely reproducible, but from time to time it
does not happen on the first attempt.
I also uploaded the other videos to prove that the issue definitely
exists if someone launches those games in turn.
Reproducibility is only a matter of time.

Anyway, I didn't want you to spend so much time trying to reproduce it.
This monkey business suits me better than it does you.
It would be better if I could collect more useful info.

--
Best Regards,
Mike Gavrilov.

2023-04-20 21:28:24

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Thu, Apr 20, 2023 at 2:59 PM Christian König
<[email protected]> wrote:
> Could you try drm-misc-next as well?

If, as I assume, I cloned the right repo
$ git clone -b drm-misc-next
git://anongit.freedesktop.org/drm/drm-misc linux-drm-misc-next
then the last commit on this branch turns out to be completely broken
for my hardware.
Instead of the GDM login screen I see a black screen and hear the GPU
fans howling.

In the kernel logs I see a general protection fault:
general protection fault, probably for non-canonical address
0xdffffc000000002b: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000158-0x000000000000015f]
CPU: 0 PID: 749 Comm: sdma0 Tainted: G W L
6.3.0-rc4-misc-next-91c249b2b9f6a80c744387b6713adf275ffd296b+ #1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched]
Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00
00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c
02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8
RSP: 0018:ffffc9000548fdb8 EFLAGS: 00010216
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 000000000000002b RSI: 0000000000000004 RDI: 0000000000000158
RBP: 000000000000085c R08: 0000000000000000 R09: ffff888170711783
R10: ffffed102e0e22f0 R11: ffffffff8da81678 R12: ffff8881707116b0
R13: ffff888170711780 R14: ffff888266f89820 R15: ffff888266f89808
FS: 0000000000000000(0000) GS:ffff888fa2000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560cea4a8000 CR3: 0000000191602000 CR4: 0000000000350ef0
Call Trace:
<TASK>
drm_sched_main+0xc3/0x930 [gpu_sched]
? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
? __pfx_autoremove_wake_function+0x10/0x10
? __kthread_parkme+0xc1/0x1f0
? __pfx_drm_sched_main+0x10/0x10 [gpu_sched]
kthread+0x2a2/0x340
? __pfx_kthread+0x10/0x10
ret_from_fork+0x2c/0x50
</TASK>
Modules linked in: amdgpu(+) drm_ttm_helper ttm video crct10dif_pclmul
drm_suballoc_helper crc32_pclmul iommu_v2 crc32c_intel drm_buddy
polyval_clmulni gpu_sched polyval_generic ucsi_ccg drm_display_helper
typec_ucsi nvme ghash_clmulni_intel igb typec ccp sha512_ssse3 cec
nvme_core sp5100_tco dca i2c_algo_bit nvme_common wmi ip6_tables
ip_tables fuse
---[ end trace 0000000000000000 ]---
RIP: 0010:drm_sched_get_cleanup_job+0x41b/0x5c0 [gpu_sched]
Code: fa 48 c1 ea 03 80 3c 02 00 75 5c 49 8b 9f 80 00 00 00 48 b8 00
00 00 00 00 fc ff df 48 8d bb 58 01 00 00 48 89 fa 48 c1 ea 03 <80> 3c
02 00 75 55 48 01 ab 58 01 00 00 e9 0c fd ff ff 48 89 ef e8
RSP: 0018:ffffc9000548fdb8 EFLAGS: 00010216
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 000000000000002b RSI: 0000000000000004 RDI: 0000000000000158
RBP: 000000000000085c R08: 0000000000000000 R09: ffff888170711783
R10: ffffed102e0e22f0 R11: ffffffff8da81678 R12: ffff8881707116b0
R13: ffff888170711780 R14: ffff888266f89820 R15: ffff888266f89808
FS: 0000000000000000(0000) GS:ffff888fa2000000(0000) knlGS:0000000000000000
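
If it helps, the same decoding approach as before should work for this
crash too; once the debuginfo for this drm-misc-next build is at hand,
something like the following (paths depend on where the build landed)
would map the offset to a source line:

    ./scripts/faddr2line drivers/gpu/drm/scheduler/gpu-sched.ko drm_sched_get_cleanup_job+0x41b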


I also attached a full system log.

--
Best Regards,
Mike Gavrilov.


Attachments:
system-log.tar.xz (47.95 kB)

2023-04-25 13:21:11

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov
<[email protected]> wrote:
>
> Important don't give up.
> https://youtu.be/25zhHBGIHJ8 [40 min]
> https://youtu.be/utnDR26eYBY [50 min]
> https://youtu.be/DJQ_tiimW6g [12 min]
> https://youtu.be/Y6AH1oJKivA [6 min]
> Yes the issue is everything reproducible, but time to time it not
> happens at first attempt.
> I also uploaded other videos which proves that the issue definitely
> exists if someone will launch those games in turn.
> Reproducibility is only a matter of time.
>
> Anyway I didn't want you to spend so much time trying to reproduce it.
> This monkey business fits me more than you.
> It would be better if I could collect more useful info.

Christian,
Did you manage to reproduce the problem?

Over the weekend I ran into a slab-use-after-free in amdgpu_vm_handle_moved.
I wasn't playing the games at the time.
The Xwayland process was affected, which led to a desktop hang.

==================================================================
BUG: KASAN: slab-use-after-free in amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
Read of size 8 at addr ffff888295c66190 by task Xwayland:cs0/173185

CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: G W L
------- --- 6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug
#1
Hardware name: System manufacturer System Product Name/ROG STRIX
X570-I GAMING, BIOS 4601 02/02/2023
Call Trace:
<TASK>
dump_stack_lvl+0x76/0xd0
print_report+0xcf/0x670
? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
kasan_report+0xa8/0xe0
? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu]
? __pfx___lock_acquire+0x10/0x10
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
? mark_lock+0x101/0x16e0
? __lock_acquire+0xe54/0x59f0
? __pfx_lock_release+0x10/0x10
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
drm_ioctl_kernel+0x1fc/0x3d0
? __pfx_drm_ioctl_kernel+0x10/0x10
drm_ioctl+0x4c5/0xaa0
? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
? __pfx_drm_ioctl+0x10/0x10
? _raw_spin_unlock_irqrestore+0x66/0x80
? lockdep_hardirqs_on+0x81/0x110
? _raw_spin_unlock_irqrestore+0x4f/0x80
amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
__x64_sys_ioctl+0x131/0x1a0
do_syscall_64+0x60/0x90
? do_syscall_64+0x6c/0x90
? lockdep_hardirqs_on+0x81/0x110
? do_syscall_64+0x6c/0x90
? lockdep_hardirqs_on+0x81/0x110
? do_syscall_64+0x6c/0x90
? lockdep_hardirqs_on+0x81/0x110
? do_syscall_64+0x6c/0x90
? lockdep_hardirqs_on+0x81/0x110
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7ffb71b0892d
Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2
3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
RSP: 002b:00007ffb677fe840 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007ffb677fe9f8 RCX: 00007ffb71b0892d
RDX: 00007ffb677fe900 RSI: 00000000c0186444 RDI: 000000000000000d
RBP: 00007ffb677fe890 R08: 00007ffb677fea50 R09: 00007ffb677fe8e0
R10: 0000556c4611bec0 R11: 0000000000000246 R12: 00007ffb677fe900
R13: 00000000c0186444 R14: 000000000000000d R15: 00007ffb677fe9f8
</TASK>

Allocated by task 173181:
kasan_save_stack+0x33/0x60
kasan_set_track+0x25/0x30
__kasan_kmalloc+0x8f/0xa0
__kmalloc_node+0x65/0x160
amdgpu_bo_create+0x31e/0xfb0 [amdgpu]
amdgpu_bo_create_user+0xca/0x160 [amdgpu]
amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu]
drm_ioctl_kernel+0x1fc/0x3d0
drm_ioctl+0x4c5/0xaa0
amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
__x64_sys_ioctl+0x131/0x1a0
do_syscall_64+0x60/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

Freed by task 173185:
kasan_save_stack+0x33/0x60
kasan_set_track+0x25/0x30
kasan_save_free_info+0x2e/0x50
__kasan_slab_free+0x10b/0x1a0
slab_free_freelist_hook+0x11e/0x1d0
__kmem_cache_free+0xc0/0x2e0
ttm_bo_release+0x667/0x9e0 [ttm]
amdgpu_bo_unref+0x35/0x70 [amdgpu]
amdgpu_gem_object_free+0x73/0xb0 [amdgpu]
drm_gem_handle_delete+0xe3/0x150
drm_ioctl_kernel+0x1fc/0x3d0
drm_ioctl+0x4c5/0xaa0
amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
__x64_sys_ioctl+0x131/0x1a0
do_syscall_64+0x60/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

Last potentially related work creation:
kasan_save_stack+0x33/0x60
__kasan_record_aux_stack+0x97/0xb0
__call_rcu_common.constprop.0+0xf8/0x1af0
drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
dma_resv_reserve_fences+0x4dc/0x7f0
ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm]
amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu]
drm_ioctl_kernel+0x1fc/0x3d0
drm_ioctl+0x4c5/0xaa0
amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
__x64_sys_ioctl+0x131/0x1a0
do_syscall_64+0x60/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

Second to last potentially related work creation:
kasan_save_stack+0x33/0x60
__kasan_record_aux_stack+0x97/0xb0
__call_rcu_common.constprop.0+0xf8/0x1af0
drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
amdgpu_ctx_add_fence+0x2b1/0x390 [amdgpu]
amdgpu_cs_ioctl+0x44d0/0x5630 [amdgpu]
drm_ioctl_kernel+0x1fc/0x3d0
drm_ioctl+0x4c5/0xaa0
amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
__x64_sys_ioctl+0x131/0x1a0
do_syscall_64+0x60/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc

The buggy address belongs to the object at ffff888295c66000
which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 400 bytes inside of
freed 1024-byte region [ffff888295c66000, ffff888295c66400)

The buggy address belongs to the physical page:
page:00000000125ffbe3 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x295c60
head:00000000125ffbe3 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
raw: 0017ffffc0010200 ffff88810004cdc0 0000000000000000 dead000000000001
raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff888295c66080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888295c66100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888295c66180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888295c66200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888295c66280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================

--
Best Regards,
Mike Gavrilov.

2023-04-26 02:06:26

by Chen, Guchun

[permalink] [raw]
Subject: RE: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

After reviewing this whole history, maybe the attached patch is able to fix your problem. Could you give it a try, please?
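
In case it is useful: applying it on top of your tree should just be

    git am 0001-drm-amdgpu-drop-redudant-sched-job-cleanup-when-cs-i.patch

followed by a rebuild and another run of the games.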

Regards,
Guchun

> -----Original Message-----
> From: amd-gfx <[email protected]> On Behalf Of
> Mikhail Gavrilov
> Sent: Tuesday, April 25, 2023 9:20 PM
> To: Koenig, Christian <[email protected]>
> Cc: Daniel Vetter <[email protected]>; dri-devel <dri-
> [email protected]>; amd-gfx list <[email protected]>;
> Linux List Kernel Mailing <[email protected]>
> Subject: Re: BUG: KASAN: null-ptr-deref in
> drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
>
> On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov
> <[email protected]> wrote:
> >
> > Important don't give up.
> > https://youtu.be/25zhHBGIHJ8 [40 min]
> > https://youtu.be/utnDR26eYBY [50 min]
> > https://youtu.be/DJQ_tiimW6g [12 min]
> > https://youtu.be/Y6AH1oJKivA [6 min]
> > Yes the issue is everything reproducible, but time to time it not
> > happens at first attempt.
> > I also uploaded other videos which proves that the issue definitely
> > exists if someone will launch those games in turn.
> > Reproducibility is only a matter of time.
> >
> > Anyway I didn't want you to spend so much time trying to reproduce it.
> > This monkey business fits me more than you.
> > It would be better if I could collect more useful info.
>
> Christian,
> Did you manage to reproduce the problem?
>
> At the weekend I faced with slab-use-after-free in
> amdgpu_vm_handle_moved.
> I didn't play in the games at this time.
> The Xwayland process was affected so it leads to desktop hang.
>
> ================================================================
> ==
> BUG: KASAN: slab-use-after-free in
> amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] Read of size 8 at addr
> ffff888295c66190 by task Xwayland:cs0/173185
>
> CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: G W L
> ------- --- 6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug
> #1
> Hardware name: System manufacturer System Product Name/ROG STRIX
> X570-I GAMING, BIOS 4601 02/02/2023 Call Trace:
> <TASK>
> dump_stack_lvl+0x76/0xd0
> print_report+0xcf/0x670
> ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] ?
> amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
> kasan_report+0xa8/0xe0
> ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
> amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
> amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu]
> ? __pfx___lock_acquire+0x10/0x10
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ? mark_lock+0x101/0x16e0 ?
> __lock_acquire+0xe54/0x59f0 ? __pfx_lock_release+0x10/0x10 ?
> __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
> drm_ioctl_kernel+0x1fc/0x3d0
> ? __pfx_drm_ioctl_kernel+0x10/0x10
> drm_ioctl+0x4c5/0xaa0
> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu] ?
> __pfx_drm_ioctl+0x10/0x10 ? _raw_spin_unlock_irqrestore+0x66/0x80
> ? lockdep_hardirqs_on+0x81/0x110
> ? _raw_spin_unlock_irqrestore+0x4f/0x80
> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
> __x64_sys_ioctl+0x131/0x1a0
> do_syscall_64+0x60/0x90
> ? do_syscall_64+0x6c/0x90
> ? lockdep_hardirqs_on+0x81/0x110
> ? do_syscall_64+0x6c/0x90
> ? lockdep_hardirqs_on+0x81/0x110
> ? do_syscall_64+0x6c/0x90
> ? lockdep_hardirqs_on+0x81/0x110
> ? do_syscall_64+0x6c/0x90
> ? lockdep_hardirqs_on+0x81/0x110
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7ffb71b0892d
> Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
> 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00
> f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
> RSP: 002b:00007ffb677fe840 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> RAX: ffffffffffffffda RBX: 00007ffb677fe9f8 RCX: 00007ffb71b0892d
> RDX: 00007ffb677fe900 RSI: 00000000c0186444 RDI: 000000000000000d
> RBP: 00007ffb677fe890 R08: 00007ffb677fea50 R09: 00007ffb677fe8e0
> R10: 0000556c4611bec0 R11: 0000000000000246 R12: 00007ffb677fe900
> R13: 00000000c0186444 R14: 000000000000000d R15: 00007ffb677fe9f8
> </TASK>
>
> Allocated by task 173181:
> kasan_save_stack+0x33/0x60
> kasan_set_track+0x25/0x30
> __kasan_kmalloc+0x8f/0xa0
> __kmalloc_node+0x65/0x160
> amdgpu_bo_create+0x31e/0xfb0 [amdgpu]
> amdgpu_bo_create_user+0xca/0x160 [amdgpu]
> amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu]
> drm_ioctl_kernel+0x1fc/0x3d0
> drm_ioctl+0x4c5/0xaa0
> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
> __x64_sys_ioctl+0x131/0x1a0
> do_syscall_64+0x60/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> Freed by task 173185:
> kasan_save_stack+0x33/0x60
> kasan_set_track+0x25/0x30
> kasan_save_free_info+0x2e/0x50
> __kasan_slab_free+0x10b/0x1a0
> slab_free_freelist_hook+0x11e/0x1d0
> __kmem_cache_free+0xc0/0x2e0
> ttm_bo_release+0x667/0x9e0 [ttm]
> amdgpu_bo_unref+0x35/0x70 [amdgpu]
> amdgpu_gem_object_free+0x73/0xb0 [amdgpu]
> drm_gem_handle_delete+0xe3/0x150
> drm_ioctl_kernel+0x1fc/0x3d0
> drm_ioctl+0x4c5/0xaa0
> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
> __x64_sys_ioctl+0x131/0x1a0
> do_syscall_64+0x60/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> Last potentially related work creation:
> kasan_save_stack+0x33/0x60
> __kasan_record_aux_stack+0x97/0xb0
> __call_rcu_common.constprop.0+0xf8/0x1af0
> drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
> dma_resv_reserve_fences+0x4dc/0x7f0
> ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm]
> amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu]
> drm_ioctl_kernel+0x1fc/0x3d0
> drm_ioctl+0x4c5/0xaa0
> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
> __x64_sys_ioctl+0x131/0x1a0
> do_syscall_64+0x60/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> Second to last potentially related work creation:
> kasan_save_stack+0x33/0x60
> __kasan_record_aux_stack+0x97/0xb0
> __call_rcu_common.constprop.0+0xf8/0x1af0
> drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
> amdgpu_ctx_add_fence+0x2b1/0x390 [amdgpu]
> amdgpu_cs_ioctl+0x44d0/0x5630 [amdgpu]
> drm_ioctl_kernel+0x1fc/0x3d0
> drm_ioctl+0x4c5/0xaa0
> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
> __x64_sys_ioctl+0x131/0x1a0
> do_syscall_64+0x60/0x90
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> The buggy address belongs to the object at ffff888295c66000 which belongs
> to the cache kmalloc-1k of size 1024 The buggy address is located 400 bytes
> inside of freed 1024-byte region [ffff888295c66000, ffff888295c66400)
>
> The buggy address belongs to the physical page:
> page:00000000125ffbe3 refcount:1 mapcount:0 mapping:0000000000000000
> index:0x0 pfn:0x295c60
> head:00000000125ffbe3 order:3 entire_mapcount:0 nr_pages_mapped:0
> pincount:0 anon flags:
> 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
> raw: 0017ffffc0010200 ffff88810004cdc0 0000000000000000
> dead000000000001
> raw: 0000000000000000 0000000000100010 00000001ffffffff
> 0000000000000000 page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff888295c66080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff888295c66100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> >ffff888295c66180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ^
> ffff888295c66200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff888295c66280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ================================================================
> ==
>
> --
> Best Regards,
> Mike Gavrilov.


Attachments:
0001-drm-amdgpu-drop-redudant-sched-job-cleanup-when-cs-i.patch (1.91 kB)

2023-04-26 11:52:07

by Christian König

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

Sending that once more from my mailing list address since AMD internal
servers are blocking the mail.

Regards,
Christian.

On 26.04.23 13:48, Christian König wrote:
> WTF? I owe you a beer!
>
> I fixed exactly that problem during the review process of the
> cleanup patch, and because of that didn't consider that the code was
> still there.
>
> It also explains why we don't see that in our testing.
>
> @Mikhail can you test that patch with drm-misc-next?
>
> Thanks,
> Christian.
>
> On 26.04.23 04:00, Chen, Guchun wrote:
>> After reviewing this whole history, maybe attached patch is able to
>> fix your problem. Can you have a try please?
>>
>> Regards,
>> Guchun
>>
>>> -----Original Message-----
>>> From: amd-gfx <[email protected]> On Behalf Of
>>> Mikhail Gavrilov
>>> Sent: Tuesday, April 25, 2023 9:20 PM
>>> To: Koenig, Christian <[email protected]>
>>> Cc: Daniel Vetter <[email protected]>; dri-devel <dri-
>>> [email protected]>; amd-gfx list
>>> <[email protected]>;
>>> Linux List Kernel Mailing <[email protected]>
>>> Subject: Re: BUG: KASAN: null-ptr-deref in
>>> drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
>>>
>>> On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov
>>> <[email protected]> wrote:
>>>> Important don't give up.
>>>> https://youtu.be/25zhHBGIHJ8 [40 min]
>>>> https://youtu.be/utnDR26eYBY [50 min]
>>>> https://youtu.be/DJQ_tiimW6g [12 min]
>>>> https://youtu.be/Y6AH1oJKivA [6 min]
>>>> Yes the issue is everything reproducible, but time to time it not
>>>> happens at first attempt.
>>>> I also uploaded other videos which proves that the issue definitely
>>>> exists if someone will launch those games in turn.
>>>> Reproducibility is only a matter of time.
>>>>
>>>> Anyway I didn't want you to spend so much time trying to reproduce it.
>>>> This monkey business fits me more than you.
>>>> It would be better if I could collect more useful info.
>>> Christian,
>>> Did you manage to reproduce the problem?
>>>
>>> At the weekend I faced with slab-use-after-free in
>>> amdgpu_vm_handle_moved.
>>> I didn't play in the games at this time.
>>> The Xwayland process was affected so it leads to desktop hang.
>>>
>>> ================================================================
>>> ==
>>> BUG: KASAN: slab-use-after-free in
>>> amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu] Read of size 8 at addr
>>> ffff888295c66190 by task Xwayland:cs0/173185
>>>
>>> CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: G        W L
>>> -------  --- 6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug
>>> #1
>>> Hardware name: System manufacturer System Product Name/ROG STRIX
>>> X570-I GAMING, BIOS 4601 02/02/2023 Call Trace:
>>>   <TASK>
>>>   dump_stack_lvl+0x76/0xd0
>>>   print_report+0xcf/0x670
>>>   ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]  ?
>>> amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>>>   kasan_report+0xa8/0xe0
>>>   ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>>>   amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>>>   amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu]
>>>   ? __pfx___lock_acquire+0x10/0x10
>>>   ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]  ?
>>> mark_lock+0x101/0x16e0  ?
>>> __lock_acquire+0xe54/0x59f0  ? __pfx_lock_release+0x10/0x10  ?
>>> __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>>>   drm_ioctl_kernel+0x1fc/0x3d0
>>>   ? __pfx_drm_ioctl_kernel+0x10/0x10
>>>   drm_ioctl+0x4c5/0xaa0
>>>   ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]  ?
>>> __pfx_drm_ioctl+0x10/0x10  ? _raw_spin_unlock_irqrestore+0x66/0x80
>>>   ? lockdep_hardirqs_on+0x81/0x110
>>>   ? _raw_spin_unlock_irqrestore+0x4f/0x80
>>>   amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>>>   __x64_sys_ioctl+0x131/0x1a0
>>>   do_syscall_64+0x60/0x90
>>>   ? do_syscall_64+0x6c/0x90
>>>   ? lockdep_hardirqs_on+0x81/0x110
>>>   ? do_syscall_64+0x6c/0x90
>>>   ? lockdep_hardirqs_on+0x81/0x110
>>>   ? do_syscall_64+0x6c/0x90
>>>   ? lockdep_hardirqs_on+0x81/0x110
>>>   ? do_syscall_64+0x6c/0x90
>>>   ? lockdep_hardirqs_on+0x81/0x110
>>>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>> RIP: 0033:0x7ffb71b0892d
>>> Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
>>> 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89>
>>> c2 3d 00
>>> f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
>>> RSP: 002b:00007ffb677fe840 EFLAGS: 00000246 ORIG_RAX:
>>> 0000000000000010
>>> RAX: ffffffffffffffda RBX: 00007ffb677fe9f8 RCX: 00007ffb71b0892d
>>> RDX: 00007ffb677fe900 RSI: 00000000c0186444 RDI: 000000000000000d
>>> RBP: 00007ffb677fe890 R08: 00007ffb677fea50 R09: 00007ffb677fe8e0
>>> R10: 0000556c4611bec0 R11: 0000000000000246 R12: 00007ffb677fe900
>>> R13: 00000000c0186444 R14: 000000000000000d R15: 00007ffb677fe9f8
>>> </TASK>
>>>
>>> Allocated by task 173181:
>>>   kasan_save_stack+0x33/0x60
>>>   kasan_set_track+0x25/0x30
>>>   __kasan_kmalloc+0x8f/0xa0
>>>   __kmalloc_node+0x65/0x160
>>>   amdgpu_bo_create+0x31e/0xfb0 [amdgpu]
>>>   amdgpu_bo_create_user+0xca/0x160 [amdgpu]
>>>   amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu]
>>>   drm_ioctl_kernel+0x1fc/0x3d0
>>>   drm_ioctl+0x4c5/0xaa0
>>>   amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>>>   __x64_sys_ioctl+0x131/0x1a0
>>>   do_syscall_64+0x60/0x90
>>>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>>
>>> Freed by task 173185:
>>>   kasan_save_stack+0x33/0x60
>>>   kasan_set_track+0x25/0x30
>>>   kasan_save_free_info+0x2e/0x50
>>>   __kasan_slab_free+0x10b/0x1a0
>>>   slab_free_freelist_hook+0x11e/0x1d0
>>>   __kmem_cache_free+0xc0/0x2e0
>>>   ttm_bo_release+0x667/0x9e0 [ttm]
>>>   amdgpu_bo_unref+0x35/0x70 [amdgpu]
>>>   amdgpu_gem_object_free+0x73/0xb0 [amdgpu]
>>>   drm_gem_handle_delete+0xe3/0x150
>>>   drm_ioctl_kernel+0x1fc/0x3d0
>>>   drm_ioctl+0x4c5/0xaa0
>>>   amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>>>   __x64_sys_ioctl+0x131/0x1a0
>>>   do_syscall_64+0x60/0x90
>>>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>>
>>> Last potentially related work creation:
>>>   kasan_save_stack+0x33/0x60
>>>   __kasan_record_aux_stack+0x97/0xb0
>>>   __call_rcu_common.constprop.0+0xf8/0x1af0
>>>   drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
>>>   dma_resv_reserve_fences+0x4dc/0x7f0
>>>   ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm]
>>>   amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu]
>>>   drm_ioctl_kernel+0x1fc/0x3d0
>>>   drm_ioctl+0x4c5/0xaa0
>>>   amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>>>   __x64_sys_ioctl+0x131/0x1a0
>>>   do_syscall_64+0x60/0x90
>>>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>>
>>> Second to last potentially related work creation:
>>>   kasan_save_stack+0x33/0x60
>>>   __kasan_record_aux_stack+0x97/0xb0
>>>   __call_rcu_common.constprop.0+0xf8/0x1af0
>>>   drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
>>>   amdgpu_ctx_add_fence+0x2b1/0x390 [amdgpu]
>>>   amdgpu_cs_ioctl+0x44d0/0x5630 [amdgpu]
>>>   drm_ioctl_kernel+0x1fc/0x3d0
>>>   drm_ioctl+0x4c5/0xaa0
>>>   amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>>>   __x64_sys_ioctl+0x131/0x1a0
>>>   do_syscall_64+0x60/0x90
>>>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>>
>>> The buggy address belongs to the object at ffff888295c66000
>>> which belongs to the cache kmalloc-1k of size 1024
>>> The buggy address is located 400 bytes inside of
>>> freed 1024-byte region [ffff888295c66000, ffff888295c66400)
>>>
>>> The buggy address belongs to the physical page:
>>> page:00000000125ffbe3 refcount:1 mapcount:0 mapping:0000000000000000
>>> index:0x0 pfn:0x295c60
>>> head:00000000125ffbe3 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>>> anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
>>> raw: 0017ffffc0010200 ffff88810004cdc0 0000000000000000 dead000000000001
>>> raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
>>> page dumped because: kasan: bad access detected
>>>
>>> Memory state around the buggy address:
>>>   ffff888295c66080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>   ffff888295c66100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>> ffff888295c66180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>                           ^
>>>   ffff888295c66200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>>   ffff888295c66280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>> ==================================================================
>>>
>>> --
>>> Best Regards,
>>> Mike Gavrilov.
>

2023-04-26 16:11:02

by Christian König

[permalink] [raw]
Subject: Keyword Review - Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

WTF? I owe you a beer!

I've fixed exactly that problem during the review process of the cleanup
patch and, because of that, didn't consider that the code was still there.

It also explains why we don't see this in our testing.

@Mikhail can you test that patch with drm-misc-next?

Thanks,
Christian.
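
A small aside for readers following the thread: the failure mode being
discussed here (drm_sched_job_cleanup() reached for a job that was never
initialized) comes down to a cleanup routine reading through a pointer
that only the init routine sets. The sketch below is a self-contained
userspace model of that pattern; every name in it is made up, it is not
gpu_sched or amdgpu code and not the attached patch, and it only shows why
such a path faults at a tiny, near-NULL address equal to the member's
offset inside the missing structure.

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

struct toy_fence {
        char pad[0x78];         /* arbitrary padding before the refcount */
        int refcount;           /* the field the cleanup path reads */
};

struct toy_job {
        struct toy_fence *s_fence;      /* set only by toy_job_init() */
};

static int toy_job_init(struct toy_job *job)
{
        job->s_fence = calloc(1, sizeof(*job->s_fence));
        if (!job->s_fence)
                return -1;
        job->s_fence->refcount = 1;
        return 0;
}

static void toy_job_cleanup(struct toy_job *job)
{
        /*
         * Without this check, an error path that frees a job it never
         * initialized would read job->s_fence->refcount through a NULL
         * pointer, i.e. a read at offsetof(struct toy_fence, refcount).
         */
        if (!job->s_fence)
                return;
        printf("cleaning up, refcount %d\n", job->s_fence->refcount);
        free(job->s_fence);
        job->s_fence = NULL;
}

int main(void)
{
        struct toy_job normal = { 0 };
        struct toy_job aborted = { 0 };         /* error path: init never ran */

        if (toy_job_init(&normal) == 0)
                toy_job_cleanup(&normal);       /* normal lifecycle */

        toy_job_cleanup(&aborted);              /* safe only because of the check */
        printf("refcount offset: 0x%zx\n",
               offsetof(struct toy_fence, refcount));
        return 0;
}

Compiled as-is this runs cleanly thanks to the NULL check in
toy_job_cleanup(); remove the check and the second call becomes exactly
the kind of near-NULL read that a sanitizer flags on the aborted path.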

On 26.04.23 04:00, Chen, Guchun wrote:
> After reviewing this whole history, maybe the attached patch is able to fix your problem. Can you give it a try, please?
>
> Regards,
> Guchun
>
>> -----Original Message-----
>> From: amd-gfx <[email protected]> On Behalf Of
>> Mikhail Gavrilov
>> Sent: Tuesday, April 25, 2023 9:20 PM
>> To: Koenig, Christian <[email protected]>
>> Cc: Daniel Vetter <[email protected]>; dri-devel <dri-
>> [email protected]>; amd-gfx list <[email protected]>;
>> Linux List Kernel Mailing <[email protected]>
>> Subject: Re: BUG: KASAN: null-ptr-deref in
>> drm_sched_job_cleanup+0x96/0x290 [gpu_sched]
>>
>> On Thu, Apr 20, 2023 at 3:32 PM Mikhail Gavrilov
>> <[email protected]> wrote:
>>> Important: don't give up.
>>> https://youtu.be/25zhHBGIHJ8 [40 min]
>>> https://youtu.be/utnDR26eYBY [50 min]
>>> https://youtu.be/DJQ_tiimW6g [12 min]
>>> https://youtu.be/Y6AH1oJKivA [6 min]
>>> Yes, the issue is definitely reproducible, but from time to time it
>>> does not happen on the first attempt.
>>> I also uploaded other videos which prove that the issue definitely
>>> exists if someone launches those games in turn.
>>> Reproducing it is only a matter of time.
>>>
>>> Anyway, I didn't want you to spend so much time trying to reproduce it.
>>> This monkey business suits me better than it suits you.
>>> It would be better if I could collect more useful info.
>> Christian,
>> Did you manage to reproduce the problem?
>>
>> Over the weekend I ran into a slab-use-after-free in
>> amdgpu_vm_handle_moved.
>> I wasn't playing any games at the time.
>> The Xwayland process was affected, which led to a desktop hang.
>>
>> ==================================================================
>> BUG: KASAN: slab-use-after-free in amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>> Read of size 8 at addr ffff888295c66190 by task Xwayland:cs0/173185
>>
>> CPU: 21 PID: 173185 Comm: Xwayland:cs0 Tainted: G W L
>> ------- --- 6.3.0-0.rc7.20230420gitcb0856346a60.59.fc39.x86_64+debug
>> #1
>> Hardware name: System manufacturer System Product Name/ROG STRIX
>> X570-I GAMING, BIOS 4601 02/02/2023
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x76/0xd0
>> print_report+0xcf/0x670
>> ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>> ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>> kasan_report+0xa8/0xe0
>> ? amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>> amdgpu_vm_handle_moved+0x286/0x2d0 [amdgpu]
>> amdgpu_cs_ioctl+0x2b7e/0x5630 [amdgpu]
>> ? __pfx___lock_acquire+0x10/0x10
>> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>> ? mark_lock+0x101/0x16e0
>> ? __lock_acquire+0xe54/0x59f0
>> ? __pfx_lock_release+0x10/0x10
>> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>> drm_ioctl_kernel+0x1fc/0x3d0
>> ? __pfx_drm_ioctl_kernel+0x10/0x10
>> drm_ioctl+0x4c5/0xaa0
>> ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu]
>> ? __pfx_drm_ioctl+0x10/0x10
>> ? _raw_spin_unlock_irqrestore+0x66/0x80
>> ? lockdep_hardirqs_on+0x81/0x110
>> ? _raw_spin_unlock_irqrestore+0x4f/0x80
>> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>> __x64_sys_ioctl+0x131/0x1a0
>> do_syscall_64+0x60/0x90
>> ? do_syscall_64+0x6c/0x90
>> ? lockdep_hardirqs_on+0x81/0x110
>> ? do_syscall_64+0x6c/0x90
>> ? lockdep_hardirqs_on+0x81/0x110
>> ? do_syscall_64+0x6c/0x90
>> ? lockdep_hardirqs_on+0x81/0x110
>> ? do_syscall_64+0x6c/0x90
>> ? lockdep_hardirqs_on+0x81/0x110
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>> RIP: 0033:0x7ffb71b0892d
>> Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00
>> 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00
>> f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00
>> RSP: 002b:00007ffb677fe840 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000010
>> RAX: ffffffffffffffda RBX: 00007ffb677fe9f8 RCX: 00007ffb71b0892d
>> RDX: 00007ffb677fe900 RSI: 00000000c0186444 RDI: 000000000000000d
>> RBP: 00007ffb677fe890 R08: 00007ffb677fea50 R09: 00007ffb677fe8e0
>> R10: 0000556c4611bec0 R11: 0000000000000246 R12: 00007ffb677fe900
>> R13: 00000000c0186444 R14: 000000000000000d R15: 00007ffb677fe9f8
>> </TASK>
>>
>> Allocated by task 173181:
>> kasan_save_stack+0x33/0x60
>> kasan_set_track+0x25/0x30
>> __kasan_kmalloc+0x8f/0xa0
>> __kmalloc_node+0x65/0x160
>> amdgpu_bo_create+0x31e/0xfb0 [amdgpu]
>> amdgpu_bo_create_user+0xca/0x160 [amdgpu]
>> amdgpu_gem_create_ioctl+0x398/0x980 [amdgpu]
>> drm_ioctl_kernel+0x1fc/0x3d0
>> drm_ioctl+0x4c5/0xaa0
>> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>> __x64_sys_ioctl+0x131/0x1a0
>> do_syscall_64+0x60/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>
>> Freed by task 173185:
>> kasan_save_stack+0x33/0x60
>> kasan_set_track+0x25/0x30
>> kasan_save_free_info+0x2e/0x50
>> __kasan_slab_free+0x10b/0x1a0
>> slab_free_freelist_hook+0x11e/0x1d0
>> __kmem_cache_free+0xc0/0x2e0
>> ttm_bo_release+0x667/0x9e0 [ttm]
>> amdgpu_bo_unref+0x35/0x70 [amdgpu]
>> amdgpu_gem_object_free+0x73/0xb0 [amdgpu]
>> drm_gem_handle_delete+0xe3/0x150
>> drm_ioctl_kernel+0x1fc/0x3d0
>> drm_ioctl+0x4c5/0xaa0
>> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>> __x64_sys_ioctl+0x131/0x1a0
>> do_syscall_64+0x60/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>
>> Last potentially related work creation:
>> kasan_save_stack+0x33/0x60
>> __kasan_record_aux_stack+0x97/0xb0
>> __call_rcu_common.constprop.0+0xf8/0x1af0
>> drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
>> dma_resv_reserve_fences+0x4dc/0x7f0
>> ttm_eu_reserve_buffers+0x3f6/0x1190 [ttm]
>> amdgpu_cs_ioctl+0x204d/0x5630 [amdgpu]
>> drm_ioctl_kernel+0x1fc/0x3d0
>> drm_ioctl+0x4c5/0xaa0
>> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>> __x64_sys_ioctl+0x131/0x1a0
>> do_syscall_64+0x60/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>
>> Second to last potentially related work creation:
>> kasan_save_stack+0x33/0x60
>> __kasan_record_aux_stack+0x97/0xb0
>> __call_rcu_common.constprop.0+0xf8/0x1af0
>> drm_sched_fence_release_scheduled+0xb8/0xe0 [gpu_sched]
>> amdgpu_ctx_add_fence+0x2b1/0x390 [amdgpu]
>> amdgpu_cs_ioctl+0x44d0/0x5630 [amdgpu]
>> drm_ioctl_kernel+0x1fc/0x3d0
>> drm_ioctl+0x4c5/0xaa0
>> amdgpu_drm_ioctl+0xd2/0x1b0 [amdgpu]
>> __x64_sys_ioctl+0x131/0x1a0
>> do_syscall_64+0x60/0x90
>> entry_SYSCALL_64_after_hwframe+0x72/0xdc
>>
>> The buggy address belongs to the object at ffff888295c66000
>> which belongs to the cache kmalloc-1k of size 1024
>> The buggy address is located 400 bytes inside of
>> freed 1024-byte region [ffff888295c66000, ffff888295c66400)
>>
>> The buggy address belongs to the physical page:
>> page:00000000125ffbe3 refcount:1 mapcount:0 mapping:0000000000000000
>> index:0x0 pfn:0x295c60
>> head:00000000125ffbe3 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
>> anon flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
>> raw: 0017ffffc0010200 ffff88810004cdc0 0000000000000000 dead000000000001
>> raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
>> page dumped because: kasan: bad access detected
>>
>> Memory state around the buggy address:
>> ffff888295c66080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff888295c66100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>> ffff888295c66180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ^
>> ffff888295c66200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ffff888295c66280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> ==================================================================
>>
>> --
>> Best Regards,
>> Mike Gavrilov.
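
For completeness on this second report: the "Allocated by task" /
"Freed by task" stacks quoted above describe a lifecycle problem rather
than a bad pointer computation. One ioctl path frees the buffer object
through the GEM handle-delete path while another path can still reach
it through a list it was queued on. The sketch below is a userspace
model of that shape only; the names are invented, it is not amdgpu code,
and it says nothing about which side the attached patch changes. Built
with -fsanitize=address, the last call reports a heap-use-after-free,
analogous in spirit to the KASAN report.

#include <stdio.h>
#include <stdlib.h>

struct toy_bo {
        int size;
        struct toy_bo *next;            /* linkage on a per-VM "moved" list */
};

static struct toy_bo *moved_list;

static struct toy_bo *toy_bo_create(int size)
{
        struct toy_bo *bo = calloc(1, sizeof(*bo));

        if (!bo)
                return NULL;
        bo->size = size;
        bo->next = moved_list;          /* queue it for later validation */
        moved_list = bo;
        return bo;
}

/* Models the handle-delete path: frees the BO without unlinking it. */
static void toy_bo_free(struct toy_bo *bo)
{
        free(bo);                       /* bug in the model: bo stays on moved_list */
}

/* Models the submission path walking the moved list afterwards. */
static void toy_handle_moved(void)
{
        struct toy_bo *bo;

        for (bo = moved_list; bo; bo = bo->next)
                printf("validating bo of size %d\n", bo->size);   /* use-after-free read */
}

int main(void)
{
        struct toy_bo *bo = toy_bo_create(1024);

        toy_bo_free(bo);                /* one path frees it ... */
        toy_handle_moved();             /* ... another path still reads it */
        return 0;
}

The generic cure for this pattern is ordering: unlink the object from
every list it can be reached from (or hold a reference for the walker)
before freeing it; how that maps onto the amdgpu paths above is a
question for the actual patch, not for this sketch.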

2023-05-02 19:34:08

by Mikhail Gavrilov

[permalink] [raw]
Subject: Re: BUG: KASAN: null-ptr-deref in drm_sched_job_cleanup+0x96/0x290 [gpu_sched]

On Wed, Apr 26, 2023 at 7:00 AM Chen, Guchun <[email protected]> wrote:
>
> After reviewing this whole history, maybe the attached patch is able to fix your problem. Can you give it a try, please?
>
> Regards,
> Guchun
>

Thanks, I tested this patch for 6 days,
and the error "BUG: KASAN: null-ptr-deref in
drm_sched_job_cleanup+0x96" has not appeared any more.
But instead I began to notice GPU hangs which happen randomly after a
"[gfxhub] page fault".
I'm not sure if there is anything useful to be seen in the page fault messages:

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40
vmid:1 pasid:32779, for process steamwebhelper pid 15552 thread
steamwebhe:cs0 pid 15832)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x00008001012c3000 from client 0x1b (UTCL2)
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00141051
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:2 pasid:32794, for process EvilDead-Win64- pid 12883 thread
EvilDead-W:cs0 pid 13035)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x00008001e62a5000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00201030
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:1 pasid:32770, for process Xwayland pid 3706 thread Xwayland:cs0
pid 3713)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x0000800100c04000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00101031
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:40
vmid:2 pasid:32784, for process thedivision.exe pid 168608 thread
thedivision.exe pid 168733)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x0000800000372000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00240C51
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPG (0x6)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x5
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x1

amdgpu 0000:03:00.0: amdgpu: [gfxhub] page fault (src_id:0 ring:24
vmid:5 pasid:32797, for process thedivision.exe pid 9902 thread
thedivision.exe pid 9962)
amdgpu 0000:03:00.0: amdgpu: in page starting at address
0x000080013b3cc000 from client 10
amdgpu 0000:03:00.0: amdgpu: GCVM_L2_PROTECTION_FAULT_STATUS:0x00500830
amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: CPF (0x4)
amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x0
amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
amdgpu 0000:03:00.0: amdgpu: RW: 0x0

Since the hangs are random in nature, it is very difficult to relate
them to any particular change.

I really want to add Tested-by: Mikhail Gavrilov <[email protected]>,
but I'm not sure whether I have the right to do so while the GPU is
still not stable, for some unknown reason.

All full kernel logs are attached below.

On Wed, Apr 26, 2023 at 4:50 PM Christian König
<[email protected]> wrote:
>
> Sending that once more from my mailing list address since AMD internal
> servers are blocking the mail.
>
> Regards,
> Christian.
>
> On 26.04.23 13:48, Christian König wrote:
> > WTF? I owe you a beer!
> >
> > I've fixed exactly that problem during the review process of the
> > cleanup patch and, because of that, didn't consider that the code
> > was still there.
> >
> > It also explains why we don't see this in our testing.
> >
> > @Mikhail can you test that patch with drm-misc-next?

Christian, on drm-misc-next should I test Guchun's patch or
something else?
I already tested Guchun's patch on top of 6.4-git58390c8ce1bd and
shared my results above.

--
Best Regards,
Mike Gavrilov.


Attachments:
dmesg-gfxhub-page-fault-7.tar.xz (41.60 kB)
dmesg-gfxhub-page-fault-6.tar.xz (47.86 kB)
dmesg-gfxhub-page-fault-5.tar.xz (43.77 kB)
dmesg-gfxhub-page-fault-4.tar.xz (47.11 kB)
dmesg-gfxhub-page-fault-3.tar.xz (53.55 kB)
Download all attachments