2021-06-05 20:01:59

by Ondrej Zary

Subject: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Hello,
I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
I found various reports like this, but those were from back in February, so it should be fixed by now.

[ 21.003216] BUG: kernel NULL pointer dereference, address: 00000000
[ 21.003235] #PF: supervisor read access in kernel mode
[ 21.003243] #PF: error_code(0x0000) - not-present page
[ 21.003250] *pde = 00000000
[ 21.003258] Oops: 0000 [#1] SMP
[ 21.003268] CPU: 0 PID: 222 Comm: systemd-udevd Not tainted 5.13.0-rc4+ #327
[ 21.003278] Hardware name: /848P-ICH5, BIOS 6.00 PG 02/03/2005
[ 21.003285] EIP: nouveau_bo_sync_for_device+0x9e/0xbf [nouveau]
[ 21.003571] Code: 02 89 45 e8 01 d1 8b 19 89 5d ec bb 01 00 00 00 3b 5d e8 74 0d 89 d8 c1 e0 05 03 45 ec 39 04 99 74 1e 8b 46 10 89 d9 c1 e1 0c <8b> 14 10 8b 47 e0 8b 40 08 6a 01 e8 d5 03 55 df 01 5d f0 58 eb ae
[ 21.003588] EAX: 00000000 EBX: 00000010 ECX: 00010000 EDX: 00000000
[ 21.003597] ESI: c3e90280 EDI: c185a494 EBP: c2ed7c10 ESP: c2ed7bf8
[ 21.003606] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210206
[ 21.003615] CR0: 80050033 CR2: 00000000 CR3: 02ecb000 CR4: 00000690
[ 21.003625] Call Trace:
[ 21.003635] nouveau_bo_validate+0x3f/0x48 [nouveau]
[ 21.003911] nouveau_bo_pin+0xf0/0x187 [nouveau]
[ 21.004182] nouveau_channel_prep+0xc0/0x269 [nouveau]
[ 21.004454] nouveau_channel_new+0x3c/0x5f5 [nouveau]
[ 21.004725] ? slab_free_freelist_hook+0x3b/0xa7
[ 21.004740] ? kfree+0x9e/0x11a
[ 21.004749] ? nvif_object_sclass_put+0xd/0x16 [nouveau]
[ 21.004944] nouveau_drm_device_init+0x2e2/0x646 [nouveau]
[ 21.005186] ? pci_enable_device_flags+0x23/0x97
[ 21.005202] nouveau_drm_probe+0xe5/0x182 [nouveau]
[ 21.005443] ? nouveau_drm_device_init+0x646/0x646 [nouveau]
[ 21.005683] pci_device_probe+0x89/0xe9
[ 21.005696] really_probe+0x127/0x2b9
[ 21.005707] driver_probe_device+0x62/0x89
[ 21.005715] device_driver_attach+0x2e/0x41
[ 21.005724] __driver_attach+0x83/0x8a
[ 21.005732] bus_for_each_dev+0x4c/0x66
[ 21.005740] driver_attach+0x14/0x16
[ 21.005747] ? device_driver_attach+0x41/0x41
[ 21.005756] bus_add_driver+0xc5/0x16c
[ 21.005764] driver_register+0x87/0xb9
[ 21.005772] __pci_register_driver+0x38/0x3b
[ 21.005780] ? 0xf0be4000
[ 21.005787] nouveau_drm_init+0x14c/0x1000 [nouveau]
[ 21.005964] do_one_initcall+0x5a/0x134
[ 21.005975] ? __vunmap+0x124/0x12d
[ 21.005984] ? __vunmap+0x124/0x12d
[ 21.005992] ? kmem_cache_alloc+0xa8/0xb6
[ 21.006001] ? do_init_module+0x17/0x1cf
[ 21.006012] do_init_module+0x46/0x1cf
[ 21.006021] load_module+0x1799/0x1bcb
[ 21.006032] __ia32_sys_finit_module+0x72/0x7a
[ 21.006044] do_int80_syscall_32+0x53/0x62
[ 21.006054] entry_INT80_32+0xf0/0xf0
[ 21.006063] EIP: 0xb7f40092
[ 21.006071] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[ 21.006086] EAX: ffffffda EBX: 00000010 ECX: b7e9bbdd EDX: 00000000
[ 21.006095] ESI: 008f27d0 EDI: 008f9e10 EBP: 00000000 ESP: bfa140b8
[ 21.006103] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200296
[ 21.006114] Modules linked in: nouveau(+) snd_intel8x0 snd_ac97_codec pcmcia wmi hwmon ac97_bus yenta_socket pcmcia_rsrc drm_ttm_helper snd_pcm ttm snd_timer pcmcia_core psmouse 8139cp snd sg soundcore serio_raw parport_pc intel_agp parport
[ 21.006165] CR2: 0000000000000000
[ 21.006201] ---[ end trace 02dc541683feafc6 ]---
[ 21.006211] EIP: nouveau_bo_sync_for_device+0x9e/0xbf [nouveau]
[ 21.006460] Code: 02 89 45 e8 01 d1 8b 19 89 5d ec bb 01 00 00 00 3b 5d e8 74 0d 89 d8 c1 e0 05 03 45 ec 39 04 99 74 1e 8b 46 10 89 d9 c1 e1 0c <8b> 14 10 8b 47 e0 8b 40 08 6a 01 e8 d5 03 55 df 01 5d f0 58 eb ae
[ 21.006476] EAX: 00000000 EBX: 00000010 ECX: 00010000 EDX: 00000000
[ 21.006485] ESI: c3e90280 EDI: c185a494 EBP: c2ed7c10 ESP: c2ed7bf8
[ 21.006494] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210206
[ 21.006503] CR0: 80050033 CR2: 00000000 CR3: 02ecb000 CR4: 00000690


--
Ondrej Zary


2021-06-05 21:25:08

by Ilia Mirkin

Subject: Re: [Nouveau] nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Another instance of a report like this here:
https://gitlab.freedesktop.org/drm/nouveau/-/issues/92

On Sat, Jun 5, 2021 at 3:53 PM Ondrej Zary <[email protected]> wrote:
>
> Hello,
> I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> I found various reports like this, but those were from back in February, so it should be fixed by now.

2021-06-05 21:39:17

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> Hello,
> I'm testing 5.13.0-rc4 and nouveau crashes with a NULL pointer dereference in nouveau_bo_sync_for_device.
> I found various reports like this, but those were from back in February, so it should be fixed by now.

So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html

Added some debug printks to nouveau_bo_sync_for_device:
[ 22.225048] ttm_dma=fc33b500
[ 22.225066] ttm_dma->num_pages=18
[ 22.225071] i=0 num_pages=16
[ 22.225077] ttm_dma->dma_address=00000000
[ 22.225094] BUG: kernel NULL pointer dereference, address: 00000000

So ttm_dma->dma_address is NULL.
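
For illustration only, here is a minimal userspace C sketch of the failure mode the printks suggest: the address array pointer is NULL while num_pages is non-zero, so indexing element 0 reads address 0. The struct tt_model type and the first_dma_address() helper are hypothetical stand-ins, not the nouveau or TTM code, and the guard shown is an assumption about one way to avoid the read, not necessarily the upstream fix.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the two fields printed by the debug printks. */
struct tt_model {
    uint32_t num_pages;
    const uint32_t *dma_address; /* may be NULL, as in the log above */
};

/*
 * Store the first DMA address in *out and return 0 if it is safe to read.
 * Return -1 when the array is missing or empty, instead of dereferencing
 * a NULL pointer.
 */
static int first_dma_address(const struct tt_model *tt, uint32_t *out)
{
    if (!tt || !tt->dma_address || tt->num_pages == 0)
        return -1;

    *out = tt->dma_address[0];
    return 0;
}
```

With the values from the log (num_pages=18, dma_address=NULL), first_dma_address() returns -1 rather than reading through the NULL pointer.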

--
Ondrej Zary

2021-06-06 21:17:51

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
>
> Added some debug printks to nouveau_bo_sync_for_device:
> [ 22.225048] ttm_dma=fc33b500
> [ 22.225066] ttm_dma->num_pages=18
> [ 22.225071] i=0 num_pages=16
> [ 22.225077] ttm_dma->dma_address=00000000
> [ 22.225094] BUG: kernel NULL pointer dereference, address: 00000000
>
> So ttm->dma_address is NULL.
>

Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
Not sure what I did before.

Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
As always with nouveau...

--
Ondrej Zary

2021-06-07 21:02:29

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> Not sure what I did before.
>
> Bisecting between 5.10 and 5.11 is impossible - I keep hitting a never-ending stream of bugs.
> As always with nouveau...

e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
Going back one commit makes it crash in a different way:

[ 55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
[ 55.444219] #PF: supervisor read access in kernel mode
[ 55.444222] #PF: error_code(0x0000) - not-present page
[ 55.444225] *pde = 00000000
[ 55.444231] Oops: 0000 [#1] SMP
[ 55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
[ 55.444240] Hardware name: /848P-ICH5, BIOS 6.00 PG 02/03/2005
[ 55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
[ 55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
[ 55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
[ 55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
[ 55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[ 55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
[ 55.444344] Call Trace:
[ 55.444395] nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
[ 55.444442] ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
[ 55.444451] drm_mode_cursor_common+0x13b/0x1ad
[ 55.444497] ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
[ 55.444504] drm_mode_cursor_ioctl+0x2e/0x36
[ 55.444509] ? drm_mode_setplane+0x203/0x203
[ 55.444514] drm_ioctl_kernel+0x66/0x99
[ 55.444518] drm_ioctl+0x211/0x2d8
[ 55.444522] ? drm_mode_setplane+0x203/0x203
[ 55.444529] ? _cond_resched+0x1e/0x22
[ 55.444533] ? mutex_lock+0xb/0x24
[ 55.444582] ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
[ 55.444589] ? rpm_resume.part.13+0x72/0x365
[ 55.444594] ? ktime_get_mono_fast_ns+0x5e/0xf2
[ 55.444598] ? __pm_runtime_resume+0x5b/0x63
[ 55.444647] nouveau_drm_ioctl+0x65/0x81 [nouveau]
[ 55.444696] ? nouveau_cli_work+0xc3/0xc3 [nouveau]
[ 55.444702] vfs_ioctl+0x1a/0x24
[ 55.444706] __ia32_sys_ioctl+0x583/0x59d
[ 55.444711] ? doublefault_shim+0x120/0x120
[ 55.444717] ? exit_to_user_mode_prepare+0x71/0xba
[ 55.444721] do_int80_syscall_32+0x2c/0x39
[ 55.444725] entry_INT80_32+0xf0/0xf0
[ 55.444729] EIP: 0xb7fb2092
[ 55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
[ 55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
[ 55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
[ 55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
[ 55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
[ 55.444769] CR2: 00000000000001b0
[ 55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
[ 55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
[ 55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
[ 55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
[ 55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
[ 55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
[ 55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690


--
Ondrej Zary

2021-06-09 10:16:01

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
> e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
> Going back one commit makes it crash in a different way:

Bisected this crash:
# first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5

Adding Christian König to CC.


--
Ondrej Zary

2021-06-09 11:42:40

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
> Bisected this crash:
> # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
>
> Adding Christian König to CC.

Tracked it down to an uninitialized variable bug.
I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

--
Ondrej Zary

2021-06-09 11:43:12

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:
> On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
> > On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
> > > On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> > > > On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> > > > > On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> > > > > > Hello,
> > > > > > I'm testing 5.13.0-rc4 and nouveau crashes with NULL pointer dereference in nouveau_bo_sync_for_device.
> > > > > > Found various reports like this but that was back in februaryso that should be fixed now.
> > > > >
> > > > > So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> > > > > https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> > > > >
> > > > > Added some debug printks to nouveau_bo_sync_for_device:
> > > > > [ 22.225048] ttm_dma=fc33b500
> > > > > [ 22.225066] ttm_dma->num_pages=18
> > > > > [ 22.225071] i=0 num_pages=16
> > > > > [ 22.225077] ttm_dma->dma_address=00000000
> > > > > [ 22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> > > > >
> > > > > So ttm->dma_address is NULL.
> > > > >
> > > >
> > > > Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> > > > Not sure what I did before.
> > > >
> > > > Bisecting between 5.10 and 5.11 is impossible - I keep hitting neverending stream of bugs.
> > > > As always with nouveau...
> > >
> > > e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
> > > Going back one commit makes it crash in a different way:
> > >
> > > [ 55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
> > > [ 55.444219] #PF: supervisor read access in kernel mode
> > > [ 55.444222] #PF: error_code(0x0000) - not-present page
> > > [ 55.444225] *pde = 00000000
> > > [ 55.444231] Oops: 0000 [#1] SMP
> > > [ 55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
> > > [ 55.444240] Hardware name: /848P-ICH5, BIOS 6.00 PG 02/03/2005
> > > [ 55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> > > [ 55.444326] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> > > [ 55.444330] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> > > [ 55.444334] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> > > [ 55.444338] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> > > [ 55.444341] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> > > [ 55.444344] Call Trace:
> > > [ 55.444395] nv04_crtc_cursor_set+0x148/0x1d8 [nouveau]
> > > [ 55.444442] ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> > > [ 55.444451] drm_mode_cursor_common+0x13b/0x1ad
> > > [ 55.444497] ? ttm_bo_reserve.constprop.15+0x1c/0x1c [nouveau]
> > > [ 55.444504] drm_mode_cursor_ioctl+0x2e/0x36
> > > [ 55.444509] ? drm_mode_setplane+0x203/0x203
> > > [ 55.444514] drm_ioctl_kernel+0x66/0x99
> > > [ 55.444518] drm_ioctl+0x211/0x2d8
> > > [ 55.444522] ? drm_mode_setplane+0x203/0x203
> > > [ 55.444529] ? _cond_resched+0x1e/0x22
> > > [ 55.444533] ? mutex_lock+0xb/0x24
> > > [ 55.444582] ? nouveau_bo_add_io_reserve_lru+0x53/0x58 [nouveau]
> > > [ 55.444589] ? rpm_resume.part.13+0x72/0x365
> > > [ 55.444594] ? ktime_get_mono_fast_ns+0x5e/0xf2
> > > [ 55.444598] ? __pm_runtime_resume+0x5b/0x63
> > > [ 55.444647] nouveau_drm_ioctl+0x65/0x81 [nouveau]
> > > [ 55.444696] ? nouveau_cli_work+0xc3/0xc3 [nouveau]
> > > [ 55.444702] vfs_ioctl+0x1a/0x24
> > > [ 55.444706] __ia32_sys_ioctl+0x583/0x59d
> > > [ 55.444711] ? doublefault_shim+0x120/0x120
> > > [ 55.444717] ? exit_to_user_mode_prepare+0x71/0xba
> > > [ 55.444721] do_int80_syscall_32+0x2c/0x39
> > > [ 55.444725] entry_INT80_32+0xf0/0xf0
> > > [ 55.444729] EIP: 0xb7fb2092
> > > [ 55.444733] Code: 00 00 00 e9 90 ff ff ff ff a3 24 00 00 00 68 30 00 00 00 e9 80 ff ff ff ff a3 e8 ff ff ff 66 90 00 00 00 00 00 00 00 00 cd 80 <c3> 8d b4 26 00 00 00 00 8d b6 00 00 00 00 8b 1c 24 c3 8d b4 26 00
> > > [ 55.444737] EAX: ffffffda EBX: 0000000e ECX: c01c64a3 EDX: bfe89750
> > > [ 55.444741] ESI: 02580b40 EDI: c01c64a3 EBP: 0000000e ESP: bfe89704
> > > [ 55.444744] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00200292
> > > [ 55.444748] Modules linked in: i2c_dev nouveau serial_cs snd_intel8x0 snd_ac97_codec wmi hwmon ttm ac97_bus 8139cp snd_pcm pcmcia snd_timer snd sg soundcore psmouse yenta_socket serio_raw pcmcia_rsrc pcmcia_core intel_agp parport_pc parport
> > > [ 55.444769] CR2: 00000000000001b0
> > > [ 55.444774] ---[ end trace e2b0d4c3c2e4e488 ]---
> > > [ 55.444827] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> > > [ 55.444831] Code: 85 ff 74 0d 80 7d f3 00 74 07 80 a6 f4 01 00 00 fe 89 f0 e8 0c ef ff ff 8d 65 f4 89 f8 5b 5e 5f 5d c3 55 01 d2 89 e5 53 89 c3 <03> 93 b0 01 00 00 0f b7 c1 f6 83 b8 01 00 00 80 74 07 e8 40 49 69
> > > [ 55.444835] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000000
> > > [ 55.444838] ESI: 00000020 EDI: e7a14400 EBP: e786fd98 ESP: e786fd94
> > > [ 55.444842] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 EFLAGS: 00210246
> > > [ 55.444845] CR0: 80050033 CR2: 000001b0 CR3: 27896000 CR4: 00000690
> >
> > Bisected this crash:
> > # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
> >
> > Adding Christian König to CC.
>
> Tracked it down to an uninitialized variable bug.
> I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

So the first bad commit for the original bug is e34b8feeaa4b65725b25f49c9b08a0f8707e8e86
(as bisected before).
Going one commit back and fixing the uninitialized variable and endian bugs manually makes nouveau work.

--
Ondrej Zary

2021-06-09 13:24:52

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Wednesday 09 June 2021, Christian König wrote:
> Am 08.06.21 um 23:59 schrieb Ondrej Zary:
> > On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:
> >> On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
> >>> On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
> >>>> On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
> >>>>> On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
> >>>>>> On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
> >>>>>>> Hello,
> >>>>>>> I'm testing 5.13.0-rc4 and nouveau crashes with NULL pointer dereference in nouveau_bo_sync_for_device.
> >>>>>>> Found various reports like this but that was back in February, so that should be fixed now.
> >>>>>> So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
> >>>>>> https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
> >>>>>>
> >>>>>> Added some debug printks to nouveau_bo_sync_for_device:
> >>>>>> [ 22.225048] ttm_dma=fc33b500
> >>>>>> [ 22.225066] ttm_dma->num_pages=18
> >>>>>> [ 22.225071] i=0 num_pages=16
> >>>>>> [ 22.225077] ttm_dma->dma_address=00000000
> >>>>>> [ 22.225094] BUG: kernel NULL pointer dereference, address: 00000000
> >>>>>>
> >>>>>> So ttm->dma_address is NULL.
> >>>>>>
> >>>>> Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
> >>>>> Not sure what I did before.
> >>>>>
> >>>>> Bisecting between 5.10 and 5.11 is impossible - I keep hitting neverending stream of bugs.
> >>>>> As always with nouveau...
> >>>> e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
> >>>> Going back one commit makes it crash in a different way:
> >>>>
> >>>> [ 55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
> >>>> [ 55.444219] #PF: supervisor read access in kernel mode
> >>>> [ 55.444222] #PF: error_code(0x0000) - not-present page
> >>>> [ 55.444225] *pde = 00000000
> >>>> [ 55.444231] Oops: 0000 [#1] SMP
> >>>> [ 55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
> >>>> [ 55.444240] Hardware name: /848P-ICH5, BIOS 6.00 PG 02/03/2005
> >>>> [ 55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
> >>>> [SNIP]
> >>> Bisected this crash:
> >>> # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
> >>>
> >>> Adding Christian König to CC.
> >> Tracked it down to an uninitialized variable bug.
> >> I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> > So the first bad commit for the original bug is e34b8feeaa4b65725b25f49c9b08a0f8707e8e86
> > (as bisected before).
> > Going one commit back and fixing the uninitialized variable and endian bugs manually makes nouveau work.
>
> Thanks for the heads up. So the problem with my patch is already fixed,
> isn't it?

The NULL pointer dereference in nouveau_bo_wr16 introduced in
141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.

That's the bug I hit when bisecting the original problem:
NULL pointer dereference in nouveau_bo_sync_for_device
It's caused by:
# first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt

--
Ondrej Zary

2021-06-09 13:28:02

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Wednesday 09 June 2021, Christian König wrote:
> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> > [SNIP]
> >> Thanks for the heads up. So the problem with my patch is already fixed,
> >> isn't it?
> > The NULL pointer dereference in nouveau_bo_wr16 introduced in
> > 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> > aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >
> > That's the bug I hit when bisecting the original problem:
> > NULL pointer dereference in nouveau_bo_sync_for_device
> > It's caused by:
> > # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
>
> Good that I've asked :)
>
> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
> created mostly automated.
>
> Do you have the original backtrace of that NULL pointer deref once more?

The original backtrace is here: https://lkml.org/lkml/2021/6/5/350

--
Ondrej Zary

2021-06-09 14:06:57

by Christian König

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Am 09.06.21 um 09:10 schrieb Ondrej Zary:
> On Wednesday 09 June 2021, Christian König wrote:
>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
>>> [SNIP]
>>>> Thanks for the heads up. So the problem with my patch is already fixed,
>>>> isn't it?
>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>>>
>>> That's the bug I hit when bisecting the original problem:
>>> NULL pointer dereference in nouveau_bo_sync_for_device
>>> It's caused by:
>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
>> Good that I've asked :)
>>
>> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
>> created mostly automated.
>>
>> Do you have the original backtrace of that NULL pointer deref once more?
> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350

And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
don't see how that can happen since nouveau is using ttm_sg_tt_init().

Apart from that what nouveau does here is rather questionable since you
need a coherent architecture for most things anyway, but that's not what
we are trying to fix here.

Can you try to narrow down if ttm_sg_tt_init is called before calling
this function for the tt object in question?

Thanks,
Christian.

2021-06-09 17:10:15

by Christian König

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Am 08.06.21 um 23:59 schrieb Ondrej Zary:
> On Tuesday 08 June 2021 22:01:56 Ondrej Zary wrote:
>> On Tuesday 08 June 2021 20:47:42 Ondrej Zary wrote:
>>> On Monday 07 June 2021 22:58:43 Ondrej Zary wrote:
>>>> On Sunday 06 June 2021 23:16:03 Ondrej Zary wrote:
>>>>> On Saturday 05 June 2021 23:34:23 Ondrej Zary wrote:
>>>>>> On Saturday 05 June 2021 21:43:52 Ondrej Zary wrote:
>>>>>>> Hello,
>>>>>>> I'm testing 5.13.0-rc4 and nouveau crashes with NULL pointer dereference in nouveau_bo_sync_for_device.
>>>>>>> Found various reports like this but that was back in February, so that should be fixed now.
>>>>>> So it is the same bug. Broken since 5.11. This revert fixes it in 5.11:
>>>>>> https://lists.freedesktop.org/archives/dri-devel/2021-February/298531.html
>>>>>>
>>>>>> Added some debug printks to nouveau_bo_sync_for_device:
>>>>>> [ 22.225048] ttm_dma=fc33b500
>>>>>> [ 22.225066] ttm_dma->num_pages=18
>>>>>> [ 22.225071] i=0 num_pages=16
>>>>>> [ 22.225077] ttm_dma->dma_address=00000000
>>>>>> [ 22.225094] BUG: kernel NULL pointer dereference, address: 00000000
>>>>>>
>>>>>> So ttm->dma_address is NULL.
>>>>>>
>>>>> Tested reverting f295c8cfec833c2707ff1512da10d65386dde7af again and it does not work...
>>>>> Not sure what I did before.
>>>>>
>>>>> Bisecting between 5.10 and 5.11 is impossible - I keep hitting neverending stream of bugs.
>>>>> As always with nouveau...
>>>> e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 seems to be the first bad commit
>>>> Going back one commit makes it crash in a different way:
>>>>
>>>> [ 55.444208] BUG: kernel NULL pointer dereference, address: 000001b0
>>>> [ 55.444219] #PF: supervisor read access in kernel mode
>>>> [ 55.444222] #PF: error_code(0x0000) - not-present page
>>>> [ 55.444225] *pde = 00000000
>>>> [ 55.444231] Oops: 0000 [#1] SMP
>>>> [ 55.444237] CPU: 0 PID: 1740 Comm: Xorg Not tainted 5.9.0-rc5+ #361
>>>> [ 55.444240] Hardware name: /848P-ICH5, BIOS 6.00 PG 02/03/2005
>>>> [ 55.444321] EIP: nouveau_bo_wr16+0x8/0x27 [nouveau]
>>>> [SNIP]
>>> Bisected this crash:
>>> # first bad commit: [141b15e59175aa174ca1f7596188bd15a7ca17ba] drm/nouveau: move io_reserve_lru handling into the driver v5
>>>
>>> Adding Christian König to CC.
>> Tracked it down to an uninitialized variable bug.
>> I see now that this was fixed by aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> So the first bad commit for the original bug is e34b8feeaa4b65725b25f49c9b08a0f8707e8e86
> (as bisected before).
> Going one commit back and fixing the uninitialized variable and endian bugs manually makes nouveau work.

Thanks for the heads up. So the problem with my patch is already fixed,
isn't it?

Regards,
Christian.

2021-06-09 17:11:00

by Christian König

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> [SNIP]
>> Thanks for the heads up. So the problem with my patch is already fixed,
>> isn't it?
> The NULL pointer dereference in nouveau_bo_wr16 introduced in
> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>
> That's the bug I hit when bisecting the original problem:
> NULL pointer dereference in nouveau_bo_sync_for_device
> It's caused by:
> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt

Good that I've asked :)

Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
created mostly automated.

Do you have the original backtrace of that NULL pointer deref once more?

Thanks,
Christian.

2021-06-09 20:04:31

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Wednesday 09 June 2021 11:21:05 Christian König wrote:
> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
> > On Wednesday 09 June 2021, Christian König wrote:
> >> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> >>> [SNIP]
> >>>> Thanks for the heads up. So the problem with my patch is already fixed,
> >>>> isn't it?
> >>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
> >>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> >>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >>>
> >>> That's the bug I hit when bisecting the original problem:
> >>> NULL pointer dereference in nouveau_bo_sync_for_device
> >>> It's caused by:
> >>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
> >> Good that I've asked :)
> >>
> >> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
> >> created mostly automated.
> >>
> >> Do you have the original backtrace of that NULL pointer deref once more?
> > The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
>
> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
> don't see how that can happen since nouveau is using ttm_sg_tt_init().
>
> Apart from that what nouveau does here is rather questionable since you
> need a coherent architecture for most things anyway, but that's not what
> we are trying to fix here.
>
> Can you try to narrow down if ttm_sg_tt_init is called before calling
> this function for the tt object in question?

ttm_sg_tt_init is not called:
[ 12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
[ 12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
[ 12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
[ 12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
[ 12.151362] ttm_tt_init
[ 12.151370] ttm_tt_init_fields
[ 12.151374] ttm_tt_alloc_page_directory
[ 12.151615] BUG: kernel NULL pointer dereference, address: 00000000



--
Ondrej Zary

2021-06-10 06:44:47

by Christian König

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device



Am 09.06.21 um 22:00 schrieb Ondrej Zary:
> On Wednesday 09 June 2021 11:21:05 Christian König wrote:
>> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
>>> On Wednesday 09 June 2021, Christian König wrote:
>>>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
>>>>> [SNIP]
>>>>>> Thanks for the heads up. So the problem with my patch is already fixed,
>>>>>> isn't it?
>>>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
>>>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
>>>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>>>>>
>>>>> That's the bug I hit when bisecting the original problem:
>>>>> NULL pointer dereference in nouveau_bo_sync_for_device
>>>>> It's caused by:
>>>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
>>>> Good that I've asked :)
>>>>
>>>> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
>>>> created mostly automated.
>>>>
>>>> Do you have the original backtrace of that NULL pointer deref once more?
>>> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
>> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
>> don't see how that can happen since nouveau is using ttm_sg_tt_init().
>>
>> Apart from that what nouveau does here is rather questionable since you
>> need a coherent architecture for most things anyway, but that's not what
>> we are trying to fix here.
>>
>> Can you try to narrow down if ttm_sg_tt_init is called before calling
>> this function for the tt object in question?
> ttm_sg_tt_init is not called:
> [ 12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
> [ 12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
> [ 12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
> [ 12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
> [ 12.151362] ttm_tt_init
> [ 12.151370] ttm_tt_init_fields
> [ 12.151374] ttm_tt_alloc_page_directory
> [ 12.151615] BUG: kernel NULL pointer dereference, address: 00000000

Please add dump_stack(); to ttm_tt_init() and report back with the
backtrace.

I can't see how this is called from the nouveau code; the only possibility
I see is that it is called through the AGP code somehow.

Christian.
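
The dump_stack() request above is the usual way to find out who calls a
function: print the call chain from inside it. dump_stack() exists only in
the kernel, so the following is just a userspace analogy, assuming glibc's
backtrace() and backtrace_symbols_fd() from <execinfo.h> are available.
print_call_chain() and traced_init() are made-up names for illustration and
are not part of TTM or nouveau.

```c
#include <assert.h>
#include <execinfo.h>
#include <stdio.h>

#define MAX_FRAMES 32

/*
 * Userspace stand-in for dump_stack(): write the current call chain to
 * stderr and return the number of frames that were captured.
 */
int print_call_chain(void)
{
    void *frames[MAX_FRAMES];
    int depth = backtrace(frames, MAX_FRAMES);

    /* backtrace() never reports more frames than the buffer holds. */
    assert(depth <= MAX_FRAMES);

    /* fd 2 is stderr; raw addresses are printed if symbols are missing. */
    backtrace_symbols_fd(frames, depth, 2);
    return depth;
}

/* Hypothetical init function whose caller we want to identify. */
int traced_init(void)
{
    fprintf(stderr, "traced_init called from:\n");
    return print_call_chain();
}
```

Calling traced_init() from any code path prints that path to stderr, which
is the same information the dump_stack() output gives for ttm_tt_init().
Function names only show up if the binary exports its symbols (for example
when linked with -rdynamic); otherwise only addresses are printed.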

2021-06-10 17:52:04

by Ondrej Zary

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Thursday 10 June 2021 08:43:06 Christian König wrote:
>
> Am 09.06.21 um 22:00 schrieb Ondrej Zary:
> > On Wednesday 09 June 2021 11:21:05 Christian König wrote:
> >> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
> >>> On Wednesday 09 June 2021, Christian König wrote:
> >>>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
> >>>>> [SNIP]
> >>>>>> Thanks for the heads up. So the problem with my patch is already fixed,
> >>>>>> isn't it?
> >>>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
> >>>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
> >>>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
> >>>>>
> >>>>> That's the bug I hit when bisecting the original problem:
> >>>>> NULL pointer dereference in nouveau_bo_sync_for_device
> >>>>> It's caused by:
> >>>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
> >>>> Good that I've asked :)
> >>>>
> >>>> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
> >>>> created mostly automated.
> >>>>
> >>>> Do you have the original backtrace of that NULL pointer deref once more?
> >>> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
> >> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
> >> don't see how that can happen since nouveau is using ttm_sg_tt_init().
> >>
> >> Apart from that what nouveau does here is rather questionable since you
> >> need a coherent architecture for most things anyway, but that's not what
> >> we are trying to fix here.
> >>
> >> Can you try to narrow down if ttm_sg_tt_init is called before calling
> >> this function for the tt object in question?
> > ttm_sg_tt_init is not called:
> > [ 12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
> > [ 12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
> > [ 12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
> > [ 12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
> > [ 12.151362] ttm_tt_init
> > [ 12.151370] ttm_tt_init_fields
> > [ 12.151374] ttm_tt_alloc_page_directory
> > [ 12.151615] BUG: kernel NULL pointer dereference, address: 00000000
>
> Please add dump_stack(); to ttm_tt_init() and report back with the
> backtrace.
>
> I can't see how this is called from the nouveau code, only possibility I
> see is that it is maybe called through the AGP code somehow.

Yes, you're right:
[ 13.192663] Call Trace:
[ 13.192678] dump_stack+0x54/0x68
[ 13.192690] ttm_tt_init+0x11/0x8a [ttm]
[ 13.192699] ttm_agp_tt_create+0x39/0x51 [ttm]
[ 13.192840] nouveau_ttm_tt_create+0x17/0x22 [nouveau]
[ 13.192856] ttm_tt_create+0x78/0x8c [ttm]
[ 13.192864] ttm_bo_handle_move_mem+0x7d/0xca [ttm]
[ 13.192873] ttm_bo_validate+0x92/0xc8 [ttm]
[ 13.192883] ttm_bo_init_reserved+0x216/0x243 [ttm]
[ 13.192892] ttm_bo_init+0x45/0x65 [ttm]
[ 13.193018] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[ 13.193150] nouveau_bo_init+0x8c/0x94 [nouveau]
[ 13.193273] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[ 13.193407] nouveau_bo_new+0x44/0x57 [nouveau]
[ 13.193537] nouveau_channel_prep+0xa3/0x269 [nouveau]
[ 13.193665] nouveau_channel_new+0x3c/0x5f7 [nouveau]
[ 13.193679] ? slab_free_freelist_hook+0x3b/0xa7
[ 13.193686] ? kfree+0x9e/0x11a
[ 13.193781] ? nvif_object_sclass_put+0xd/0x16 [nouveau]
[ 13.193908] nouveau_drm_device_init+0x2e2/0x646 [nouveau]
[ 13.193924] ? pci_enable_device_flags+0x1e/0xac
[ 13.194052] nouveau_drm_probe+0xeb/0x188 [nouveau]
[ 13.194182] ? nouveau_drm_device_init+0x646/0x646 [nouveau]
[ 13.194195] pci_device_probe+0x89/0xe9
[ 13.194205] really_probe+0x127/0x2a7
[ 13.194212] driver_probe_device+0x5b/0x87
[ 13.194219] device_driver_attach+0x2e/0x41
[ 13.194226] __driver_attach+0x7c/0x83
[ 13.194232] bus_for_each_dev+0x4c/0x66
[ 13.194238] driver_attach+0x14/0x16
[ 13.194244] ? device_driver_attach+0x41/0x41
[ 13.194251] bus_add_driver+0xc5/0x16c
[ 13.194258] driver_register+0x87/0xb9
[ 13.194265] __pci_register_driver+0x38/0x3b
[ 13.194271] ? 0xf0c0d000
[ 13.194362] nouveau_drm_init+0x14c/0x1000 [nouveau]

How is ttm_dma_tt->dma_address allocated? I cannot find any assignment
executed (in the working code):

$ git grep dma_address\ = drivers/gpu/
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: sg->sgl->dma_address = addr;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = &dma->dma_address[offset >> PAGE_SHIFT];
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = (mm_node->start << PAGE_SHIFT) + offset;
drivers/gpu/drm/i915/gvt/scheduler.c: sg->dma_address = addr;
drivers/gpu/drm/i915/i915_gpu_error.c: sg->dma_address = it;
drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages);
drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages,
drivers/gpu/drm/ttm/ttm_tt.c: ttm_dma->dma_address = NULL;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_phys_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_dma_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_sg_addr;

The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
ttm_sg_tt_alloc_page_directory().
Confirmed by adding printk()s that they're NOT called.


--
Ondrej Zary
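
The finding above can be pictured with a small userspace model: one init
path allocates only the page directory, another also allocates the
dma_address array, and a sync loop that walks dma_address only works for
objects created through the second path. This is a sketch under stated
assumptions, not the kernel code: struct model_tt and all model_* functions
are hypothetical names, and the NULL check in model_sync_for_device() is
only there to make the failure visible. It is not the fix that was applied
to nouveau or TTM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, heavily simplified stand-in for a TTM tt object. */
struct model_tt {
    size_t num_pages;
    void **pages;           /* page directory, allocated by every init path */
    uint64_t *dma_address;  /* allocated only by the DMA-aware init path */
};

/* Models the plain init path: dma_address is left NULL. */
int model_tt_init(struct model_tt *tt, size_t num_pages)
{
    assert(tt != NULL);
    tt->num_pages = num_pages;
    tt->pages = calloc(num_pages, sizeof(*tt->pages));
    tt->dma_address = NULL;
    return tt->pages ? 0 : -1;
}

/* Models the DMA-aware init path: also allocates dma_address. */
int model_tt_init_dma(struct model_tt *tt, size_t num_pages)
{
    if (model_tt_init(tt, num_pages))
        return -1;
    tt->dma_address = calloc(num_pages, sizeof(*tt->dma_address));
    if (!tt->dma_address) {
        free(tt->pages);
        tt->pages = NULL;
        return -1;
    }
    return 0;
}

/* Releases whatever the init paths allocated. */
void model_tt_fini(struct model_tt *tt)
{
    free(tt->dma_address);
    free(tt->pages);
    tt->dma_address = NULL;
    tt->pages = NULL;
}

/*
 * Models a sync loop that reads dma_address[i] for every page.
 * Returns the number of pages visited, or -1 if dma_address was never
 * allocated (the case that shows up as a NULL dereference in the oops).
 */
long model_sync_for_device(const struct model_tt *tt)
{
    long synced = 0;

    if (!tt->dma_address)
        return -1;
    for (size_t i = 0; i < tt->num_pages; i++) {
        (void)tt->dma_address[i];  /* a driver would sync this page here */
        synced++;
    }
    return synced;
}
```

In this model an object set up with model_tt_init() makes
model_sync_for_device() return -1, while one set up with
model_tt_init_dma() is walked page by page. That mirrors the backtrace
above, where the object is created through ttm_agp_tt_create() and
ttm_tt_init(), and neither of the two functions that assign dma_address
is called.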

2021-06-10 18:01:08

by Christian König

Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

Am 10.06.21 um 19:50 schrieb Ondrej Zary:
> On Thursday 10 June 2021 08:43:06 Christian König wrote:
>> Am 09.06.21 um 22:00 schrieb Ondrej Zary:
>>> On Wednesday 09 June 2021 11:21:05 Christian König wrote:
>>>> Am 09.06.21 um 09:10 schrieb Ondrej Zary:
>>>>> On Wednesday 09 June 2021, Christian König wrote:
>>>>>> Am 09.06.21 um 08:57 schrieb Ondrej Zary:
>>>>>>> [SNIP]
>>>>>>>> Thanks for the heads up. So the problem with my patch is already fixed,
>>>>>>>> isn't it?
>>>>>>> The NULL pointer dereference in nouveau_bo_wr16 introduced in
>>>>>>> 141b15e59175aa174ca1f7596188bd15a7ca17ba was fixed by
>>>>>>> aea656b0d05ec5b8ed5beb2f94c4dd42ea834e9d.
>>>>>>>
>>>>>>> That's the bug I hit when bisecting the original problem:
>>>>>>> NULL pointer dereference in nouveau_bo_sync_for_device
>>>>>>> It's caused by:
>>>>>>> # first bad commit: [e34b8feeaa4b65725b25f49c9b08a0f8707e8e86] drm/ttm: merge ttm_dma_tt back into ttm_tt
>>>>>> Good that I've asked :)
>>>>>>
>>>>>> Ok that's a bit strange. e34b8feeaa4b65725b25f49c9b08a0f8707e8e86 was
>>>>>> created mostly automated.
>>>>>>
>>>>>> Do you have the original backtrace of that NULL pointer deref once more?
>>>>> The original backtrace is here: https://lkml.org/lkml/2021/6/5/350
>>>> And the problem is that ttm_dma->dma_address is NULL, right? Mhm, I
>>>> don't see how that can happen since nouveau is using ttm_sg_tt_init().
>>>>
>>>> Apart from that what nouveau does here is rather questionable since you
>>>> need a coherent architecture for most things anyway, but that's not what
>>>> we are trying to fix here.
>>>>
>>>> Can you try to narrow down if ttm_sg_tt_init is called before calling
>>>> this function for the tt object in question?
>>> ttm_sg_tt_init is not called:
>>> [ 12.150124] nouveau 0000:01:00.0: DRM: VRAM: 31 MiB
>>> [ 12.150133] nouveau 0000:01:00.0: DRM: GART: 128 MiB
>>> [ 12.150143] nouveau 0000:01:00.0: DRM: BMP version 5.6
>>> [ 12.150151] nouveau 0000:01:00.0: DRM: No DCB data found in VBIOS
>>> [ 12.151362] ttm_tt_init
>>> [ 12.151370] ttm_tt_init_fields
>>> [ 12.151374] ttm_tt_alloc_page_directory
>>> [ 12.151615] BUG: kernel NULL pointer dereference, address: 00000000
>> Please add dump_stack(); to ttm_tt_init() and report back with the
>> backtrace.
>>
>> I can't see how this is called from the nouveau code, only possibility I
>> see is that it is maybe called through the AGP code somehow.
> Yes, you're right:
> [ 13.192663] Call Trace:
> [ 13.192678] dump_stack+0x54/0x68
> [ 13.192690] ttm_tt_init+0x11/0x8a [ttm]
> [ 13.192699] ttm_agp_tt_create+0x39/0x51 [ttm]
> [ 13.192840] nouveau_ttm_tt_create+0x17/0x22 [nouveau]
> [ 13.192856] ttm_tt_create+0x78/0x8c [ttm]
> [ 13.192864] ttm_bo_handle_move_mem+0x7d/0xca [ttm]
> [ 13.192873] ttm_bo_validate+0x92/0xc8 [ttm]
> [ 13.192883] ttm_bo_init_reserved+0x216/0x243 [ttm]
> [ 13.192892] ttm_bo_init+0x45/0x65 [ttm]
> [ 13.193018] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> [ 13.193150] nouveau_bo_init+0x8c/0x94 [nouveau]
> [ 13.193273] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> [ 13.193407] nouveau_bo_new+0x44/0x57 [nouveau]
> [ 13.193537] nouveau_channel_prep+0xa3/0x269 [nouveau]
> [ 13.193665] nouveau_channel_new+0x3c/0x5f7 [nouveau]
> [ 13.193679] ? slab_free_freelist_hook+0x3b/0xa7
> [ 13.193686] ? kfree+0x9e/0x11a
> [ 13.193781] ? nvif_object_sclass_put+0xd/0x16 [nouveau]
> [ 13.193908] nouveau_drm_device_init+0x2e2/0x646 [nouveau]
> [ 13.193924] ? pci_enable_device_flags+0x1e/0xac
> [ 13.194052] nouveau_drm_probe+0xeb/0x188 [nouveau]
> [ 13.194182] ? nouveau_drm_device_init+0x646/0x646 [nouveau]
> [ 13.194195] pci_device_probe+0x89/0xe9
> [ 13.194205] really_probe+0x127/0x2a7
> [ 13.194212] driver_probe_device+0x5b/0x87
> [ 13.194219] device_driver_attach+0x2e/0x41
> [ 13.194226] __driver_attach+0x7c/0x83
> [ 13.194232] bus_for_each_dev+0x4c/0x66
> [ 13.194238] driver_attach+0x14/0x16
> [ 13.194244] ? device_driver_attach+0x41/0x41
> [ 13.194251] bus_add_driver+0xc5/0x16c
> [ 13.194258] driver_register+0x87/0xb9
> [ 13.194265] __pci_register_driver+0x38/0x3b
> [ 13.194271] ? 0xf0c0d000
> [ 13.194362] nouveau_drm_init+0x14c/0x1000 [nouveau]
>
> How is ttm_dma_tt->dma_address allocated?

Mhm, I need to double-check how AGP is supposed to work.

Since barely anybody uses it these days, it is something which breaks
from time to time.

Thanks for the backtrace,
Christian.

> I cannot find any assignment
> executed (in the working code):
>
> $ git grep dma_address\ = drivers/gpu/
> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: sg->sgl->dma_address = addr;
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = &dma->dma_address[offset >> PAGE_SHIFT];
> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = (mm_node->start << PAGE_SHIFT) + offset;
> drivers/gpu/drm/i915/gvt/scheduler.c: sg->dma_address = addr;
> drivers/gpu/drm/i915/i915_gpu_error.c: sg->dma_address = it;
> drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages);
> drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages,
> drivers/gpu/drm/ttm/ttm_tt.c: ttm_dma->dma_address = NULL;
> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_phys_addr;
> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_dma_addr;
> drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_sg_addr;
>
> The 2 cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and
> ttm_sg_tt_alloc_page_directory().
> Confirmed by adding printk()s that they're NOT called.
>
>
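For readers following along, the three allocation paths named in this thread can be modelled in userspace C. This is a minimal sketch, not the kernel code: the struct layout and every model_-prefixed name are assumptions made for illustration, and the behaviour is inferred only from the grep output and the function names quoted above. It shows why an object set up through the plain ttm_tt_init() path ends up with dma_address == NULL while the other two paths assign it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Pointer-sized stand-in for a DMA address (assumption, keeps alignment simple). */
typedef uintptr_t model_dma_addr_t;

/* Simplified stand-in for the TT object; not the real kernel layout. */
struct model_tt {
    void **pages;                   /* page pointer array, may be NULL */
    model_dma_addr_t *dma_address;  /* per-page DMA addresses, may be NULL */
    size_t num_pages;
};

static void model_tt_init(struct model_tt *tt, size_t num_pages)
{
    tt->pages = NULL;
    tt->dma_address = NULL;         /* "properly initialized to NULL" */
    tt->num_pages = num_pages;
}

/* Models ttm_tt_alloc_page_directory(): pages only, dma_address untouched. */
static int model_tt_alloc_page_directory(struct model_tt *tt)
{
    tt->pages = calloc(tt->num_pages, sizeof(*tt->pages));
    return tt->pages ? 0 : -1;
}

/* Models ttm_dma_tt_alloc_page_directory(): one allocation, with the
 * dma_address array placed directly behind the pages array. */
static int model_dma_tt_alloc_page_directory(struct model_tt *tt)
{
    tt->pages = calloc(tt->num_pages,
                       sizeof(*tt->pages) + sizeof(*tt->dma_address));
    if (!tt->pages)
        return -1;
    tt->dma_address = (model_dma_addr_t *)(void *)(tt->pages + tt->num_pages);
    return 0;
}

/* Models ttm_sg_tt_alloc_page_directory(): dma_address array only. */
static int model_sg_tt_alloc_page_directory(struct model_tt *tt)
{
    tt->dma_address = calloc(tt->num_pages, sizeof(*tt->dma_address));
    return tt->dma_address ? 0 : -1;
}

/* Frees whichever allocation the chosen path made. */
static void model_tt_fini(struct model_tt *tt)
{
    if (tt->pages)
        free(tt->pages);            /* also covers the trailing dma_address */
    else
        free(tt->dma_address);
    tt->pages = NULL;
    tt->dma_address = NULL;
}
```

In this model, the AGP case from the backtrace corresponds to model_tt_alloc_page_directory(): the pages array exists, but nothing ever assigns dma_address.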

2021-06-11 12:43:02

by Christian König

[permalink] [raw]
Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device



On 10.06.21 at 19:59, Christian König wrote:
> On 10.06.21 at 19:50, Ondrej Zary wrote:
>> [SNIP]
>>> I can't see how this is called from the nouveau code, only
>>> possibility I
>>> see is that it is maybe called through the AGP code somehow.
>> Yes, you're right:
>> [   13.192663] Call Trace:
>> [   13.192678]  dump_stack+0x54/0x68
>> [   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
>> [   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
>> [   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
>> [   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
>> [   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
>> [   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
>> [   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
>> [   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
>> [   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>> [   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
>> [   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>> [   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
>> [   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
>> [   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
>> [   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
>> [   13.193686]  ? kfree+0x9e/0x11a
>> [   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
>> [   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
>> [   13.193924]  ? pci_enable_device_flags+0x1e/0xac
>> [   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
>> [   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
>> [   13.194195]  pci_device_probe+0x89/0xe9
>> [   13.194205]  really_probe+0x127/0x2a7
>> [   13.194212]  driver_probe_device+0x5b/0x87
>> [   13.194219]  device_driver_attach+0x2e/0x41
>> [   13.194226]  __driver_attach+0x7c/0x83
>> [   13.194232]  bus_for_each_dev+0x4c/0x66
>> [   13.194238]  driver_attach+0x14/0x16
>> [   13.194244]  ? device_driver_attach+0x41/0x41
>> [   13.194251]  bus_add_driver+0xc5/0x16c
>> [   13.194258]  driver_register+0x87/0xb9
>> [   13.194265]  __pci_register_driver+0x38/0x3b
>> [   13.194271]  ? 0xf0c0d000
>> [   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]
>>
>> How is ttm_dma_tt->dma_address allocated?
>
> Mhm, I need to double check how AGP is supposed to work.
>
> Since barely anybody is using it these days it is something which
> breaks from time to time.

I have no idea how that ever worked in the first place, since AGP isn't
supposed to sync between CPU and GPU. Everything is coherent in that case.

Anyway, here is a patch which adds a check to those functions for whether
the dma_address array is allocated in the first place. Please test it.

Thanks,
Christian.



Attachments:
0001-drm-nouveau-check-dma_address-array-for-CPU-GPU-sync.patch (1.37 kB)
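
The attached patch is not reproduced in this thread. As a rough illustration of the kind of check described above, here is a userspace sketch that assumes a simplified TT object. struct model_tt, model_sync_fn and the model_bo_sync_* functions are hypothetical names and are not the nouveau code; the only idea taken from the thread is "return early when the dma_address array was never allocated".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the TT object (assumption, not the kernel struct). */
struct model_tt {
    uintptr_t *dma_address;   /* NULL when the array was never allocated */
    size_t num_pages;
};

/* Hypothetical per-page sync hook. */
typedef void (*model_sync_fn)(uintptr_t dma_addr, void *ctx);

/* The guard: there is nothing to sync without a TT object or without
 * an allocated dma_address array. */
static int model_tt_can_sync(const struct model_tt *tt)
{
    return tt != NULL && tt->dma_address != NULL;
}

/* Returns the number of pages handed to the sync hook. */
static size_t model_bo_sync_for_device(const struct model_tt *tt,
                                       model_sync_fn sync, void *ctx)
{
    size_t i;

    if (!model_tt_can_sync(tt))
        return 0;                       /* AGP-style object: skip */
    for (i = 0; i < tt->num_pages; i++)
        sync(tt->dma_address[i], ctx);
    return i;
}

/* Same guard for the CPU direction. */
static size_t model_bo_sync_for_cpu(const struct model_tt *tt,
                                    model_sync_fn sync, void *ctx)
{
    size_t i;

    if (!model_tt_can_sync(tt))
        return 0;
    for (i = 0; i < tt->num_pages; i++)
        sync(tt->dma_address[i], ctx);
    return i;
}

/* Example hook that only counts how often it was called. */
static void model_count_sync(uintptr_t dma_addr, void *ctx)
{
    (void)dma_addr;
    ++*(size_t *)ctx;
}
```

With the guard in place, an object whose dma_address is NULL is skipped in both directions instead of being dereferenced.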

2021-06-11 18:24:37

by Ondrej Zary

[permalink] [raw]
Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device

On Friday 11 June 2021 14:38:18 Christian König wrote:
>
> On 10.06.21 at 19:59, Christian König wrote:
> > On 10.06.21 at 19:50, Ondrej Zary wrote:
> >> [SNIP]
> >>> I can't see how this is called from the nouveau code, only
> >>> possibility I
> >>> see is that it is maybe called through the AGP code somehow.
> >> Yes, you're right:
> >> [   13.192663] Call Trace:
> >> [   13.192678]  dump_stack+0x54/0x68
> >> [   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
> >> [   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
> >> [   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
> >> [   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
> >> [   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
> >> [   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
> >> [   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
> >> [   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
> >> [   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> >> [   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
> >> [   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
> >> [   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
> >> [   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
> >> [   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
> >> [   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
> >> [   13.193686]  ? kfree+0x9e/0x11a
> >> [   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
> >> [   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
> >> [   13.193924]  ? pci_enable_device_flags+0x1e/0xac
> >> [   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
> >> [   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
> >> [   13.194195]  pci_device_probe+0x89/0xe9
> >> [   13.194205]  really_probe+0x127/0x2a7
> >> [   13.194212]  driver_probe_device+0x5b/0x87
> >> [   13.194219]  device_driver_attach+0x2e/0x41
> >> [   13.194226]  __driver_attach+0x7c/0x83
> >> [   13.194232]  bus_for_each_dev+0x4c/0x66
> >> [   13.194238]  driver_attach+0x14/0x16
> >> [   13.194244]  ? device_driver_attach+0x41/0x41
> >> [   13.194251]  bus_add_driver+0xc5/0x16c
> >> [   13.194258]  driver_register+0x87/0xb9
> >> [   13.194265]  __pci_register_driver+0x38/0x3b
> >> [   13.194271]  ? 0xf0c0d000
> >> [   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]
> >>
> >> How is ttm_dma_tt->dma_address allocated?
> >
> > Mhm, I need to double check how AGP is supposed to work.
> >
> > Since barely anybody is using it these days it is something which
> > breaks from time to time.
>
> I have no idea how that ever worked in the first place since AGP isn't
> supposed to sync between CPU/GPU. Everything is coherent for that case.
>
> Anyway here is a patch which adds a check to those functions if the
> dma_address array is allocated in the first place. Please test it.

Thanks, the patch fixes the problem and nouveau now works!
It should be applied to 5.12-stable as well (5.11 is also affected, but it is EOL).

It's weird that it worked before.
Looks like dma_address was used uninitialized - it contained some random
crap:
[ 12.293304] nouveau_bo_sync_for_device: ttm_dma->dma_address=3e055971 ttm_dma->ttm.num_pages=18
[ 12.293321] ttm_dma->dma_address[0]=0x0
[ 12.293341] ttm_dma->dma_address[1]=0x0
[ 12.293360] ttm_dma->dma_address[2]=0xee728980
[ 12.293379] ttm_dma->dma_address[3]=0xed1cb120
[ 12.293397] ttm_dma->dma_address[4]=0x12
[ 12.293416] ttm_dma->dma_address[5]=0x0
[ 12.293434] ttm_dma->dma_address[6]=0x1
[ 12.293453] ttm_dma->dma_address[7]=0x0
[ 12.293471] ttm_dma->dma_address[8]=0x10000
[ 12.293490] ttm_dma->dma_address[9]=0x0
[ 12.293510] ttm_dma->dma_address[10]=0x101
[ 12.293528] ttm_dma->dma_address[11]=0xee7289ec
[ 12.293546] ttm_dma->dma_address[12]=0xee7289ec
[ 12.293564] ttm_dma->dma_address[13]=0x0
[ 12.293581] ttm_dma->dma_address[14]=0x0
[ 12.293599] ttm_dma->dma_address[15]=0x0
[ 12.293616] ttm_dma->dma_address[16]=0x0
[ 12.293634] ttm_dma->dma_address[17]=0x0
But it did not matter as dma_sync_single_for_device is a no-op here.
When dma_address is properly initialized to NULL, it crashes...
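
To illustrate the point above: a per-page sync loop has to load dma_address[i] before it can call the sync routine, so the load happens even when the routine itself does nothing. The sketch below is a userspace model under that assumption. model_sync_noop, model_sync_walk and model_load_address are made-up names, and the 4096-byte page size is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* No-op stand-in for a sync routine that has nothing to do on this machine. */
static void model_sync_noop(uintptr_t dma_addr, size_t len)
{
    (void)dma_addr;
    (void)len;
}

/* Walks the array the way a per-page sync loop would and returns the
 * number of entries it loaded. The caller must pass a valid array. */
static size_t model_sync_walk(const uintptr_t *dma_address, size_t num_pages)
{
    size_t i;

    for (i = 0; i < num_pages; i++) {
        uintptr_t addr = dma_address[i];   /* the load; a NULL base faults here */
        model_sync_noop(addr, 4096);       /* the value itself is never used */
    }
    return i;
}

/* Address that the load for entry i touches, computed without
 * dereferencing anything. */
static uintptr_t model_load_address(const uintptr_t *base, size_t i)
{
    return (uintptr_t)base + i * sizeof(*base);
}
```

With a stale but mapped array the walk completes and the garbage values are simply ignored. With a NULL base, the load for entry 0 targets address 0 in this model, which is where a NULL pointer dereference is reported.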



--
Ondrej Zary

2021-06-14 11:21:48

by Christian König

[permalink] [raw]
Subject: Re: nouveau broken on Riva TNT2 in 5.13.0-rc4: NULL pointer dereference in nouveau_bo_sync_for_device



On 11.06.21 at 20:23, Ondrej Zary wrote:
> On Friday 11 June 2021 14:38:18 Christian König wrote:
>> On 10.06.21 at 19:59, Christian König wrote:
>>> On 10.06.21 at 19:50, Ondrej Zary wrote:
>>>> [SNIP]
>>>>> I can't see how this is called from the nouveau code, only
>>>>> possibility I
>>>>> see is that it is maybe called through the AGP code somehow.
>>>> Yes, you're right:
>>>> [   13.192663] Call Trace:
>>>> [   13.192678]  dump_stack+0x54/0x68
>>>> [   13.192690]  ttm_tt_init+0x11/0x8a [ttm]
>>>> [   13.192699]  ttm_agp_tt_create+0x39/0x51 [ttm]
>>>> [   13.192840]  nouveau_ttm_tt_create+0x17/0x22 [nouveau]
>>>> [   13.192856]  ttm_tt_create+0x78/0x8c [ttm]
>>>> [   13.192864]  ttm_bo_handle_move_mem+0x7d/0xca [ttm]
>>>> [   13.192873]  ttm_bo_validate+0x92/0xc8 [ttm]
>>>> [   13.192883]  ttm_bo_init_reserved+0x216/0x243 [ttm]
>>>> [   13.192892]  ttm_bo_init+0x45/0x65 [ttm]
>>>> [   13.193018]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>>>> [   13.193150]  nouveau_bo_init+0x8c/0x94 [nouveau]
>>>> [   13.193273]  ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
>>>> [   13.193407]  nouveau_bo_new+0x44/0x57 [nouveau]
>>>> [   13.193537]  nouveau_channel_prep+0xa3/0x269 [nouveau]
>>>> [   13.193665]  nouveau_channel_new+0x3c/0x5f7 [nouveau]
>>>> [   13.193679]  ? slab_free_freelist_hook+0x3b/0xa7
>>>> [   13.193686]  ? kfree+0x9e/0x11a
>>>> [   13.193781]  ? nvif_object_sclass_put+0xd/0x16 [nouveau]
>>>> [   13.193908]  nouveau_drm_device_init+0x2e2/0x646 [nouveau]
>>>> [   13.193924]  ? pci_enable_device_flags+0x1e/0xac
>>>> [   13.194052]  nouveau_drm_probe+0xeb/0x188 [nouveau]
>>>> [   13.194182]  ? nouveau_drm_device_init+0x646/0x646 [nouveau]
>>>> [   13.194195]  pci_device_probe+0x89/0xe9
>>>> [   13.194205]  really_probe+0x127/0x2a7
>>>> [   13.194212]  driver_probe_device+0x5b/0x87
>>>> [   13.194219]  device_driver_attach+0x2e/0x41
>>>> [   13.194226]  __driver_attach+0x7c/0x83
>>>> [   13.194232]  bus_for_each_dev+0x4c/0x66
>>>> [   13.194238]  driver_attach+0x14/0x16
>>>> [   13.194244]  ? device_driver_attach+0x41/0x41
>>>> [   13.194251]  bus_add_driver+0xc5/0x16c
>>>> [   13.194258]  driver_register+0x87/0xb9
>>>> [   13.194265]  __pci_register_driver+0x38/0x3b
>>>> [   13.194271]  ? 0xf0c0d000
>>>> [   13.194362]  nouveau_drm_init+0x14c/0x1000 [nouveau]
>>>>
>>>> How is ttm_dma_tt->dma_address allocated?
>>> Mhm, I need to double check how AGP is supposed to work.
>>>
>>> Since barely anybody is using it these days it is something which
>>> breaks from time to time.
>> I have no idea how that ever worked in the first place since AGP isn't
>> supposed to sync between CPU/GPU. Everything is coherent for that case.
>>
>> Anyway here is a patch which adds a check to those functions if the
>> dma_address array is allocated in the first place. Please test it.
> Thanks, the patch fixes the problem and nouveau now works!
> Should be applied to 5.12-stable too (5.11 is affected too but EOL).

I will just add a CC stable tag before pushing.

>
> It's weird that it worked before.
> Looks like dma_address was used uninitialized - it contained some random
> crap:
> [ 12.293304] nouveau_bo_sync_for_device: ttm_dma->dma_address=3e055971 ttm_dma->ttm.num_pages=18
> [ 12.293321] ttm_dma->dma_address[0]=0x0
> [ 12.293341] ttm_dma->dma_address[1]=0x0
> [ 12.293360] ttm_dma->dma_address[2]=0xee728980
> [ 12.293379] ttm_dma->dma_address[3]=0xed1cb120
> [ 12.293397] ttm_dma->dma_address[4]=0x12
> [ 12.293416] ttm_dma->dma_address[5]=0x0
> [ 12.293434] ttm_dma->dma_address[6]=0x1
> [ 12.293453] ttm_dma->dma_address[7]=0x0
> [ 12.293471] ttm_dma->dma_address[8]=0x10000
> [ 12.293490] ttm_dma->dma_address[9]=0x0
> [ 12.293510] ttm_dma->dma_address[10]=0x101
> [ 12.293528] ttm_dma->dma_address[11]=0xee7289ec
> [ 12.293546] ttm_dma->dma_address[12]=0xee7289ec
> [ 12.293564] ttm_dma->dma_address[13]=0x0
> [ 12.293581] ttm_dma->dma_address[14]=0x0
> [ 12.293599] ttm_dma->dma_address[15]=0x0
> [ 12.293616] ttm_dma->dma_address[16]=0x0
> [ 12.293634] ttm_dma->dma_address[17]=0x0
> But it did not matter as dma_sync_single_for_device is a no-op here.
> When dma_address is properly initialized to NULL, it crashes...

Ok, that explains things, but it essentially means that this only worked
by coincidence.

I just sent out the patch to Ben, the list and you once more. Please reply
with an rb, acked-by and/or tested-by so that I can push it ASAP.

Thanks,
Christian.
