2023-07-07 01:28:22

by Chen, Guchun

Subject: RE: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX

[Public]

Hi Mike,

Yes, we are aware of this problem, and we are working on it. The problem is caused by recent code that stores xcp_id in the amdgpu bo for memory usage accounting and so on. However, not all VMs are attached to that, as in the case of amdgpu_mes_self_test.

Regards,
Guchun

> -----Original Message-----
> From: Mikhail Gavrilov <[email protected]>
> Sent: Friday, July 7, 2023 5:34 AM
> To: amd-gfx list <[email protected]>; Koenig, Christian
> <[email protected]>; Deucher, Alexander
> <[email protected]>; Chen, Guchun <[email protected]>;
> Linux List Kernel Mailing <[email protected]>
> Subject: [regression][6.5] KASAN: slab-out-of-bounds in
> amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX
>
> Hi,
> On a Radeon 7900XTX, the issue "slab-out-of-bounds in
> amdgpu_vm_pt_create+0x555/0x670" appeared between commits 3a8a670eeeaa and
> e55e5df193d2.
> Graphics cards with 6800M and 6900XT chips are unaffected.
>
> [ 12.562762] ==================================================================
> [ 12.562775] BUG: KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu]
> [ 12.563173] Read of size 4 at addr ffff8881347a8dc8 by task (udev-worker)/660
>
> [ 12.563183] CPU: 0 PID: 660 Comm: (udev-worker) Tainted: G W L ------- --- 6.5.0-0.rc0.20230630gite55e5df193d2.5.fc39.x86_64+debug #1
> [ 12.563192] Hardware name: Micro-Star International Co., Ltd. MS-7D73/MPG B650I EDGE WIFI (MS-7D73), BIOS 1.30 05/24/2023
> [ 12.563199] Call Trace:
> [ 12.563203] <TASK>
> [ 12.563206] dump_stack_lvl+0x76/0xd0
> [ 12.563213] print_report+0xcf/0x670
> [ 12.563220] ? amdgpu_vm_pt_create+0x555/0x670 [amdgpu]
> [ 12.563433] kasan_report+0xa6/0xe0
> [ 12.563436] ? amdgpu_vm_pt_create+0x555/0x670 [amdgpu]
> [ 12.563637] amdgpu_vm_pt_create+0x555/0x670 [amdgpu]
> [ 12.563835] ? __pfx_amdgpu_vm_pt_create+0x10/0x10 [amdgpu]
> [ 12.564030] ? __module_address+0x95/0x240
> [ 12.564035] ? lockdep_init_map_type+0x1a5/0x840
> [ 12.564040] ? __raw_spin_lock_init+0x3f/0x110
> [ 12.564044] amdgpu_vm_init+0x749/0x10c0 [amdgpu]
> [ 12.564240] ? __pfx_amdgpu_vm_init+0x10/0x10 [amdgpu]
> [ 12.564441] amdgpu_mes_self_test+0x16e/0x9e0 [amdgpu]
> [ 12.564661] ? lock_acquire+0x1a6/0x4f0
> [ 12.564664] ? __pfx_amdgpu_mes_self_test+0x10/0x10 [amdgpu]
> [ 12.564871] ? local_clock_noinstr+0xd/0xc0
> [ 12.564876] ? find_held_lock+0x34/0x120
> [ 12.564882] ? _raw_spin_unlock_irqrestore+0x4f/0x80
> [ 12.564886] ? amdgpu_irq_update+0x1b2/0x2c0 [amdgpu]
> [ 12.565094] mes_v11_0_late_init+0xb8/0xe0 [amdgpu]
> [ 12.565304] amdgpu_device_ip_late_init+0x100/0x7b0 [amdgpu]
> [ 12.565509] amdgpu_device_init+0x7569/0x8660 [amdgpu]
> [ 12.565721] ? __pfx_amdgpu_device_init+0x10/0x10 [amdgpu]
> [ 12.565920] ? __pfx_pci_bus_read_config_word+0x10/0x10
> [ 12.565925] ? do_pci_enable_device+0x22d/0x2a0
> [ 12.565928] ? pci_wait_for_pending+0xa1/0x110
> [ 12.565933] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
> [ 12.566131] amdgpu_pci_probe+0x287/0x9e0 [amdgpu]
> [ 12.566337] ? __pfx_amdgpu_pci_probe+0x10/0x10 [amdgpu]
> [ 12.566536] local_pci_probe+0xda/0x190
> [ 12.566540] pci_device_probe+0x23a/0x770
> [ 12.566544] ? kernfs_add_one+0x326/0x490
> [ 12.566548] ? kernfs_get.part.0+0x4c/0x70
> [ 12.566552] ? __pfx_pci_device_probe+0x10/0x10
> [ 12.566555] ? kernfs_create_link+0x16b/0x230
> [ 12.566559] ? kernfs_put+0x1c/0x40
> [ 12.566562] ? sysfs_do_create_link_sd+0x8e/0x100
> [ 12.566566] really_probe+0x3df/0xb80
> [ 12.566570] __driver_probe_device+0x18c/0x450
> [ 12.566573] driver_probe_device+0x4a/0x120
> [ 12.566576] __driver_attach+0x1e5/0x4a0
> [ 12.566579] ? __pfx___driver_attach+0x10/0x10
> [ 12.566582] bus_for_each_dev+0x106/0x190
> [ 12.566586] ? __pfx_bus_for_each_dev+0x10/0x10
> [ 12.566591] bus_add_driver+0x2a1/0x570
> [ 12.566594] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
> [ 12.566794] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
> [ 12.566993] driver_register+0x134/0x460
> [ 12.566996] ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
> [ 12.567193] do_one_initcall+0xd2/0x430
> [ 12.567197] ? __pfx_do_one_initcall+0x10/0x10
> [ 12.567202] ? kasan_unpoison+0x44/0x70
> [ 12.567206] do_init_module+0x238/0x770
> [ 12.567210] load_module+0x5581/0x6f10
> [ 12.567216] ? __pfx_load_module+0x10/0x10
> [ 12.567220] ? find_held_lock+0x34/0x120
> [ 12.567223] ? local_clock_noinstr+0xd/0xc0
> [ 12.567227] ? __pfx___might_resched+0x10/0x10
> [ 12.567232] ? __do_sys_init_module+0x1f2/0x220
> [ 12.567235] __do_sys_init_module+0x1f2/0x220
> [ 12.567238] ? __pfx___do_sys_init_module+0x10/0x10
> [ 12.567243] do_syscall_64+0x5d/0x90
> [ 12.567247] ? asm_exc_page_fault+0x26/0x30
> [ 12.567251] ? lockdep_hardirqs_on+0x81/0x110
> [ 12.567255] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [ 12.567258] RIP: 0033:0x7fdb4e92b5de
> [ 12.567267] Code: 48 8b 0d 55 08 12 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 22 08 12 00 f7 d8 64 89 01 48
> [ 12.567274] RSP: 002b:00007ffe9ef35008 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
> [ 12.567279] RAX: ffffffffffffffda RBX: 000055d8c8acb440 RCX: 00007fdb4e92b5de
> [ 12.567282] RDX: 000055d8c8af3840 RSI: 0000000003c829ee RDI: 00007fdb46c16010
> [ 12.567285] RBP: 00007ffe9ef350c0 R08: 000055d8c8ad5bd0 R09: ffffffdcab967160
> [ 12.567289] R10: 000055dd95219e95 R11: 0000000000000246 R12: 000055d8c8af3840
> [ 12.567292] R13: 0000000000020000 R14: 000055d8c8af0d30 R15: 000055d8c8af2740
> [ 12.567297] </TASK>
>
> [ 12.567300] Allocated by task 660:
> [ 12.567302] kasan_save_stack+0x33/0x60
> [ 12.567306] kasan_set_track+0x25/0x30
> [ 12.567309] __kasan_kmalloc+0x8f/0xa0
> [ 12.567312] amdgpu_mes_self_test+0x157/0x9e0 [amdgpu]
> [ 12.567529] mes_v11_0_late_init+0xb8/0xe0 [amdgpu]
> [ 12.567738] amdgpu_device_ip_late_init+0x100/0x7b0 [amdgpu]
> [ 12.567942] amdgpu_device_init+0x7569/0x8660 [amdgpu]
> [ 12.568142] amdgpu_driver_load_kms+0x1d/0x4b0 [amdgpu]
> [ 12.568343] amdgpu_pci_probe+0x287/0x9e0 [amdgpu]
> [ 12.568543] local_pci_probe+0xda/0x190
> [ 12.568546] pci_device_probe+0x23a/0x770
> [ 12.568550] really_probe+0x3df/0xb80
> [ 12.568552] __driver_probe_device+0x18c/0x450
> [ 12.568555] driver_probe_device+0x4a/0x120
> [ 12.568557] __driver_attach+0x1e5/0x4a0
> [ 12.568560] bus_for_each_dev+0x106/0x190
> [ 12.568563] bus_add_driver+0x2a1/0x570
> [ 12.568566] driver_register+0x134/0x460
> [ 12.568569] do_one_initcall+0xd2/0x430
> [ 12.568572] do_init_module+0x238/0x770
> [ 12.568574] load_module+0x5581/0x6f10
> [ 12.568577] __do_sys_init_module+0x1f2/0x220
> [ 12.568580] do_syscall_64+0x5d/0x90
> [ 12.568582] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>
> [ 12.568587] The buggy address belongs to the object at ffff8881347a8000 which belongs to the cache kmalloc-4k of size 4096
> [ 12.568593] The buggy address is located 608 bytes to the right of allocated 2920-byte region [ffff8881347a8000, ffff8881347a8b68)
>
> [ 12.568600] The buggy address belongs to the physical page:
> [ 12.568602] page:000000001bdef670 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1347a8
> [ 12.568607] head:000000001bdef670 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
> [ 12.568611] flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
> [ 12.568616] page_type: 0xffffffff()
> [ 12.568619] raw: 0017ffffc0010200 ffff88810004d040 dead000000000122 0000000000000000
> [ 12.568622] raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000
> [ 12.568626] page dumped because: kasan: bad access detected
>
> [ 12.568630] Memory state around the buggy address:
> [ 12.568632] ffff8881347a8c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [ 12.568635] ffff8881347a8d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [ 12.568639] >ffff8881347a8d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [ 12.568642] ^
> [ 12.568644] ffff8881347a8e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [ 12.568648] ffff8881347a8e80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [ 12.568651] ==================================================================
>
> I spent 6 days bisecting this issue.
> But the result turned out to be unsatisfactory, because on most commits the
> video card did not switch to graphics mode, and instead of
> "slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670" I got the error
> "KASAN: null-ptr-deref in range [0x00000000000003f0-0x00000000000003f7]".
> Because of this, all these commits were marked as "skip".
>
> The bisect results can be found in the attached file
> "bisect-log-slab-out-of-bounds-in-amdgpu_vm_pt_create.txt". The
> corresponding kernel logs of each bisect step are packed in the archive
> "dmesg-slab-out-of-bounds-in-amdgpu_vm_pt_create.zip".
>
> How else can I help here?
>
> --
> Best Regards,
> Mike Gavrilov.


2023-07-07 22:33:08

by Mikhail Gavrilov

Subject: Re: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX

On Fri, Jul 7, 2023 at 6:01 AM Chen, Guchun <[email protected]> wrote:
>
> [Public]
>
> Hi Mike,
>
> Yes, we are aware of this problem, and we are working on it. The problem is caused by recent code that stores xcp_id in the amdgpu bo for memory usage accounting and so on. However, not all VMs are attached to that, as in the case of amdgpu_mes_self_test.
>

I would like to take part in testing the fix.

--
Best Regards,
Mike Gavrilov.

2023-07-14 11:42:48

by Chen, Guchun

Subject: RE: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX

[Public]

> -----Original Message-----
> From: Mikhail Gavrilov <[email protected]>
> Sent: Saturday, July 8, 2023 6:27 AM
> To: Chen, Guchun <[email protected]>
> Cc: amd-gfx list <[email protected]>; Koenig, Christian
> <[email protected]>; Deucher, Alexander
> <[email protected]>; Linux List Kernel Mailing <linux-
> [email protected]>
> Subject: Re: [regression][6.5] KASAN: slab-out-of-bounds in
> amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX
>
> On Fri, Jul 7, 2023 at 6:01 AM Chen, Guchun <[email protected]>
> wrote:
> >
> > [Public]
> >
> > Hi Mike,
> >
> > Yes, we are aware of this problem, and we are working on it. The
> > problem is caused by recent code that stores xcp_id in the amdgpu bo for
> > memory usage accounting and so on. However, not all VMs are attached to
> > that, as in the case of amdgpu_mes_self_test.
> >
>
> I would like to take part in testing the fix.

Thanks for your patience on this, Mike. I think https://patchwork.freedesktop.org/patch/547592/ should help with this; please give it a try.

Regards,
Guchun

> --
> Best Regards,
> Mike Gavrilov.

2023-07-16 21:54:43

by Mikhail Gavrilov

Subject: Re: [regression][6.5] KASAN: slab-out-of-bounds in amdgpu_vm_pt_create+0x555/0x670 [amdgpu] on Radeon 7900XTX

On Fri, Jul 14, 2023 at 4:09 PM Chen, Guchun <[email protected]> wrote:
>
> Thanks for your patience on this, Mike. I think https://patchwork.freedesktop.org/patch/547592/ should help with this; please give it a try.

Tested-by: Mikhail Gavrilov <[email protected]>
Thanks, it looks good. I spent the whole weekend with these patches on
top of 3f01e9fed845 and didn't notice any regressions.

--
Best Regards,
Mike Gavrilov.