Hi, I occasionally see the below trace with Linus' master on an
AMD Milan system:
[ 25.657322] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 25.665097] #PF: supervisor instruction fetch in kernel mode
[ 25.671448] #PF: error_code(0x0010) - not-present page
[ 25.677178] PGD 0 P4D 0
[ 25.680000] Oops: 0010 [#1] PREEMPT SMP NOPTI
[ 25.684857] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-rc3+ #71
[ 25.692043] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1006B 08/20/2021
[ 25.701069] RIP: 0010:0x0
[ 25.703993] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 25.711661] RSP: 0018:ffffc90000003eb8 EFLAGS: 00010002
[ 25.717488] RAX: 0000000000000101 RBX: 0000000000000101 RCX: 0000000000000000
[ 25.725449] RDX: 00000000fffef080 RSI: 0000000000000000 RDI: ffff88817751e350
[ 25.733408] RBP: ffffc90000003ed8 R08: 0000000000000000 R09: 00000000000002e6
[ 25.741368] R10: ffff88810dc8b2d0 R11: ffff88bf4de290b0 R12: ffff88817751e350
[ 25.749327] R13: 0000000000000000 R14: ffff88bf4de18dc0 R15: ffff88817751e350
[ 25.757286] FS: 0000000000000000(0000) GS:ffff88bf4de00000(0000) knlGS:0000000000000000
[ 25.766312] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 25.772720] CR2: ffffffffffffffd6 CR3: 0000000166650002 CR4: 0000000000770ef0
[ 25.780754] PKRU: 55555554
[ 25.783768] Call Trace:
[ 25.786492] <IRQ>
[ 25.788729] call_timer_fn+0x2e/0x130
[ 25.792814] run_timer_softirq+0x361/0x4a0
[ 25.797381] ? lapic_next_event+0x21/0x30
[ 25.801852] ? clockevents_program_event+0x8f/0xe0
[ 25.807195] __do_softirq+0xfb/0x2db
[ 25.811183] irq_exit_rcu+0x98/0xd0
[ 25.815073] sysvec_apic_timer_interrupt+0xac/0xd0
[ 25.820415] </IRQ>
[ 25.822751] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 25.828481] RIP: 0010:native_safe_halt+0xb/0x10
[ 25.833533] Code: 08 74 81 eb c0 cc cc cc cc cc cc cc cc cc cc eb 07 0f 00 2d 69 68 60 00 f4 c3 0f 1f 44 00 00 eb 07 0f 00 2d 59 68 60 00 fb f4 <c3> cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 53 e8 b1 17 ff ff 66 90
[ 25.854485] RSP: 0018:ffffffff82803e10 EFLAGS: 00000206
[ 25.860311] RAX: ffffffff81c06c60 RBX: 0000000000000000 RCX: 0000000000000000
[ 25.868270] RDX: 0000000000000000 RSI: ffffffff825afb2f RDI: ffffffff825bda91
[ 25.876229] RBP: ffffffff82803e18 R08: 00000066a1710f94 R09: 0000000000000001
[ 25.884188] R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff82813940
[ 25.892146] R13: ffffffff82813940 R14: 0000000000000000 R15: 0000000000000000
[ 25.900098] ? __sched_text_end+0x6/0x6
[ 25.904462] ? default_idle+0xe/0x20
[ 25.908446] arch_cpu_idle+0x15/0x20
[ 25.912432] default_idle_call+0x33/0xf0
[ 25.916804] do_idle+0x209/0x270
[ 25.920402] cpu_startup_entry+0x20/0x30
[ 25.924774] rest_init+0xd4/0xe0
[ 25.928370] arch_call_rest_init+0xe/0x1b
[ 25.932840] start_kernel+0x6bc/0x6e2
[ 25.936922] x86_64_start_reservations+0x24/0x26
[ 25.942069] x86_64_start_kernel+0x75/0x79
[ 25.946635] secondary_startup_64_no_verify+0xb0/0xbb
[ 25.952270] Modules linked in: intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm ipmi_si(+) ipmi_devintf wmi_bmof rapl ipmi_msghandler input_leds ccp k10temp mac_hid sch_fq_codel msr ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm crct10dif_pclmul drm_kms_helper crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect hid_generic aesni_intel sysimgblt crypto_simd fb_sys_fops cryptd cec usbhid nvme ahci drm e1000e nvme_core libahci hid i2c_piix4 wmi
[ 26.017199] CR2: 0000000000000000
[ 26.020893] ---[ end trace 4ffc93f5c298732e ]---
[ 26.026040] RIP: 0010:0x0
[ 26.028958] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 26.036626] RSP: 0018:ffffc90000003eb8 EFLAGS: 00010002
[ 26.042557] RAX: 0000000000000101 RBX: 0000000000000101 RCX: 0000000000000000
[ 26.050506] RDX: 00000000fffef080 RSI: 0000000000000000 RDI: ffff88817751e350
[ 26.058464] RBP: ffffc90000003ed8 R08: 0000000000000000 R09: 00000000000002e6
[ 26.066423] R10: ffff88810dc8b2d0 R11: ffff88bf4de290b0 R12: ffff88817751e350
[ 26.074380] R13: 0000000000000000 R14: ffff88bf4de18dc0 R15: ffff88817751e350
[ 26.082330] FS: 0000000000000000(0000) GS:ffff88bf4de00000(0000) knlGS:0000000000000000
[ 26.091356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 26.097765] CR2: ffffffffffffffd6 CR3: 0000000166650002 CR4: 0000000000770ef0
[ 26.105722] PKRU: 55555554
[ 26.108736] Kernel panic - not syncing: Fatal exception in interrupt
[ 26.116349] Kernel Offset: disabled
[ 26.120237] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
Adding 'nokaslr' and CONFIG_DEBUG_OBJECTS_TIMERS=y didn't make
it more reproducible, but CONFIG_KASAN=y changed the signature
to this use-after-free that I used to drive bisect:
[ 40.503396] ==================================================================
[ 40.514471] input: HID 0b1f:03e9 as /devices/pci0000:20/0000:20:08.1/0000:22:00.3/usb1/1-2/1-2.2/1-2.2:1.1/0003:0B1F:03E9.0004/input/input6
[ 40.521823] BUG: KASAN: use-after-free in enqueue_timer+0x4f/0x1e0
[ 40.536634] hid-generic 0003:0B1F:03E9.0004: input,hidraw1: USB HID v1.00 Mouse [HID 0b1f:03e9] on usb-0000:22:00.3-2.2/input1
[ 40.542696] Write of size 8 at addr ffff8881feb80358 by task kworker/0:2/786
[ 40.542705] CPU: 0 PID: 786 Comm: kworker/0:2 Not tainted 5.15.0-rc4+ #105
[ 40.572608] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1006B 08/20/2021
[ 40.581639] Workqueue: events work_for_cpu_fn
[ 40.586507] Call Trace:
[ 40.589234] dump_stack_lvl+0x4a/0x5f
[ 40.593324] print_address_description.constprop.0+0x28/0x150
[ 40.599750] ? enqueue_timer+0x4f/0x1e0
[ 40.604032] kasan_report.cold+0x82/0xdb
[ 40.608435] ? ast_pci_probe+0xb1/0x100 [ast]
[ 40.613305] ? enqueue_timer+0x4f/0x1e0
[ 40.617582] __asan_store8+0x6c/0x90
[ 40.621570] enqueue_timer+0x4f/0x1e0
[ 40.625654] internal_add_timer+0x9b/0xd0
[ 40.630127] ? enqueue_timer+0x1e0/0x1e0
[ 40.634501] ? debug_smp_processor_id+0x17/0x20
[ 40.639557] ? get_nohz_timer_target+0x6f/0x290
[ 40.644610] ? lock_timer_base+0xb6/0xe0
[ 40.648986] add_timer+0x29a/0x3a0
[ 40.652781] ? run_timer_softirq+0xa40/0xa40
[ 40.657547] ? preempt_count_sub+0x18/0xc0
[ 40.662120] ? _raw_spin_unlock_irqrestore+0x22/0x40
[ 40.667670] ? drm_connector_list_iter_next+0x1bb/0x230 [drm]
[ 40.674163] __queue_delayed_work+0xdc/0x140
[ 40.678932] queue_delayed_work_on+0x6c/0x90
[ 40.683698] drm_kms_helper_poll_enable.part.0+0xf5/0x140 [drm_kms_helper]
[ 40.691411] ? drm_kms_helper_poll_fini+0x40/0x40 [drm_kms_helper]
[ 40.698344] ? __queue_work+0x640/0x640
[ 40.702623] ? debug_object_init+0x16/0x20
[ 40.707195] drm_kms_helper_poll_init+0xb8/0xc0 [drm_kms_helper]
[ 40.713968] ast_mode_config_init+0x5f8/0x800 [ast]
[ 40.719419] ? ast_crtc_reset+0xa0/0xa0 [ast]
[ 40.724287] ast_device_create.cold+0xab6/0xbdd [ast]
[ 40.729931] ast_pci_probe+0xba/0x100 [ast]
[ 40.734602] ? __pm_runtime_resume+0x65/0xb0
[ 40.739366] ? ast_pm_suspend+0x70/0x70 [ast]
[ 40.739887] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 40.744231] local_pci_probe+0x80/0xd0
[ 40.744237] ? pci_device_shutdown+0x90/0x90
[ 40.753180] ata7.00: supports DRM functions and may not be fully accessible
[ 40.755316] work_for_cpu_fn+0x2e/0x50
[ 40.760086] ata7.00: ATA-11: Samsung SSD 860 PRO 256GB, RVM01B6Q, max UDMA/133
[ 40.767845] process_one_work+0x4b9/0x7f0
[ 40.767852] worker_thread+0x47a/0x6b0
[ 40.772971] ata7.00: disabling queued TRIM support
[ 40.780081] kthread+0x209/0x240
[ 40.780085] ? process_one_work+0x7f0/0x7f0
[ 40.780089] ? set_kthread_struct+0x80/0x80
[ 40.784562] ata7.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 32), AA
[ 40.788736] ret_from_fork+0x22/0x30
[ 40.796196] ata7.00: Features: Trust Dev-Sleep NCQ-sndrcv
[ 40.802735] ata7.00: supports DRM functions and may not be fully accessible
[ 40.807011] Allocated by task 786:
[ 40.807015] kasan_save_stack+0x23/0x50
[ 40.815459] ata7.00: disabling queued TRIM support
[ 40.818577] __kasan_kmalloc+0x83/0xa0
[ 40.818581] __kmalloc+0x2f7/0x4c0
[ 40.818585] __devm_drm_dev_alloc+0x2f/0xb0 [drm]
[ 40.826718] ata7.00: configured for UDMA/133
[ 40.834032] ast_device_create+0x32/0x2c0 [ast]
[ 40.838646] scsi 6:0:0:0: Direct-Access ATA Samsung SSD 860 1B6Q PQ: 0 ANSI: 5
[ 40.842107] ast_pci_probe+0xba/0x100 [ast]
[ 40.842117] local_pci_probe+0x80/0xd0
[ 40.842121] work_for_cpu_fn+0x2e/0x50
[ 40.848592] ata7.00: Enabling discard_zeroes_data
[ 40.851648] process_one_work+0x4b9/0x7f0
[ 40.855629] sd 6:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
[ 40.860689] worker_thread+0x47a/0x6b0
[ 40.860694] kthread+0x209/0x240
[ 40.860696] ret_from_fork+0x22/0x30
[ 40.865913] sd 6:0:0:0: Attached scsi generic sg0 type 0
[ 40.870516] Freed by task 1190:
[ 40.870519] kasan_save_stack+0x23/0x50
[ 40.870523] kasan_set_track+0x20/0x30
[ 40.871945] sd 6:0:0:0: [sda] Write Protect is off
[ 40.871968] sd 6:0:0:0: [sda] Mode Sense: 00 3a 00 00
[ 40.872098] sd 6:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 40.879881] ata7.00: disabled
[ 40.884218] kasan_set_free_info+0x24/0x40
[ 40.884222] __kasan_slab_free+0xef/0x120
[ 40.884227] kfree+0xa1/0x3e0
[ 40.978619] drm_dev_release+0x58/0x60 [drm]
[ 40.983446] devm_drm_dev_init_release+0x46/0x60 [drm]
[ 40.989235] devm_action_release+0x2c/0x40
[ 40.993803] release_nodes+0x62/0x150
[ 40.997887] devres_release_all+0x10a/0x150
[ 41.002551] really_probe+0x177/0x660
[ 41.006636] __driver_probe_device+0x19e/0x230
[ 41.011592] driver_probe_device+0x4e/0xf0
[ 41.016161] __driver_attach+0x102/0x1d0
[ 41.020536] bus_for_each_dev+0xfb/0x150
[ 41.024911] driver_attach+0x2d/0x40
[ 41.028898] bus_add_driver+0x23f/0x300
[ 41.033177] driver_register+0xdc/0x170
[ 41.037455] __pci_register_driver+0xfc/0x110
[ 41.042314] ast_init+0x48/0x1000 [ast]
[ 41.046598] do_one_initcall+0x99/0x2d0
[ 41.050877] do_init_module+0x10f/0x3c0
[ 41.055154] load_module+0x43fa/0x4710
[ 41.059336] __do_sys_finit_module+0x12a/0x1b0
[ 41.064291] __x64_sys_finit_module+0x43/0x50
[ 41.069149] do_syscall_64+0x3b/0xc0
[ 41.073137] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 41.080433] Last potentially related work creation:
[ 41.085873] kasan_save_stack+0x23/0x50
[ 41.090151] kasan_record_aux_stack+0xac/0xc0
[ 41.095010] insert_work+0x37/0x170
[ 41.098899] __queue_work+0x208/0x640
[ 41.102982] queue_work_on+0x62/0x80
[ 41.106969] __drm_connector_put_safe+0x80/0xa0 [drm]
[ 41.112675] drm_connector_list_iter_next+0x1a0/0x230 [drm]
[ 41.118961] drm_mode_config_cleanup+0x12e/0x500 [drm]
[ 41.124760] drm_mode_config_init_release+0xe/0x10 [drm]
[ 41.130753] drm_managed_release+0x11c/0x240 [drm]
[ 41.136164] drm_dev_release+0x46/0x60 [drm]
[ 41.140992] devm_drm_dev_init_release+0x46/0x60 [drm]
[ 41.146790] devm_action_release+0x2c/0x40
[ 41.151358] release_nodes+0x62/0x150
[ 41.155441] devres_release_all+0x10a/0x150
[ 41.160106] really_probe+0x177/0x660
[ 41.164190] __driver_probe_device+0x19e/0x230
[ 41.169147] driver_probe_device+0x4e/0xf0
[ 41.173716] __driver_attach+0x102/0x1d0
[ 41.178091] bus_for_each_dev+0xfb/0x150
[ 41.182466] driver_attach+0x2d/0x40
[ 41.186452] bus_add_driver+0x23f/0x300
[ 41.190729] driver_register+0xdc/0x170
[ 41.195006] __pci_register_driver+0xfc/0x110
[ 41.199868] ast_init+0x48/0x1000 [ast]
[ 41.204151] do_one_initcall+0x99/0x2d0
[ 41.208429] do_init_module+0x10f/0x3c0
[ 41.212707] load_module+0x43fa/0x4710
[ 41.216888] __do_sys_finit_module+0x12a/0x1b0
[ 41.221843] __x64_sys_finit_module+0x43/0x50
[ 41.226704] do_syscall_64+0x3b/0xc0
[ 41.230691] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 41.237985] The buggy address belongs to the object at ffff8881feb80000
which belongs to the cache kmalloc-8k of size 8192
[ 41.251959] The buggy address is located 856 bytes inside of
8192-byte region [ffff8881feb80000, ffff8881feb82000)
[ 41.265157] The buggy address belongs to the page:
[ 41.270501] page:00000000eecae4f9 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1feb80
[ 41.280985] head:00000000eecae4f9 order:3 compound_mapcount:0 compound_pincount:0
[ 41.289336] flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[ 41.297595] raw: 0017ffffc0010200 0000000000000000 dead000000000122 ffff88810004d180
[ 41.306236] raw: 0000000000000000 0000000000020002 00000001ffffffff 0000000000000000
[ 41.314875] page dumped because: kasan: bad access detected
[ 41.322749] Memory state around the buggy address:
[ 41.328093] ffff8881feb80200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 41.336152] ffff8881feb80280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 41.344211] >ffff8881feb80300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 41.352267] ^
[ 41.359066] ffff8881feb80380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 41.367123] ffff8881feb80400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 41.375181] ==================================================================
That bisection led to this commit:
commit aae74ff9caa8de9a45ae2e46068c417817392a26
Author: Ainux <[email protected]>
Date: Wed May 26 19:15:15 2021 +0800
drm/ast: Add detect function support
The existence of the connector cannot be detected,
so add the detect function to support.
Signed-off-by: Ainux <[email protected]>
Signed-off-by: Thomas Zimmermann <[email protected]>
Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
drivers/gpu/drm/ast/ast_mode.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
I confirmed that if I revert it from v5.15-rc4 (after reverting
its dependent 572994bf18ff "drm/ast: Zero is missing in detect
function"), the problem goes away.
Full .config, dmesg attached.
I can test any possible fixes...
Thanks,
Kim
Hi,
On 10/5/21 1:10 PM, Kim Phillips wrote:
> Hi, I occasionally see the below trace with Linus' master on an
> AMD Milan system:
>
> [ 25.657322] BUG: kernel NULL pointer dereference, address: 0000000000000000
> [ 25.665097] #PF: supervisor instruction fetch in kernel mode
> [ 25.671448] #PF: error_code(0x0010) - not-present page
<snip>
> That bisection led to this commit:
>
> commit aae74ff9caa8de9a45ae2e46068c417817392a26
> Author: Ainux <[email protected]>
> Date: Wed May 26 19:15:15 2021 +0800
> drm/ast: Add detect function support
<snip>
> I confirmed that if I revert it from v5.15-rc4 (after reverting
> its dependent 572994bf18ff "drm/ast: Zero is missing in detect
> function"), the problem goes away.
>
> Full .config, dmesg attached.
>
> I can test any possible fixes...
Ping - if no fixes are in the works, can the offending commit(s)
be reverted?
Thanks,
Kim
On Tue, Oct 12, 2021 at 2:07 AM Kim Phillips <[email protected]> wrote:
>
> Hi,
>
> On 10/5/21 1:10 PM, Kim Phillips wrote:
> > Hi, I occasionally see the below trace with Linus' master on an
> > AMD Milan system:
> >
> > [ 25.657322] BUG: kernel NULL pointer dereference, address: 0000000000000000
> > [ 25.665097] #PF: supervisor instruction fetch in kernel mode
> > [ 25.671448] #PF: error_code(0x0010) - not-present page
> <snip>
> > That bisection led to this commit:
> >
> > commit aae74ff9caa8de9a45ae2e46068c417817392a26
> > Author: Ainux <[email protected]>
> > Date: Wed May 26 19:15:15 2021 +0800
> > drm/ast: Add detect function support
> <snip>
> > I confirmed that if I revert it from v5.15-rc4 (after reverting
> > its dependent 572994bf18ff "drm/ast: Zero is missing in detect
> > function"), the problem goes away.
> >
> > Full .config, dmesg attached.
> >
> > I can test any possible fixes...
>
>
> Ping - if no fixes are in the works, can the offending commit(s)
> be reverted?
tzimmermann?
Dave.