2023-01-10 22:28:42

by Matt Fagnani

Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled

Baolu,

I ran git stash and git checkout v6.2-rc3 to reset to a fresh 6.2-rc3. I
checked that the previous change had been removed by looking at
drivers/pci/ats.c and gitk. I ran git revert 201007ef707a with v6.2-rc3
and built that. 6.2-rc3 with 201007ef707a reverted booted normally
without the problem.

I reset to 6.2-rc3 and checked the change was removed as before. I
applied your second patch with git apply
0001-for-debug-purpose-only.patch and built that. 6.2-rc3 with
0001-for-debug-purpose-only.patch had the black screen problem. I booted
it a second time with rd.driver.blacklist=amdgpu on the kernel command
line so amdgpu wouldn't be started while the initramfs was in use and
the journal would be saved. The black screen happened later in the boot
as before. I pressed Alt+SysRq+s, u, b to sync, remount read-only and
reboot. The journal of that boot didn't have the two warnings I reported
before. A different NULL pointer dereference happened in
pci_dev_specific_acs_enabled, with pci_acs_enabled at the top of the call
trace, which made amdgpu crash as follows.

Jan 10 16:32:31 kernel: [drm] amdgpu kernel modesetting enabled.
Jan 10 16:32:31 kernel: amdgpu: Topology: Add APU node [0x0:0x0]
Jan 10 16:32:31 kernel: Console: switching to colour dummy device 80x25
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: vgaarb: deactivate vga console
Jan 10 16:32:31 kernel: [drm] initializing kernel modesetting (CARRIZO
0x1002:0x9874 0x103C:0x8332 0xCA).
Jan 10 16:32:31 kernel: [drm] register mmio base: 0xF0400000
Jan 10 16:32:31 kernel: [drm] register mmio size: 262144
Jan 10 16:32:31 kernel: [drm] add ip block number 0 <vi_common>
Jan 10 16:32:31 kernel: [drm] add ip block number 1 <gmc_v8_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 2 <cz_ih>
Jan 10 16:32:31 kernel: [drm] add ip block number 3 <gfx_v8_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 4 <sdma_v3_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 5 <powerplay>
Jan 10 16:32:31 kernel: [drm] add ip block number 6 <dm>
Jan 10 16:32:31 kernel: [drm] add ip block number 7 <uvd_v6_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 8 <vce_v3_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 9 <acp_ip>
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS from VFCT
Jan 10 16:32:31 kernel: amdgpu: ATOM BIOS: 113-C75100-031
Jan 10 16:32:31 kernel: [drm] UVD is enabled in physical mode
Jan 10 16:32:31 kernel: [drm] VCE enabled in physical mode
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: Trusted Memory Zone
(TMZ) feature not supported
Jan 10 16:32:31 kernel: [drm] vm size is 64 GB, 2 levels, block size is
10-bit, fragment size is 9-bit
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: VRAM: 512M
0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: GART: 1024M
0x000000FF00000000 - 0x000000FF3FFFFFFF
Jan 10 16:32:31 kernel: [drm] Detected VRAM RAM=512M, BAR=512M
Jan 10 16:32:31 kernel: [drm] RAM width 64bits UNKNOWN
Jan 10 16:32:31 kernel: [drm] amdgpu: 512M of VRAM memory ready
Jan 10 16:32:31 kernel: [drm] amdgpu: 3704M of GTT memory ready.
Jan 10 16:32:31 kernel: [drm] GART: num cpu pages 262144, num gpu pages
262144
Jan 10 16:32:31 kernel: [drm] PCIE GART of 1024M enabled (table at
0x000000F400600000).
Jan 10 16:32:31 kernel: RPC: Registered named UNIX socket transport module.
Jan 10 16:32:31 kernel: RPC: Registered udp transport module.
Jan 10 16:32:31 kernel: RPC: Registered tcp transport module.
Jan 10 16:32:31 kernel: RPC: Registered tcp NFSv4.1 backchannel
transport module.
Jan 10 16:32:31 kernel: amdgpu: hwmgr_sw_init smu backed is smu8_smu
Jan 10 16:32:31 kernel: [drm] Found UVD firmware Version: 1.91 Family ID: 11
Jan 10 16:32:31 kernel: [drm] UVD ENC is disabled
Jan 10 16:32:31 kernel: [drm] Found VCE firmware Version: 52.4 Binary ID: 3
Jan 10 16:32:31 kernel: amdgpu: smu version 27.18.00
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: values for Engine clock
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         300000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         480000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         533340
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         576000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         626090
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         685720
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         720000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         757900
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: Validation clocks:
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    level           : 8
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: values for Display clock
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         300000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         400000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         496560
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         626090
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         685720
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         757900
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         800000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         847060
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: Validation clocks:
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    level           : 8
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: values for Memory clock
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         667000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         933000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: Validation clocks:
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    level           : 8
Jan 10 16:32:31 kernel: [drm] Display Core initialized with v3.2.215!
Jan 10 16:32:31 kernel: snd_hda_intel 0000:00:01.1: bound 0000:00:01.0
(ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Jan 10 16:32:31 kernel: [drm] UVD initialized successfully.
Jan 10 16:32:31 kernel: [drm] VCE initialized successfully.
Jan 10 16:32:31 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Jan 10 16:32:31 kernel: amdgpu: sdma_bitmap: f
Jan 10 16:32:31 kernel: BUG: kernel NULL pointer dereference, address:
000000000000003c
Jan 10 16:32:31 kernel: #PF: supervisor read access in kernel mode
Jan 10 16:32:31 kernel: #PF: error_code(0x0000) - not-present page
Jan 10 16:32:31 kernel: PGD 0 P4D 0
Jan 10 16:32:31 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 10 16:32:31 kernel: CPU: 0 PID: 645 Comm: systemd-udevd Not tainted
6.2.0-rc3+ #92
Jan 10 16:32:31 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS
F.52 12/03/2019
Jan 10 16:32:31 kernel: RIP: 0010:pci_dev_specific_acs_enabled+0x36/0x80
Jan 10 16:32:31 kernel: Code: 6d a9 44 0f b7 e6 55 48 89 fd 53 48 c7 c3
a0 0a 0d aa eb 13 66 83 f8 ff 74 16 48 8b 53 18 48 83 c3 10 48 85 d2 74
31 0f b7 03 <66> 39 45 3c 75 e4 0f b7 43 02 66 39 45 3e 74 06 66 83 f8
ff 75 da
Jan 10 16:32:31 kernel: RSP: 0018:ffffa8e9806ef938 EFLAGS: 00010046
Jan 10 16:32:31 kernel: RAX: 0000000000001002 RBX: ffffffffaa0d0aa0 RCX:
0000000000000000
Jan 10 16:32:31 kernel: RDX: ffffffffa96d1590 RSI: 0000000000000014 RDI:
0000000000000000
Jan 10 16:32:31 kernel: RBP: 0000000000000000 R08: 0000000000000002 R09:
0000000000000000
Jan 10 16:32:31 kernel: R10: 0000000000000000 R11: ffffffffa9bf4220 R12:
0000000000000014
Jan 10 16:32:31 kernel: R13: ffff938f90643800 R14: ffff938f41366100 R15:
ffff938f90643960
Jan 10 16:32:31 kernel: FS:  00007feff3f6cb40(0000)
GS:ffff939037400000(0000) knlGS:0000000000000000
Jan 10 16:32:31 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 10 16:32:31 kernel: CR2: 000000000000003c CR3: 000000010b8a8000 CR4:
00000000001506f0
Jan 10 16:32:31 kernel: Call Trace:
Jan 10 16:32:31 kernel:  <TASK>
Jan 10 16:32:31 kernel:  pci_acs_enabled+0x14/0x80
Jan 10 16:32:31 kernel:  pci_acs_path_enabled+0x35/0x60
Jan 10 16:32:31 kernel:  pci_enable_pasid+0x5d/0xe0
Jan 10 16:32:31 kernel:  amd_iommu_attach_device+0x26a/0x300
Jan 10 16:32:31 kernel:  __iommu_attach_device+0x1b/0x90
Jan 10 16:32:31 kernel:  iommu_attach_group+0x65/0xa0
Jan 10 16:32:31 kernel:  amd_iommu_init_device+0x16b/0x250 [iommu_v2]
Jan 10 16:32:31 kernel:  kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
Jan 10 16:32:31 kernel:  kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
Jan 10 16:32:31 kernel:  kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
Jan 10 16:32:31 kernel:  amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
Jan 10 16:32:31 kernel:  amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
Jan 10 16:32:31 kernel:  ? _raw_spin_lock_irqsave+0x23/0x50
Jan 10 16:32:31 kernel:  amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
Jan 10 16:32:31 kernel:  amdgpu_pci_probe+0x161/0x370 [amdgpu]
Jan 10 16:32:31 kernel:  local_pci_probe+0x41/0x80
Jan 10 16:32:31 kernel:  pci_device_probe+0xb3/0x220
Jan 10 16:32:31 kernel:  really_probe+0xde/0x380
Jan 10 16:32:31 kernel:  ? pm_runtime_barrier+0x50/0x90
Jan 10 16:32:31 kernel:  __driver_probe_device+0x78/0x170
Jan 10 16:32:31 kernel:  driver_probe_device+0x1f/0x90
Jan 10 16:32:31 kernel:  __driver_attach+0xce/0x1c0
Jan 10 16:32:31 kernel:  ? __pfx___driver_attach+0x10/0x10
Jan 10 16:32:31 kernel:  bus_for_each_dev+0x73/0xa0
Jan 10 16:32:31 kernel:  bus_add_driver+0x1ae/0x200
Jan 10 16:32:31 kernel:  driver_register+0x89/0xe0
Jan 10 16:32:31 kernel:  ? __pfx_init_module+0x10/0x10 [amdgpu]
Jan 10 16:32:31 kernel:  do_one_initcall+0x59/0x230
Jan 10 16:32:31 kernel:  do_init_module+0x4a/0x200
Jan 10 16:32:31 kernel:  __do_sys_init_module+0x157/0x180
Jan 10 16:32:31 kernel:  do_syscall_64+0x3a/0x90
Jan 10 16:32:31 kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jan 10 16:32:31 kernel: RIP: 0033:0x7feff3aede4e
Jan 10 16:32:31 kernel: Code: 48 8b 0d e5 5f 0c 00 f7 d8 64 89 01 48 83
c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00
00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 5f 0c 00 f7 d8 64
89 01 48
Jan 10 16:32:31 kernel: RSP: 002b:00007ffcfa200958 EFLAGS: 00000246
ORIG_RAX: 00000000000000af
Jan 10 16:32:31 kernel: RAX: ffffffffffffffda RBX: 0000556204a64420 RCX:
00007feff3aede4e
Jan 10 16:32:31 kernel: RDX: 00007feff3fa7453 RSI: 0000000016ba2751 RDI:
00007fefc4192010
Jan 10 16:32:31 kernel: RBP: 00007feff3fa7453 R08: 27d4eb2f165667c5 R09:
85ebca77c2b2ae63
Jan 10 16:32:31 kernel: R10: 0000000000070121 R11: 0000000000000246 R12:
0000000000020000
Jan 10 16:32:31 kernel: R13: 0000556204960ef0 R14: 0000000000000000 R15:
0000556204a52ef0
Jan 10 16:32:31 kernel:  </TASK>
Jan 10 16:32:31 kernel: Modules linked in: ip_set nf_tables nfnetlink
sunrpc amdgpu(+) iwlmvm mac80211 nls_ascii vfat fat libarc4 uvcvideo
iwlwifi videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev btusb
btrtl snd_ctl_led snd_hda_codec_realtek btbcm snd_hda_codec_generic
btintel i2c_algo_bit snd_hda_codec_hdmi ledtrig_audio videobuf2_common
drm_ttm_helper bluetooth ttm snd_hda_intel mc snd_intel_dspcfg cfg80211
snd_hda_codec edac_mce_amd iommu_v2 snd_hwdep mfd_core snd_hda_core
drm_buddy gpu_sched wmi_bmof snd_seq pcspkr fam15h_power k10temp rfkill
drm_display_helper snd_seq_device snd_pcm cec snd_timer drm_kms_helper
i2c_scmi snd soundcore acpi_cpufreq drm zram hid_logitech_hidpp
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sd_mod
r8169 t10_pi sha512_ssse3 crc64_rocksoft_generic wdat_wdt crc64_rocksoft
hid_logitech_dj crc64 sp5100_tco video wmi fuse dm_multipath
Jan 10 16:32:31 kernel: CR2: 000000000000003c
Jan 10 16:32:31 kernel: ---[ end trace 0000000000000000 ]---
Jan 10 16:32:31 kernel: RIP: 0010:pci_dev_specific_acs_enabled+0x36/0x80
Jan 10 16:32:31 kernel: Code: 6d a9 44 0f b7 e6 55 48 89 fd 53 48 c7 c3
a0 0a 0d aa eb 13 66 83 f8 ff 74 16 48 8b 53 18 48 83 c3 10 48 85 d2 74
31 0f b7 03 <66> 39 45 3c 75 e4 0f b7 43 02 66 39 45 3e 74 06 66 83 f8
ff 75 da
Jan 10 16:32:31 kernel: RSP: 0018:ffffa8e9806ef938 EFLAGS: 00010046
Jan 10 16:32:31 kernel: RAX: 0000000000001002 RBX: ffffffffaa0d0aa0 RCX:
0000000000000000
Jan 10 16:32:31 kernel: RDX: ffffffffa96d1590 RSI: 0000000000000014 RDI:
0000000000000000
Jan 10 16:32:31 kernel: RBP: 0000000000000000 R08: 0000000000000002 R09:
0000000000000000
Jan 10 16:32:31 kernel: R10: 0000000000000000 R11: ffffffffa9bf4220 R12:
0000000000000014
Jan 10 16:32:31 kernel: R13: ffff938f90643800 R14: ffff938f41366100 R15:
ffff938f90643960
Jan 10 16:32:31 kernel: FS:  00007feff3f6cb40(0000)
GS:ffff939037400000(0000) knlGS:0000000000000000
Jan 10 16:32:31 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 10 16:32:31 kernel: CR2: 000000000000003c CR3: 000000010b8a8000 CR4:
00000000001506f0
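
As a cross-check on the oops above (a minimal sketch, not part of the
original log): the bytes at the faulting RIP, 66 39 45 3c, decode as a
16-bit cmp of %ax against memory at 0x3c(%rbp). With RBP = 0 that gives
address 0x3c, which matches CR2, so the pointer held in RBP appears to
have been NULL. The helper name decode_cmp_disp8 is hypothetical and it
handles only this one instruction form.

```python
# Values copied from the oops above.
CODE_AT_RIP = bytes.fromhex("6639453c")  # bytes starting at the <66> marker
REGS = {"rbp": 0x0, "rax": 0x1002}
CR2 = 0x3C

# 16-bit register names indexed by the 3-bit ModRM register field.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def decode_cmp_disp8(code):
    """Decode only the form `66 39 /r disp8` (cmp %reg16, disp8(%base)).

    Hypothetical helper for illustration; not a general x86 decoder.
    """
    prefix, opcode, modrm, disp8 = code[:4]
    if prefix != 0x66 or opcode != 0x39:
        raise ValueError("not a 16-bit `cmp r/m16, r16`")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("only disp8 addressing without SIB is handled")
    disp = disp8 - 256 if disp8 >= 0x80 else disp8  # sign-extend disp8
    # Without a REX prefix the base is one of the low eight 64-bit registers.
    return REG16[reg], "r" + REG16[rm], disp


src, base, disp = decode_cmp_disp8(CODE_AT_RIP)
fault_addr = REGS[base] + disp
print(f"cmp %{src}, {disp:#x}(%{base}) -> address {fault_addr:#x}")
print("matches CR2:", fault_addr == CR2)
```

Which structure field lives at offset 0x3c is not something the log
states; that would need the source or debug info for this exact build.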

This trace looked similar to those of the previous warnings from
amd_iommu_attach_device downwards. I'm attaching the full kernel log
from that boot with 6.2-rc3 with 0001-for-debug-purpose-only.patch. I'm
ccing the others involved in case this might be relevant to them.
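
For comparing call traces between boots without doing it by eye, the
frames can be pulled out of the journal text. The following is a minimal
sketch, assuming the journalctl line format quoted in this message; the
names parse_frames and frames_from are hypothetical, and the sample lines
are copied from the trace above.

```python
import re

# Matches e.g. "kernel:  pci_acs_enabled+0x14/0x80" and
# "kernel:  ? foo+0x1/0x2 [module]"; register and RIP lines do not match.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<unreliable>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?\s*$"
)

SAMPLE = [
    "Jan 10 16:32:31 kernel: RIP: 0010:pci_dev_specific_acs_enabled+0x36/0x80",
    "Jan 10 16:32:31 kernel:  <TASK>",
    "Jan 10 16:32:31 kernel:  pci_acs_enabled+0x14/0x80",
    "Jan 10 16:32:31 kernel:  pci_acs_path_enabled+0x35/0x60",
    "Jan 10 16:32:31 kernel:  pci_enable_pasid+0x5d/0xe0",
    "Jan 10 16:32:31 kernel:  amd_iommu_attach_device+0x26a/0x300",
    "Jan 10 16:32:31 kernel:  __iommu_attach_device+0x1b/0x90",
    "Jan 10 16:32:31 kernel:  iommu_attach_group+0x65/0xa0",
    "Jan 10 16:32:31 kernel:  amd_iommu_init_device+0x16b/0x250 [iommu_v2]",
    "Jan 10 16:32:31 kernel:  ? _raw_spin_lock_irqsave+0x23/0x50",
]


def parse_frames(lines, keep_unreliable=False):
    """Return function names from call-trace lines, top of stack first."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if not m:
            continue
        # Frames prefixed with "? " are guesses by the unwinder; skip by default.
        if m.group("unreliable") and not keep_unreliable:
            continue
        frames.append(m.group("func"))
    return frames


def frames_from(frames, func):
    """Return the part of the trace from `func` downwards, or [] if absent."""
    return frames[frames.index(func):] if func in frames else []


frames = parse_frames(SAMPLE)
print(frames_from(frames, "amd_iommu_attach_device"))
```

Running parse_frames on the journal of each boot and comparing the
frames_from(..., "amd_iommu_attach_device") results would show whether
the traces really share the same lower part.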

Thanks,

Matt

On 1/10/23 03:41, Baolu Lu wrote:
> [offlist]
>
> Can you please try below tests?
>
> 1. with a fresh v6.2-rc3, git revert 201007ef707a
>
> 2. With a fresh v6.2-rc3, apply attached patch.
>
> --
> Best regards,
> baolu
>
> On 2023/1/10 16:06, Matt Fagnani wrote:
>> Baolu,
>>
>> I tried to apply your patch after checking out 6.2-rc3 and
>> origin/master but there were the following errors.
>>
>> git apply amd-iommu-amdgpu-boot-crash-2.patch
>> error: patch failed: drivers/pci/ats.c:382
>> error: drivers/pci/ats.c: patch does not apply
>>
>> I manually changed drivers/pci/ats.c as shown in the patch. I built
>> 6.2-rc3 + the patch. 6.2-rc3 with the patch had the same black screen
>> problem when booting. I added rd.driver.blacklist=amdgpu on the
>> kernel command line to prevent amdgpu from being started while the
>> initramfs was in use, and the black screen happened later in the boot
>> as I described in my previous email. The journal showed the same two
>> warnings and null pointer dereference which made amdgpu crash as I
>> reported.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>


Attachments:
6.2-rc3-0001-for-debug-purpose-only.patch-journalctl-b-1-k.txt (99.46 kB)