All,
There appears to be a bug (possibly a regression) in the amdgpu driver that
results in a fatal error during GPU init. It began with the 5.17 kernel and is
still present in the current 5.18 kernel. The consequences of the NULL pointer
dereference also seem to be getting worse: it now causes the machine to hang at
the end of the shutdown procedure, which is tough for boxes that are
administered remotely.
I have two servers with old AMD cards that have this exact problem. lspci -v
(as user) reports the card as:
01:00.1 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] RV370
[Radeon X300 SE]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 0f03
Flags: fast devsel, NUMA node 0
Memory at fea20000 (32-bit, non-prefetchable) [size=64K]
Capabilities: <access denied>
Kernel modules: amdgpu
The host is:
Host: valkyrie Kernel: 5.18.7-arch1-1 arch: x86_64 bits: 64 compiler: gcc
v: 12.1.0 parameters: BOOT_IMAGE=/vmlinuz-linux
root=UUID=515ef9dc-769f-4548-9a08-3a92fa83d86b rw iommu=soft
amd_iommu_dump= quiet audit=0
Console: pty pts/0 DM: LightDM v: 1.30.0 Distro: Arch Linux
Machine:
Type: Desktop Mobo: Gigabyte model: 990FXA-UD3 v: x.x serial: N/A
BIOS: American Megatrends v: F3 date: 05/28/2015
Memory:
RAM: total: 31.31 GiB used: 1012.9 MiB (3.2%)
CPU:
Info: model: AMD FX-8350 socket: AM3 bits: 64 type: MT MCP arch: Piledriver
built: 2012-13 process: GF 32nm family: 0x15 (21) model-id: 2 stepping: 0
microcode: 0x6000852
Graphics:
Device-1: AMD RV370 [Radeon X300] driver: radeon v: kernel
alternate: amdgpu arch: Rage 9 code: R360-R400 process: TSMC 110nm
built: 2003-08 pcie: gen: 1 speed: 2.5 GT/s lanes: 16 ports:
active: DVI-I-1 empty: SVIDEO-1 bus-ID: 01:00.0 chip-ID: 1002:5b60
class-ID: 0300
The NULL pointer dereference occurs during GPU init of the card. These cards
are fanless and were chosen specifically for that reason. They are used in
server installs and have been flawless for years. If only one card were acting
up, I could see it being a card problem, but I have two identical servers set up
with this card and both show the exact same "BUG: kernel NULL pointer
dereference":
[ 9.660937] [drm] amdgpu kernel modesetting enabled.
[ 9.661025] amdgpu: CRAT table not found
[ 9.661028] amdgpu: Virtual CRAT table created for CPU
[ 9.661040] amdgpu: Topology: Add CPU node
[ 9.661296] [drm] initializing kernel modesetting (IP DISCOVERY
0x1002:0x5B70 0x1002:0x0F03 0x00).
[ 9.661302] amdgpu 0000:01:00.1: amdgpu: Trusted Memory Zone (TMZ) feature
disabled as experimental (default)
[ 9.661305] amdgpu 0000:01:00.1: amdgpu: Fatal error during GPU init
[ 9.661318] amdgpu: probe of 0000:01:00.1 failed with error -12
[ 9.661338] BUG: kernel NULL pointer dereference, address: 0000000000000000
Full dmesg output for this with backtrace is attached.
Bugs related to this problem are open with freedesktop and with Arch Linux.
https://gitlab.freedesktop.org/drm/amd/-/issues/2070
and
https://bugs.archlinux.org/task/74346#comment209209
Are those the proper locations for the bug report, or does a kernel bug also
need to be opened to track the issue? Let me know, and if you need any further
information from the machines, I'm happy to get it.
--
David C. Rankin, J.D.,P.E.