2023-12-15 11:56:52

by Juergen Gross

Subject: Crashes under Xen with Radeon graphics card

Hi,

I recently came across a test system that crashes, probably because memory
is being overwritten randomly.

The problem occurs only in Dom0 when running under Xen. It has been present
since at least kernel 6.3 (I haven't gone back further yet), and it seems
NOT to be present in kernel 5.14.

I tracked the problem down to the initialization of the graphics card (the
problem might surface only later, but at least an early initialization
failure made the problem go away).

# lspci
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos XTX [Radeon HD 8490 / R5 235X OEM]
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Caicos HDMI Audio [Radeon HD 6450 / 7450/8450/8490 OEM / R5 230/235/235X OEM]

I had one working .config and one that produced the crashes, and I narrowed
the relevant difference down to firmware loading (the working .config didn't
have CONFIG_FW_LOADER_COMPRESS_XZ set, so firmware loading for the card
failed). This was of course not the real problem, but it caused the card
initialization to fail.
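
For reference, a minimal sketch of how such a .config difference could be checked. Only the option name CONFIG_FW_LOADER_COMPRESS_XZ comes from the report; the helper name fw_xz_enabled and the two file paths are hypothetical placeholders.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel .config enables
# loading of XZ-compressed firmware (CONFIG_FW_LOADER_COMPRESS_XZ=y).
fw_xz_enabled() {
    grep -q '^CONFIG_FW_LOADER_COMPRESS_XZ=y$' "$1"
}

# Example: compare two configs (both paths are placeholders).
for cfg in ./config.working ./config.crashing; do
    [ -f "$cfg" ] || continue
    if fw_xz_enabled "$cfg"; then
        echo "$cfg: XZ firmware decompression enabled"
    else
        echo "$cfg: XZ firmware decompression not enabled"
    fi
done
```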

I manually decompressed the firmware files one by one to see whether the
problem is in the decompressor or, more probably, in the driver of the card.
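
A rough sketch of what decompressing one firmware file at a time could look like. The helper name decompress_fw is hypothetical, and /lib/firmware as the firmware directory is an assumption that is not stated in the report; the firmware file names are the ones from the logs.

```shell
#!/bin/sh
# Hypothetical helper: decompress one XZ-compressed firmware image next to
# its compressed copy. The .xz file is kept (-k) so the step can be undone
# by deleting the decompressed file again.
# Usage: decompress_fw <firmware-dir> <name>
decompress_fw() {
    fw_dir=$1
    name=$2
    xz -dk "$fw_dir/$name.xz"
}

# Example (as root, one file per boot test; /lib/firmware is an assumption):
#   decompress_fw /lib/firmware radeon/CAICOS_pfp.bin
#   decompress_fw /lib/firmware radeon/CAICOS_me.bin
```

With CONFIG_FW_LOADER_COMPRESS_XZ unset, only the files decompressed this way can be loaded, which is what allows stepping through the firmware files one boot at a time.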

The last step without crash was:

# dmesg | grep radeon
[ 10.106405] [drm] radeon kernel modesetting enabled.
[ 10.106455] radeon 0000:01:00.0: vgaarb: deactivate vga console
[ 10.222944] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
[ 10.252921] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
[ 10.278255] [drm] radeon: 1024M of VRAM memory ready
[ 10.295828] [drm] radeon: 1024M of GTT memory ready.
[ 10.295867] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
[ 10.330846] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
[ 10.330858] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
[ 10.330870] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin failed with error -2
[ 10.380979] ni_cp: Failed to load firmware "radeon/CAICOS_mc.bin"
[ 10.381006] [drm:evergreen_init [radeon]] *ERROR* Failed to load firmware!
[ 10.405765] radeon 0000:01:00.0: Fatal error during GPU init
[ 10.432107] [drm] radeon: finishing device.
[ 10.439179] [drm] radeon: ttm finalized
[ 10.463203] radeon: probe of 0000:01:00.0 failed with error -2
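
To list at a glance which firmware images failed to load in a log like the one above, something along these lines may help. This is only a sketch: the helper name failed_firmware is hypothetical, and it assumes unwrapped dmesg lines in the "Direct firmware load for ... failed with error" form shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the name of
# every firmware image whose direct load failed, one per line.
failed_firmware() {
    sed -n 's/.*Direct firmware load for \([^ ]*\) failed with error.*/\1/p'
}

# Example (needs permission to read the kernel log):
#   dmesg | failed_firmware
```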

After also decompressing radeon/CAICOS_mc.bin I got:

# dmesg | grep radeon
[ 10.266491] [drm] radeon kernel modesetting enabled.
[ 10.266552] radeon 0000:01:00.0: vgaarb: deactivate vga console
[ 10.456047] radeon 0000:01:00.0: VRAM: 1024M 0x0000000000000000 - 0x000000003FFFFFFF (1024M used)
[ 10.470270] radeon 0000:01:00.0: GTT: 1024M 0x0000000040000000 - 0x000000007FFFFFFF
[ 10.566946] [drm] radeon: 1024M of VRAM memory ready
[ 10.576891] [drm] radeon: 1024M of GTT memory ready.
[ 10.586971] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_pfp.bin succeeded
[ 10.611886] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_me.bin succeeded
[ 10.611909] radeon 0000:01:00.0: Direct firmware load for radeon/BTC_rlc.bin succeeded
[ 10.611938] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_mc.bin succeeded
[ 10.660599] radeon 0000:01:00.0: Direct firmware load for radeon/CAICOS_smc.bin failed with error -2
[ 10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"
[ 10.661676] [drm] radeon: power management initialized
[ 10.713666] radeon 0000:01:00.0: Direct firmware load for radeon/SUMO_uvd.bin failed with error -2
[ 10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware "radeon/SUMO_uvd.bin"
[ 10.713669] radeon 0000:01:00.0: failed UVD (-2) init.
[ 10.714787] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
[ 10.809213] radeon 0000:01:00.0: WB enabled
[ 10.817528] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00
[ 10.833755] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c
[ 10.850330] radeon 0000:01:00.0: radeon: MSI limited to 32-bit
[ 10.862154] radeon 0000:01:00.0: radeon: using MSI.
[ 10.871930] [drm] radeon: irq initialized.
[ 11.062028] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:00.0 on minor 0
[ 11.119723] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[ 11.411370] fbcon: radeondrmfb (fb0) is primary device
[ 11.507252] radeon 0000:01:00.0: [drm] fb0: radeondrmfb frame buffer device
[ 11.674028] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[ 11.834317] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[ 28.313041] snd_hda_intel 0000:01:00.1: bound 0000:01:00.0 (ops radeon_audio_component_bind_ops [radeon])
[ 44.371991] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID
[ 44.428068] [drm:radeon_dvi_detect [radeon]] *ERROR* DVI-I-1: probed a monitor but no|invalid EDID

followed by a crash some seconds after the system was up.

The crashes vary, but often the kernel accesses non-canonical addresses or
tries to map illegal physical addresses. Sometimes the system just hangs,
either with soft lockups or without any further sign of life.

I can easily reproduce the problem, so any debug patches to narrow down the
problem are welcome.


Juergen



2023-12-15 16:05:00

by Deucher, Alexander

Subject: RE: Crashes under Xen with Radeon graphics card

[Public]

> -----Original Message-----
> From: Juergen Gross <[email protected]>
> Sent: Friday, December 15, 2023 6:57 AM
> To: lkml <[email protected]>; [email protected]; amd-
> [email protected]
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>; Pan, Xinhui <[email protected]>
> Subject: Crashes under Xen with Radeon graphics card
>
> [...]
> [ 10.660599] radeon 0000:01:00.0: Direct firmware load for
> radeon/CAICOS_smc.bin failed with error -2
> [ 10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"

You also need to make sure CAICOS_smc.bin is available.

> [ 10.661676] [drm] radeon: power management initialized
> [ 10.713666] radeon 0000:01:00.0: Direct firmware load for
> radeon/SUMO_uvd.bin
> failed with error -2
> [ 10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware
> "radeon/SUMO_uvd.bin"
> [ 10.713669] radeon 0000:01:00.0: failed UVD (-2) init.

And SUMO_uvd.bin.

> [...]
>
> I can easily reproduce the problem, so any debug patches to narrow down the
> problem are welcome.

There is still firmware missing that is required for proper operation. Please fix that up.

Alex

2023-12-15 16:13:20

by Juergen Gross

Subject: Re: Crashes under Xen with Radeon graphics card

On 15.12.23 17:04, Deucher, Alexander wrote:
>> [...]
>> [ 10.660599] radeon 0000:01:00.0: Direct firmware load for
>> radeon/CAICOS_smc.bin failed with error -2
>> [ 10.660601] smc: error loading firmware "radeon/CAICOS_smc.bin"
>
> You also need to make sure CAICOS_smc.bin is available.

Of course. But the system also crashes when all firmware files can be loaded.

I thought it might help to see after which firmware file the crashes start.

>
>> [ 10.661676] [drm] radeon: power management initialized
>> [ 10.713666] radeon 0000:01:00.0: Direct firmware load for
>> radeon/SUMO_uvd.bin
>> failed with error -2
>> [ 10.713668] radeon 0000:01:00.0: radeon_uvd: Can't load firmware
>> "radeon/SUMO_uvd.bin"
>> [ 10.713669] radeon 0000:01:00.0: failed UVD (-2) init.
>
> And SUMO_uvd.bin.

Sure.

>
>> [...]
>
> There are still missing firmware required for proper operation. Please fix them up.

That was the starting point, of course!

BTW, in the meantime I have tested kernel 5.19, which works. I had suspected
the patch series merging swiotlb and swiotlb-xen, but that went into v5.19,
so it is probably not to blame.


Juergen



2023-12-15 16:19:42

by Deucher, Alexander

Subject: RE: Crashes under Xen with Radeon graphics card

[AMD Official Use Only - General]

> -----Original Message-----
> From: Juergen Gross <[email protected]>
> Sent: Friday, December 15, 2023 11:13 AM
> To: Deucher, Alexander <[email protected]>; lkml <linux-
> [email protected]>; [email protected]; amd-
> [email protected]
> Cc: Koenig, Christian <[email protected]>; Pan, Xinhui
> <[email protected]>
> Subject: Re: Crashes under Xen with Radeon graphics card
>
> On 15.12.23 17:04, Deucher, Alexander wrote:
> > [...]
> > There are still missing firmware required for proper operation. Please fix
> > them up.
>
> That was the starting point, of course!

Ah, ok. Thanks for clarifying. What exactly happens when you get this crash? System hang? Kernel oops? Is there anything in the dmesg when it happens?

>
> BTW, meanwhile I have tested kernel 5.19, which is working. I suspected that
> the patch series merging swiotlb and swiotlb-xen could be to blame, but that
> went into v5.19.

Can you bisect?

Alex

2023-12-15 16:33:38

by Juergen Gross

Subject: Re: Crashes under Xen with Radeon graphics card

On 15.12.23 17:19, Deucher, Alexander wrote:
> [AMD Official Use Only - General]
>
>> -----Original Message-----
>> From: Juergen Gross <[email protected]>
>> Sent: Friday, December 15, 2023 11:13 AM
>> To: Deucher, Alexander <[email protected]>; lkml <linux-
>> [email protected]>; [email protected]; amd-
>> [email protected]
>> Cc: Koenig, Christian <[email protected]>; Pan, Xinhui
>> <[email protected]>
>> Subject: Re: Crashes under Xen with Radeon graphics card
>>
>> On 15.12.23 17:04, Deucher, Alexander wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Juergen Gross <[email protected]>

...

>>>> The crashes vary, but often the kernel accesses non-canonical
>>>> addresses or tries to map illegal physical addresses. Sometimes the
>>>> system is just hanging, either with softlockups or without any further
>>>> signs of being alive.
>>>>
>>>> I can easily reproduce the problem, so any debug patches to narrow
>>>> down the problem are welcome.
>>>
>>> There are still missing firmware required for proper operation. Please fix
>>> them up.
>>
>> That was the starting point, of course!
>
> Ah, ok. Thanks for clarifying. What exactly happens when you get this crash? System hang? Kernel oops? Is there anything in the dmesg when it happens?

As I wrote above, the symptoms vary quite a bit. The crash normally happens
within 20 seconds after the system is completely up. In one case it
survived about 2 minutes.

One example:

[ 64.549114] BUG: unable to handle page fault for address: ffff888121291000
[ 64.562850] #PF: supervisor write access in kernel mode
[ 64.573352] #PF: error_code(0x0003) - permissions violation
[ 64.584589] PGD 2836067 P4D 2836067 PUD 3e73f7067 PMD 3e72ed067 PTE 8010000121291025
[ 64.600212] Oops: 0003 [#1] PREEMPT SMP NOPTI
[ 64.608985] CPU: 3 PID: 2090 Comm: kioslave5 Tainted: G E 6.7.0-rc5-default #974
[ 64.626721] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A25 05/30/2019
[ 64.641193] RIP: e030:clear_page_erms+0x7/0x10
[ 64.650161] Code: 48 89 47 38 48 8d 7f 40 75 d9 90 c3 cc cc cc cc 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 b9 00 10 00 00 31 c0 <f3> aa c3 cc cc cc cc 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90
[ 64.687996] RSP: e02b:ffffc9004206fb50 EFLAGS: 00010246
[ 64.698378] RAX: 0000000000000000 RBX: ffffea000484a400 RCX: 0000000000001000
[ 64.712780] RDX: 0000000000052dc0 RSI: 0000000000000003 RDI: ffff888121291000
[ 64.727154] RBP: 0000000000000901 R08: ffffea000484a440 R09: ffffea000484a600
[ 64.741491] R10: 0000000000000002 R11: 000000000000241e R12: ffff8883e7d21d80
[ 64.755843] R13: 000000000028d834 R14: 0000000000000901 R15: ffffea000484a400
[ 64.770207] FS: 00007f4c2b79d280(0000) GS:ffff888409380000(0000) knlGS:0000000000000000
[ 64.786487] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 64.798019] CR2: ffff888121291000 CR3: 000000014fef4000 CR4: 0000000000050660
[ 64.812411] Call Trace:
[ 64.817308] <TASK>
[ 64.821625] ? __die_body+0x1a/0x60
[ 64.828746] ? page_fault_oops+0x151/0x470
[ 64.837065] ? search_bpf_extables+0x65/0x70
[ 64.845717] ? fixup_exception+0x22/0x320
[ 64.853844] ? exc_page_fault+0xb3/0x150
[ 64.861792] ? asm_exc_page_fault+0x22/0x30
[ 64.870275] ? clear_page_erms+0x7/0x10
[ 64.878050] prep_new_page+0x97/0xb0
[ 64.885308] get_page_from_freelist+0x7a4/0x1f40
[ 64.894678] __alloc_pages+0x18b/0x350
[ 64.902270] ? kvmalloc_node+0x3a/0xd0
[ 64.909892] __kmalloc_large_node+0x7a/0x140
[ 64.918542] __kmalloc_node+0xc1/0x130
[ 64.926149] kvmalloc_node+0x3a/0xd0
[ 64.933399] proc_sys_call_handler+0xfa/0x230
[ 64.942259] vfs_read+0x22f/0x2e0
[ 64.949007] ksys_read+0xa5/0xe0
[ 64.955527] do_syscall_64+0x5d/0xe0
[ 64.962806] ? do_user_addr_fault+0x5b3/0x8a0
[ 64.971647] ? exc_page_fault+0x6f/0x150
[ 64.979587] entry_SYSCALL_64_after_hwframe+0x6f/0x77
[ 64.989821] RIP: 0033:0x7f4c29f06a3e
[ 64.997098] Code: 08 e8 f4 1e 02 00 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 64 8b 04 25 18 00 00 00 85 c0 75 14 0f 05 <48> 3d 00 f0 ff ff 77 5a f3 c3 0f 1f 84 00 00 00 00 00 41 54 55 49
[ 65.034962] RSP: 002b:00007ffd5a86f2b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 65.050071] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f4c29f06a3e
[ 65.064415] RDX: 0000000000004000 RSI: 0000000002562c18 RDI: 0000000000000004
[ 65.078775] RBP: 0000000002561d60 R08: 00007f4c2abd3418 R09: 0000000000000028
[ 65.093155] R10: 000000000253b010 R11: 0000000000000246 R12: 0000000000004000
[ 65.107492] R13: 0000000000004000 R14: 0000000000000004 R15: 0000000002562c18
[ 65.121850] </TASK>
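
As a reading aid for the oops above, here is a small sketch that decodes the page-fault error code and the low flag bits of the PTE printed in the trace. The helper names are hypothetical; the bit meanings are the standard x86 ones (error code bit 0 = protection violation, bit 1 = write, bit 2 = user mode; PTE bit 0 = present, bit 1 = writable, bit 2 = user, bit 5 = accessed, bit 6 = dirty). Fed with error_code 0x0003 and PTE 8010000121291025, it indicates a supervisor write to a page that is present but not mapped writable.

```shell
#!/bin/sh
# Hypothetical helper: decode bits 0-2 of an x86 page-fault error code.
decode_pf_error() {
    code=$(( $1 ))
    [ $(( code & 1 )) -ne 0 ] && p="protection violation" || p="page not present"
    [ $(( code & 2 )) -ne 0 ] && w="write" || w="read"
    [ $(( code & 4 )) -ne 0 ] && u="user" || u="supervisor"
    echo "$u $w, $p"
}

# Hypothetical helper: decode the low flag bits of an x86-64 PTE given as
# hex digits without "0x". Only the last three hex digits are evaluated,
# which avoids 64-bit overflow issues in shell arithmetic.
decode_pte_flags() {
    low=${1#"${1%???}"}
    flags=$(( 0x$low ))
    echo "present=$(( flags & 1 )) writable=$(( (flags >> 1) & 1 ))" \
         "user=$(( (flags >> 2) & 1 )) accessed=$(( (flags >> 5) & 1 ))" \
         "dirty=$(( (flags >> 6) & 1 ))"
}

decode_pf_error 0x0003
decode_pte_flags 8010000121291025
```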

>
>>
>> BTW, meanwhile I have tested kernel 5.19, which is working. I suspected that
>> the patch series merging swiotlb and swiotlb-xen could be to blame, but that
>> went into v5.19.
>
> Can you bisect?

I can try to find the offending commit, sure. I just wanted to share my current
findings in the hope that someone might have an idea ...
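
For reference, a sketch of how such a bisection could be set up with git bisect, using the versions mentioned in this thread as end points (v5.19 works, v6.3 crashes). The helper name start_bisect and the tree path in the example are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: start a bisection in the given kernel tree between a
# known-good and a known-bad revision.
# Usage: start_bisect <tree> <good-rev> <bad-rev>
start_bisect() {
    tree=$1
    good=$2
    bad=$3
    git -C "$tree" bisect start "$bad" "$good"
}

# Example (the tree path is a placeholder):
#   start_bisect ~/src/linux v5.19 v6.3
# Then, for each revision git checks out: build it, boot it as Dom0 under
# Xen, and report the result with one of
#   git bisect good    # no crash
#   git bisect bad     # crash reproduced
# until git names the first bad commit; "git bisect reset" ends the session.
```

Since the crash normally shows up within seconds of the system being up but once took about 2 minutes, it seems prudent to let each test kernel run for a while before marking it good.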


Juergen

