2023-02-23 23:41:07

by Mikhail Gavrilov

Subject: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

Hi,
I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
it is impossible to use it without AC power because the system loses the
NVMe drive when I disconnect the power adapter.

Messages from the kernel log when it happens:
nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
and report a bug

I tried the recommended parameters
(nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
this issue, but without success.

On the linux-nvme mailing list the last advice was to try the "pci=nocrs"
parameter.
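
For reference, this is roughly how such parameters end up on the kernel
command line on my GRUB-based Fedora install (a sketch; file locations
and tooling vary by distro):

# append to GRUB_CMDLINE_LINUX in /etc/default/grub, e.g.:
GRUB_CMDLINE_LINUX="... nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
# then regenerate the GRUB config (Fedora-style path, an assumption):
sudo grub2-mkconfig -o /boot/grub2/grub.cfg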

But with this parameter the amdgpu driver refuses to work, which makes
the system unbootable. I can work around the boot problem by blacklisting
the driver, but that is not a good solution because I don't want to lose
the GPU.

Why does amdgpu not work with "pci=nocrs"?
And is it possible to solve this incompatibility?
It is very important because when I boot the system without the amdgpu
driver with "pci=nocrs", the NVMe is not lost when I disconnect the power
adapter. So "pci=nocrs" really helps.

Below is what I see in the kernel log when the "pci=nocrs" parameter is added:

amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
amdgpu: ATOM BIOS: SWBRT77321.001
[drm] VCN(0) decode is enabled in VM mode
[drm] VCN(0) encode is enabled in VM mode
[drm] JPEG decode is enabled in VM mode
Console: switching to colour dummy device 80x25
amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature
disabled as experimental (default)
[drm] GPU posting now...
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment
size is 9-bit
amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 -
0x00000082FEFFFFFF (12272M used)
amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 -
0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=12272M, BAR=16384M
[drm] RAM width 192bits GDDR6
[drm] amdgpu: 12272M of VRAM memory ready
[drm] amdgpu: 31774M of GTT memory ready.
amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo
[drm] Debug VRAM access will use slowpath MM access
amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page
[drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block
<gmc_v10_0> failed -12
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.

Of course a full system log is also attached.

--
Best Regards,
Mike Gavrilov.


Attachments:
system-log-Fatal-error-during-GPU-init.tar.xz (40.03 kB)

2023-02-24 07:14:02

by Christian König

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

Hi Mikhail,

this is pretty clearly a problem with the system and/or its BIOS and
not the GPU hw or the driver.

The option pci=nocrs makes the kernel ignore additional resource windows
the BIOS reports through ACPI. This then most likely leads to problems
with amdgpu because it can't bring up its PCIe resources any more.

The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help
understand the problem, but I strongly suggest trying a BIOS update first.
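
The bus ID itself can be found with, for example (assuming the GPU shows
up as a VGA or Display controller):

lspci | grep -iE 'vga|display'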

Regards,
Christian.

On 24.02.23 at 00:40, Mikhail Gavrilov wrote:
> Hi,
> I have a laptop ASUS ROG Strix G15 Advantage Edition G513QY-HQ007. But
> it is impossible to use it without AC power because the system loses the
> NVMe drive when I disconnect the power adapter.
>
> Messages from the kernel log when it happens:
> nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
> nvme nvme0: Does your device have a faulty power saving mode enabled?
> nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
> and report a bug
>
> I tried the recommended parameters
> (nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off) to resolve
> this issue, but without success.
>
> On the linux-nvme mailing list the last advice was to try the "pci=nocrs"
> parameter.
>
> But with this parameter the amdgpu driver refuses to work, which makes
> the system unbootable. I can work around the boot problem by blacklisting
> the driver, but that is not a good solution because I don't want to lose
> the GPU.
>
> Why does amdgpu not work with "pci=nocrs"?
> And is it possible to solve this incompatibility?
> It is very important because when I boot the system without the amdgpu
> driver with "pci=nocrs", the NVMe is not lost when I disconnect the power
> adapter. So "pci=nocrs" really helps.
>
> Below is what I see in the kernel log when the "pci=nocrs" parameter is added:
>
> amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from ATRM
> amdgpu: ATOM BIOS: SWBRT77321.001
> [drm] VCN(0) decode is enabled in VM mode
> [drm] VCN(0) encode is enabled in VM mode
> [drm] JPEG decode is enabled in VM mode
> Console: switching to colour dummy device 80x25
> amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature
> disabled as experimental (default)
> [drm] GPU posting now...
> [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment
> size is 9-bit
> amdgpu 0000:03:00.0: amdgpu: VRAM: 12272M 0x0000008000000000 -
> 0x00000082FEFFFFFF (12272M used)
> amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
> amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 -
> 0x0000FFFFFFFFFFFF
> [drm] Detected VRAM RAM=12272M, BAR=16384M
> [drm] RAM width 192bits GDDR6
> [drm] amdgpu: 12272M of VRAM memory ready
> [drm] amdgpu: 31774M of GTT memory ready.
> amdgpu 0000:03:00.0: amdgpu: (-14) failed to allocate kernel bo
> [drm] Debug VRAM access will use slowpath MM access
> amdgpu 0000:03:00.0: amdgpu: Failed to DMA MAP the dummy page
> [drm:amdgpu_device_init [amdgpu]] *ERROR* sw_init of IP block
> <gmc_v10_0> failed -12
> amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
> amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
> amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
>
> Of course a full system log is also attached.
>


2023-02-24 08:39:08

by Mikhail Gavrilov

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On Fri, Feb 24, 2023 at 12:13 PM Christian König
<[email protected]> wrote:
>
> Hi Mikhail,
>
> this is pretty clearly a problem with the system and/or its BIOS and
> not the GPU hw or the driver.
>
> The option pci=nocrs makes the kernel ignore additional resource windows
> the BIOS reports through ACPI. This then most likely leads to problems
> with amdgpu because it can't bring up its PCIe resources any more.
>
> The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help
> understand the problem

I attached lspci output both with pci=nocrs and without pci=nocrs.

The differences for Cezanne Radeon Vega Series:
with pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 4: I/O ports at e000 [disabled] [size=256]
Capabilities: [c0] MSI-X: Enable- Count=4 Masked-

Without pci=nocrs:
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Interrupt: pin A routed to IRQ 44
Region 4: I/O ports at e000 [size=256]
Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-


The differences for Navi 22 Radeon 6800M:
with pci=nocrs:
Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx-
Interrupt: pin A routed to IRQ 255
Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M]
AtomicOpsCtl: ReqEn-
Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000000000000000 Data: 0000

Without pci=nocrs:
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
Stepping- SERR- FastB2B- DisINTx+
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 103
Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G]
Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M]
Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
AtomicOpsCtl: ReqEn+
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 0000
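
For completeness, the attached files were captured along these lines (a
sketch of what I ran; one boot with pci=nocrs, one without):

sudo lspci -vvvv > lspci-with-pci=nocrs.txt   # booted with pci=nocrs
sudo lspci -vvvv > lspci.txt                  # booted normally
diff -u lspci.txt lspci-with-pci=nocrs.txt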

> but I strongly suggest trying a BIOS update first.

This is the first thing that was done. And I am afraid there will be no more BIOS updates.
https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/

I also have experience in dealing with manufacturers' tech support.
Usually it ends with "we do not provide drivers for Linux".

--
Best Regards,
Mike Gavrilov.


Attachments:
lspci-with-pci=nocrs.txt (7.99 kB)
lspci.txt (8.04 kB)

2023-02-24 12:29:49

by Christian König

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On 24.02.23 at 09:38, Mikhail Gavrilov wrote:
> On Fri, Feb 24, 2023 at 12:13 PM Christian König
> <[email protected]> wrote:
>> Hi Mikhail,
>>
>> this is pretty clearly a problem with the system and/or its BIOS and
>> not the GPU hw or the driver.
>>
>> The option pci=nocrs makes the kernel ignore additional resource windows
>> the BIOS reports through ACPI. This then most likely leads to problems
>> with amdgpu because it can't bring up its PCIe resources any more.
>>
>> The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help
>> understand the problem
> I attached lspci output both with pci=nocrs and without pci=nocrs.
>
> The differences for Cezanne Radeon Vega Series:
> with pci=nocrs:
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> Interrupt: pin A routed to IRQ 255
> Region 4: I/O ports at e000 [disabled] [size=256]
> Capabilities: [c0] MSI-X: Enable- Count=4 Masked-
>
> Without pci=nocrs:
> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Interrupt: pin A routed to IRQ 44
> Region 4: I/O ports at e000 [size=256]
> Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
>
>
> The differences for Navi 22 Radeon 6800M:
> with pci=nocrs:
> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx-
> Interrupt: pin A routed to IRQ 255
> Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled] [size=16G]
> Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled] [size=256M]
> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled] [size=1M]

Well that explains it. When the PCI subsystem has to disable the BARs of
the GPU we can't access it any more.
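
This is also visible from userspace (a sketch; the bus ID is assumed,
setpci comes with pciutils):

sudo setpci -s 03:00.0 COMMAND
# bit 1 (0x2) of the command register is Memory Space Enable; while it
# stays clear, MMIO to the device is not decoded at all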

The only thing we could do is to make sure that the driver at least
fails gracefully.

Do you still have network access to the box when amdgpu fails to load,
and could you grab whatever is in dmesg?

Thanks,
Christian.

> AtomicOpsCtl: ReqEn-
> Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
> Address: 0000000000000000 Data: 0000
>
> Without pci=nocrs:
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> Stepping- SERR- FastB2B- DisINTx+
> Latency: 0, Cache Line Size: 64 bytes
> Interrupt: pin A routed to IRQ 103
> Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G]
> Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M]
> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
> AtomicOpsCtl: ReqEn+
> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Address: 00000000fee00000 Data: 0000
>
>> but I strongly suggest trying a BIOS update first.
> This is the first thing that was done. And I am afraid there will be no more BIOS updates.
> https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/
>
> I also have experience in dealing with manufacturers' tech support.
> Usually it ends with "we do not provide drivers for Linux".
>


2023-02-24 15:31:24

by Christian König

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On 24.02.23 at 13:29, Christian König wrote:
> On 24.02.23 at 09:38, Mikhail Gavrilov wrote:
>> On Fri, Feb 24, 2023 at 12:13 PM Christian König
>> <[email protected]> wrote:
>>> Hi Mikhail,
>>>
>>> this is pretty clearly a problem with the system and/or its BIOS and
>>> not the GPU hw or the driver.
>>>
>>> The option pci=nocrs makes the kernel ignore additional resource
>>> windows
>>> the BIOS reports through ACPI. This then most likely leads to problems
>>> with amdgpu because it can't bring up its PCIe resources any more.
>>>
>>> The output of "sudo lspci -vvvv -s $BUSID_OF_AMDGPU" might help
>>> understand the problem
>> I attached lspci output both with pci=nocrs and without pci=nocrs.
>>
>> The differences for Cezanne Radeon Vega Series:
>> with pci=nocrs:
>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR- FastB2B- DisINTx-
>> Interrupt: pin A routed to IRQ 255
>> Region 4: I/O ports at e000 [disabled] [size=256]
>> Capabilities: [c0] MSI-X: Enable- Count=4 Masked-
>>
>> Without pci=nocrs:
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR- FastB2B- DisINTx+
>> Interrupt: pin A routed to IRQ 44
>> Region 4: I/O ports at e000 [size=256]
>> Capabilities: [c0] MSI-X: Enable+ Count=4 Masked-
>>
>>
>> The differences for Navi 22 Radeon 6800M:
>> with pci=nocrs:
>> Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR- FastB2B- DisINTx-
>> Interrupt: pin A routed to IRQ 255
>> Region 0: Memory at f800000000 (64-bit, prefetchable) [disabled]
>> [size=16G]
>> Region 2: Memory at fc00000000 (64-bit, prefetchable) [disabled]
>> [size=256M]
>> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [disabled]
>> [size=1M]
>
> Well that explains it. When the PCI subsystem has to disable the BARs
> of the GPU we can't access it any more.
>
> The only thing we could do is to make sure that the driver at least
> fails gracefully.
>
> Do you still have network access to the box when amdgpu fails to load,
> and could you grab whatever is in dmesg?

Sorry I totally missed that you attached the full dmesg to your original
mail.

Yeah, the driver did fail gracefully. But then X doesn't come up and
then gdm just dies.

Sorry, there is really nothing we can do here; maybe ping somebody with
more ACPI background for help.

Regards,
Christian.

>
> Thanks,
> Christian.
>
>> AtomicOpsCtl: ReqEn-
>> Capabilities: [a0] MSI: Enable- Count=1/1 Maskable- 64bit+
>> Address: 0000000000000000  Data: 0000
>>
>> Without pci=nocrs:
>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>> Stepping- SERR- FastB2B- DisINTx+
>> Latency: 0, Cache Line Size: 64 bytes
>> Interrupt: pin A routed to IRQ 103
>> Region 0: Memory at f800000000 (64-bit, prefetchable) [size=16G]
>> Region 2: Memory at fc00000000 (64-bit, prefetchable) [size=256M]
>> Region 5: Memory at fca00000 (32-bit, non-prefetchable) [size=1M]
>> AtomicOpsCtl: ReqEn+
>> Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
>> Address: 00000000fee00000  Data: 0000
>>
>>> but I strongly suggest trying a BIOS update first.
>> This is the first thing that was done. And I am afraid there will be
>> no more BIOS updates.
>> https://rog.asus.com/laptops/rog-strix/2021-rog-strix-g15-advantage-edition-series/helpdesk_bios/
>>
>>
>> I also have experience in dealing with manufacturers' tech support.
>> Usually it ends with "we do not provide drivers for Linux".
>>
>


2023-02-24 16:21:31

by Mikhail Gavrilov

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On Fri, Feb 24, 2023 at 8:31 PM Christian König
<[email protected]> wrote:
>
> Sorry I totally missed that you attached the full dmesg to your original
> mail.
>
> Yeah, the driver did fail gracefully. But then X doesn't come up and
> then gdm just dies.

Are you sure that these messages should be present when the driver
fails gracefully?

turning off the locking correctness validator.
CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
#1
Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
BIOS G513QY.320 09/07/2022
Call Trace:
<TASK>
dump_stack_lvl+0x57/0x90
register_lock_class+0x47d/0x490
__lock_acquire+0x74/0x21f0
? lock_release+0x155/0x450
lock_acquire+0xd2/0x320
? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
? lock_is_held_type+0xce/0x120
_raw_spin_lock_irqsave+0x4d/0xa0
? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu]
amdgpu_driver_load_kms+0xe8/0x190 [amdgpu]
amdgpu_pci_probe+0x140/0x420 [amdgpu]
local_pci_probe+0x41/0x90
pci_device_probe+0xc3/0x230
really_probe+0x1b6/0x410
__driver_probe_device+0x78/0x170
driver_probe_device+0x1f/0x90
__driver_attach+0xd2/0x1c0
? __pfx___driver_attach+0x10/0x10
bus_for_each_dev+0x8a/0xd0
bus_add_driver+0x141/0x230
driver_register+0x77/0x120
? __pfx_init_module+0x10/0x10 [amdgpu]
do_one_initcall+0x6e/0x350
do_init_module+0x4a/0x220
__do_sys_init_module+0x192/0x1c0
do_syscall_64+0x5b/0x80
? asm_exc_page_fault+0x22/0x30
? lockdep_hardirqs_on+0x7d/0x100
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd58cfcb1be
Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01
RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be
RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010
RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0
R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670
R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0
</TASK>
amdgpu: probe of 0000:03:00.0 failed with error -12
amdgpu 0000:08:00.0: enabling device (0006 -> 0007)
[drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4).


list_add corruption. prev->next should be next (ffffffffc0940328), but
was 0000000000000000. (prev=ffff8c9b734062b0).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:30!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
#1
Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
BIOS G513QY.320 09/07/2022
RIP: 0010:__list_add_valid+0x74/0x90
Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246
RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0
R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48
R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000
FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0
PKRU: 55555554
Call Trace:
<TASK>
ttm_device_init+0x184/0x1c0 [ttm]
amdgpu_ttm_init+0xb8/0x610 [amdgpu]
? _printk+0x60/0x80
gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu]
amdgpu_device_init+0x14e5/0x2520 [amdgpu]
amdgpu_driver_load_kms+0x15/0x190 [amdgpu]
amdgpu_pci_probe+0x140/0x420 [amdgpu]
local_pci_probe+0x41/0x90
pci_device_probe+0xc3/0x230
really_probe+0x1b6/0x410
__driver_probe_device+0x78/0x170
driver_probe_device+0x1f/0x90
__driver_attach+0xd2/0x1c0
? __pfx___driver_attach+0x10/0x10
bus_for_each_dev+0x8a/0xd0
bus_add_driver+0x141/0x230
driver_register+0x77/0x120
? __pfx_init_module+0x10/0x10 [amdgpu]
do_one_initcall+0x6e/0x350
do_init_module+0x4a/0x220
__do_sys_init_module+0x192/0x1c0
do_syscall_64+0x5b/0x80
? asm_exc_page_fault+0x22/0x30
? lockdep_hardirqs_on+0x7d/0x100
entry_SYSCALL_64_after_hwframe+0x72/0xdc
RIP: 0033:0x7fd58cfcb1be
Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48
RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be
RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010
RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0
R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670
R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0
</TASK>
Modules linked in: amdgpu(+) drm_ttm_helper hid_asus ttm asus_wmi
iommu_v2 crct10dif_pclmul ledtrig_audio drm_buddy crc32_pclmul
sparse_keymap gpu_sched crc32c_intel polyval_clmulni platform_profile
hid_multitouch polyval_generic drm_display_helper nvme rfkill
ucsi_acpi ghash_clmulni_intel nvme_core typec_ucsi serio_raw
sp5100_tco ccp sha512_ssse3 r8169 cec typec nvme_common i2c_hid_acpi
video i2c_hid wmi ip6_tables ip_tables fuse
---[ end trace 0000000000000000 ]---
RIP: 0010:__list_add_valid+0x74/0x90
Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246
RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0
R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48
R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000
FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0
PKRU: 55555554
(udev-worker) (470) used greatest stack depth: 12416 bytes left

I thought that failing gracefully means switching to SVGA mode and showing
the desktop with software rendering (exactly what happens when I
blacklist the amdgpu driver). Currently the boot process gets stuck and
the local console is unavailable.


--
Best Regards,
Mike Gavrilov.

2023-02-27 10:22:50

by Christian König

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On 24.02.23 at 17:21, Mikhail Gavrilov wrote:
> On Fri, Feb 24, 2023 at 8:31 PM Christian König
> <[email protected]> wrote:
>> Sorry I totally missed that you attached the full dmesg to your original
>> mail.
>>
>> Yeah, the driver did fail gracefully. But then X doesn't come up and
>> then gdm just dies.
> Are you sure that these messages should be present when the driver
> fails gracefully?

Unfortunately yes. We could clean that up a bit more so that you don't
run into a BUG() assertion, but what essentially happens here is that we
completely fail to talk to the hardware.

In this situation we can't even re-enable vesa or text console any more.

Regards,
Christian.

>
> turning off the locking correctness validator.
> CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
> ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
> #1
> Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
> BIOS G513QY.320 09/07/2022
> Call Trace:
> <TASK>
> dump_stack_lvl+0x57/0x90
> register_lock_class+0x47d/0x490
> __lock_acquire+0x74/0x21f0
> ? lock_release+0x155/0x450
> lock_acquire+0xd2/0x320
> ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
> ? lock_is_held_type+0xce/0x120
> _raw_spin_lock_irqsave+0x4d/0xa0
> ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
> amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
> amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu]
> amdgpu_driver_load_kms+0xe8/0x190 [amdgpu]
> amdgpu_pci_probe+0x140/0x420 [amdgpu]
> local_pci_probe+0x41/0x90
> pci_device_probe+0xc3/0x230
> really_probe+0x1b6/0x410
> __driver_probe_device+0x78/0x170
> driver_probe_device+0x1f/0x90
> __driver_attach+0xd2/0x1c0
> ? __pfx___driver_attach+0x10/0x10
> bus_for_each_dev+0x8a/0xd0
> bus_add_driver+0x141/0x230
> driver_register+0x77/0x120
> ? __pfx_init_module+0x10/0x10 [amdgpu]
> do_one_initcall+0x6e/0x350
> do_init_module+0x4a/0x220
> __do_sys_init_module+0x192/0x1c0
> do_syscall_64+0x5b/0x80
> ? asm_exc_page_fault+0x22/0x30
> ? lockdep_hardirqs_on+0x7d/0x100
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fd58cfcb1be
> Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01
> RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
> RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be
> RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010
> RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0
> R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670
> R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0
> </TASK>
> amdgpu: probe of 0000:03:00.0 failed with error -12
> amdgpu 0000:08:00.0: enabling device (0006 -> 0007)
> [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4).
>
>
> list_add corruption. prev->next should be next (ffffffffc0940328), but
> was 0000000000000000. (prev=ffff8c9b734062b0).
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:30!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G L
> ------- --- 6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug
> #1
> Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY,
> BIOS G513QY.320 09/07/2022
> RIP: 0010:__list_add_valid+0x74/0x90
> Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
> 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
> 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
> RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246
> RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0
> R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48
> R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000
> FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ttm_device_init+0x184/0x1c0 [ttm]
> amdgpu_ttm_init+0xb8/0x610 [amdgpu]
> ? _printk+0x60/0x80
> gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu]
> amdgpu_device_init+0x14e5/0x2520 [amdgpu]
> amdgpu_driver_load_kms+0x15/0x190 [amdgpu]
> amdgpu_pci_probe+0x140/0x420 [amdgpu]
> local_pci_probe+0x41/0x90
> pci_device_probe+0xc3/0x230
> really_probe+0x1b6/0x410
> __driver_probe_device+0x78/0x170
> driver_probe_device+0x1f/0x90
> __driver_attach+0xd2/0x1c0
> ? __pfx___driver_attach+0x10/0x10
> bus_for_each_dev+0x8a/0xd0
> bus_add_driver+0x141/0x230
> driver_register+0x77/0x120
> ? __pfx_init_module+0x10/0x10 [amdgpu]
> do_one_initcall+0x6e/0x350
> do_init_module+0x4a/0x220
> __do_sys_init_module+0x192/0x1c0
> do_syscall_64+0x5b/0x80
> ? asm_exc_page_fault+0x22/0x30
> ? lockdep_hardirqs_on+0x7d/0x100
> entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fd58cfcb1be
> Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
> RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be
> RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010
> RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0
> R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670
> R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0
> </TASK>
> Modules linked in: amdgpu(+) drm_ttm_helper hid_asus ttm asus_wmi
> iommu_v2 crct10dif_pclmul ledtrig_audio drm_buddy crc32_pclmul
> sparse_keymap gpu_sched crc32c_intel polyval_clmulni platform_profile
> hid_multitouch polyval_generic drm_display_helper nvme rfkill
> ucsi_acpi ghash_clmulni_intel nvme_core typec_ucsi serio_raw
> sp5100_tco ccp sha512_ssse3 r8169 cec typec nvme_common i2c_hid_acpi
> video i2c_hid wmi ip6_tables ip_tables fuse
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:__list_add_valid+0x74/0x90
> Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
> 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
> 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
> RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246
> RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0
> R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48
> R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000
> FS: 00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0
> PKRU: 55555554
> (udev-worker) (470) used greatest stack depth: 12416 bytes left
>
> I thought that failing gracefully means switching to SVGA mode and showing
> the desktop with software rendering (exactly what happens when I
> blacklist the amdgpu driver). Currently the boot process gets stuck and
> the local console is unavailable.
>
>


2023-02-28 09:52:19

by Mikhail Gavrilov

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On Mon, Feb 27, 2023 at 3:22 PM Christian König
>
> Unfortunately yes. We could clean that up a bit more so that you don't
> run into a BUG() assertion, but what essentially happens here is that we
> completely fail to talk to the hardware.
>
> In this situation we can't even re-enable vesa or text console any more.
>
Then I don't understand why, when amdgpu is blacklisted via
modprobe.blacklist=amdgpu, I see graphics and can log into GNOME. Yes,
without hardware acceleration, but that is better than non-working
graphics. It means there is some other driver (I assume it is "video")
which can successfully talk to the AMD hardware under conditions where
amdgpu cannot. My suggestion is that if amdgpu fails to talk to the
hardware, then another suitable driver should be allowed to do it. I
attached a system log with "pci=nocrs" plus "modprobe.blacklist=amdgpu"
applied, to show that graphics work correctly in this case.
To do this, does the Linux module loading mechanism need to be refined?
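
For reference, the same blacklist can also be made persistent with a
modprobe.d snippet instead of the boot parameter (a sketch; the file
name is arbitrary):

echo 'blacklist amdgpu' | sudo tee /etc/modprobe.d/blacklist-amdgpu.conf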


--
Best Regards,
Mike Gavrilov.


Attachments:
system-without-amdgpu.tar.xz (40.74 kB)

2023-02-28 12:43:33

by Christian König

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On 28.02.23 at 10:52, Mikhail Gavrilov wrote:
> On Mon, Feb 27, 2023 at 3:22 PM Christian König
>> Unfortunately yes. We could clean that up a bit more so that you don't
>> run into a BUG() assertion, but what essentially happens here is that we
>> completely fail to talk to the hardware.
>>
>> In this situation we can't even re-enable vesa or text console any more.
>>
> Then I don't understand why, when amdgpu is blacklisted via
> modprobe.blacklist=amdgpu, I see graphics and can log into GNOME. Yes,
> without hardware acceleration, but that is better than non-working
> graphics. It means there is some other driver (I assume it is "video")
> which can successfully talk to the AMD hardware under conditions where
> amdgpu cannot.

The point is it doesn't need to talk to the amdgpu hardware. What it
does is that it talks to the good old VGA/VESA emulation and that just
happens to be still enabled by the BIOS/GRUB.

And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
hw running in the state where it was initialized before the kernel
started. The kernel just grabs the addresses where it needs to write the
display data and keeps going with that.

But when a hw specific driver wants to load this is the first thing
which gets disabled because we need to load new firmware. And with the
BARs disabled this can't be re-enabled without rebooting the system.
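
Which driver, if any, has bound the device and which framebuffer is
active can be checked with, e.g. (a sketch; bus ID assumed):

lspci -k -s 03:00.0   # look for the "Kernel driver in use:" line
cat /proc/fb          # lists the registered framebuffer device(s)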

> My suggestion is that if amdgpu fails to talk to the
> hardware, then another suitable driver should be allowed to do it. I
> attached a system log with "pci=nocrs" plus "modprobe.blacklist=amdgpu"
> applied, to show that graphics work correctly in this case.
> To do this, does the Linux module loading mechanism need to be refined?

That's actually working as expected. The real problem is that the BIOS
on that system is so broken that we can't access the hw correctly.

What we could do is check the BARs very early on and refuse to load
when they are disabled. The problem with this approach is that there
are systems where it is normal for the BARs to be disabled until the
driver loads, and they get enabled during the hardware initialization
process.

What you might want to look into is to find a quirk for the BIOS to
properly enable the nvme controller.

Regards,
Christian.


2023-12-15 11:47:26

by Mikhail Gavrilov

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On Tue, Feb 28, 2023 at 5:43 PM Christian König
<[email protected]> wrote:
>
> The point is it doesn't need to talk to the amdgpu hardware. What it
> does is that it talks to the good old VGA/VESA emulation and that just
> happens to be still enabled by the BIOS/GRUB.
>
> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
> hw running in the state where it was initialized before the kernel
> started. The kernel just grabs the addresses where it needs to write the
> display data and keeps going with that.
>
> But when a hw specific driver wants to load this is the first thing
> which gets disabled because we need to load new firmware. And with the
> BARs disabled this can't be re-enabled without rebooting the system.
>
> > My suggestion is that if amdgpu fails to talk to the
> > hardware, then another suitable driver should be allowed to do it. I
> > attached a system log with "pci=nocrs" plus "modprobe.blacklist=amdgpu"
> > applied, to show that graphics work correctly in this case.
> > To do this, does the Linux module loading mechanism need to be refined?
>
> That's actually working as expected. The real problem is that the BIOS
> on that system is so broken that we can't access the hw correctly.
>
> What we could do is check the BARs very early on and refuse to load
> when they are disabled. The problem with this approach is that there
> are systems where it is normal for the BARs to be disabled until the
> driver loads, and they get enabled during the hardware initialization
> process.
>
> What you might want to look into is to find a quirk for the BIOS to
> properly enable the nvme controller.
>

That's interesting. I noticed that amdgpu now works even with the
[pci=nocrs] parameter on 6.7.0-0.rc4 and higher kernels.
Does that mean the BARs became available?
I attached the kernel log and lspci output. What's changed?

--
Best Regards,
Mike Gavrilov.


Attachments:
dmesg-nvme-down-2.zip (45.48 kB)
lspci.zip (2.65 kB)

2023-12-15 12:37:52

by Christian König

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On 15.12.23 at 12:45, Mikhail Gavrilov wrote:
> On Tue, Feb 28, 2023 at 5:43 PM Christian König
> <[email protected]> wrote:
>> The point is it doesn't need to talk to the amdgpu hardware. What it
>> does is that it talks to the good old VGA/VESA emulation and that just
>> happens to be still enabled by the BIOS/GRUB.
>>
>> And that VGA/VESA emulation doesn't need any BAR or whatever to keep the
>> hw running in the state where it was initialized before the kernel
>> started. The kernel just grabs the addresses where it needs to write the
>> display data and keeps going with that.
>>
>> But when a hw specific driver wants to load this is the first thing
>> which gets disabled because we need to load new firmware. And with the
>> BARs disabled this can't be re-enabled without rebooting the system.
>>
>>> My suggestion is that if amdgpu fails to talk to the
>>> hardware, then another suitable driver should be allowed to do it. I
>>> attached a system log with "pci=nocrs" plus "modprobe.blacklist=amdgpu"
>>> applied, to show that graphics work correctly in this case.
>>> To do this, does the Linux module loading mechanism need to be refined?
>> That's actually working as expected. The real problem is that the BIOS
>> on that system is so broken that we can't access the hw correctly.
>>
>> What we could do is check the BARs very early on and refuse to load
>> when they are disabled. The problem with this approach is that there
>> are systems where it is normal for the BARs to be disabled until the
>> driver loads, and they get enabled during the hardware initialization
>> process.
>>
>> What you might want to look into is to find a quirk for the BIOS to
>> properly enable the nvme controller.
>>
> That's interesting. I noticed that amdgpu now works even with the
> [pci=nocrs] parameter on 6.7.0-0.rc4 and higher kernels.
> Does that mean the BARs became available?
> I attached the kernel log and lspci output. What's changed?

I have no idea :)

From the logs I can see that the AMDGPU now has the proper BARs assigned:

[    5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000
[    5.722051] pci 0000:03:00.0: reg 0x10: [mem
0xf800000000-0xfbffffffff 64bit pref]
[    5.722081] pci 0000:03:00.0: reg 0x18: [mem
0xfc00000000-0xfc0fffffff 64bit pref]
[    5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff]
[    5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref]
[    5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
[    5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth,
limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048
Gb/s with 16.0 GT/s PCIe x16 link)

And with that the driver can work perfectly fine.

Have you updated the BIOS or added/removed some other hardware? Maybe
somebody added a quirk for your BIOS into the PCIe code or something
like that.
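
One quick way to hunt for candidate changes between the two kernels
would be something like (a sketch):

git log --oneline v6.6..v6.7-rc4 -- drivers/pci drivers/iommu/amd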

Regards,
Christian.




2023-12-19 09:46:06

by Mikhail Gavrilov

Subject: Re: amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

On Fri, Dec 15, 2023 at 5:37 PM Christian König
<[email protected]> wrote:
>
> I have no idea :)
>
> From the logs I can see that the AMDGPU now has the proper BARs assigned:
>
> [ 5.722015] pci 0000:03:00.0: [1002:73df] type 00 class 0x038000
> [ 5.722051] pci 0000:03:00.0: reg 0x10: [mem
> 0xf800000000-0xfbffffffff 64bit pref]
> [ 5.722081] pci 0000:03:00.0: reg 0x18: [mem
> 0xfc00000000-0xfc0fffffff 64bit pref]
> [ 5.722112] pci 0000:03:00.0: reg 0x24: [mem 0xfca00000-0xfcafffff]
> [ 5.722134] pci 0000:03:00.0: reg 0x30: [mem 0xfcb00000-0xfcb1ffff pref]
> [ 5.722368] pci 0000:03:00.0: PME# supported from D1 D2 D3hot D3cold
> [ 5.722484] pci 0000:03:00.0: 63.008 Gb/s available PCIe bandwidth,
> limited by 8.0 GT/s PCIe x8 link at 0000:00:01.1 (capable of 252.048
> Gb/s with 16.0 GT/s PCIe x16 link)
>
> And with that the driver can work perfectly fine.
>
> Have you updated the BIOS or added/removed some other hardware? Maybe
> somebody added a quirk for your BIOS into the PCIe code or something
> like that.

No, nothing changed in the hardware.
But I found the commit which fixes it.

> git bisect unfixed
92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6 is the first fixed commit
commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
Author: Vasant Hegde <[email protected]>
Date: Thu Sep 21 09:21:45 2023 +0000

iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

Currently we use struct iommu_dev_data.iommu_v2 to keep track of the device
ATS, PRI, and PASID capabilities. But these capabilities can be enabled
independently (except PRI requires ATS support). Hence, replace
the iommu_v2 variable with a flags variable, which keep track of the device
capabilities.

From commit 9bf49e36d718 ("PCI/ATS: Handle sharing of PF PRI Capability
with all VFs"), device PRI/PASID is shared between PF and any associated
VFs. Hence use pci_pri_supported() and pci_pasid_features() instead of
pci_find_ext_capability() to check device PRI/PASID support.

Signed-off-by: Vasant Hegde <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jerry Snitselaar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>

drivers/iommu/amd/amd_iommu_types.h | 3 ++-
drivers/iommu/amd/iommu.c | 46 ++++++++++++++++++++++---------------
2 files changed, 30 insertions(+), 19 deletions(-)


> git bisect log
git bisect start '--term-new=fixed' '--term-old=unfixed'
# status: waiting for both good and bad commits
# fixed: [33cc938e65a98f1d29d0a18403dbbee050dcad9a] Linux 6.7-rc4
git bisect fixed 33cc938e65a98f1d29d0a18403dbbee050dcad9a
# status: waiting for good commit(s), bad commit known
# unfixed: [ffc253263a1375a65fa6c9f62a893e9767fbebfa] Linux 6.6
git bisect unfixed ffc253263a1375a65fa6c9f62a893e9767fbebfa
# unfixed: [7d461b291e65938f15f56fe58da2303b07578a76] Merge tag
'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 7d461b291e65938f15f56fe58da2303b07578a76
# unfixed: [e14aec23025eeb1f2159ba34dbc1458467c4c347] s390/ap: fix AP
bus crash on early config change callback invocation
git bisect unfixed e14aec23025eeb1f2159ba34dbc1458467c4c347
# unfixed: [be3ca57cfb777ad820c6659d52e60bbdd36bf5ff] Merge tag
'media/v6.7-1' of
git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
git bisect unfixed be3ca57cfb777ad820c6659d52e60bbdd36bf5ff
# fixed: [c0d12d769299e1e08338988c7745009e0db2a4a0] Merge tag
'drm-next-2023-11-10' of git://anongit.freedesktop.org/drm/drm
git bisect fixed c0d12d769299e1e08338988c7745009e0db2a4a0
# fixed: [4bbdb725a36b0d235f3b832bd0c1e885f0442d9f] Merge tag
'iommu-updates-v6.7' of
git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
git bisect fixed 4bbdb725a36b0d235f3b832bd0c1e885f0442d9f
# unfixed: [25b6377007ebe1c3ede773fd6979f613386db000] Merge tag
'drm-next-2023-11-07' of git://anongit.freedesktop.org/drm/drm
git bisect unfixed 25b6377007ebe1c3ede773fd6979f613386db000
# unfixed: [67c0afb6424fee94238d9a32b97c407d0c97155e] Merge tag
'exfat-for-6.7-rc1-part2' of
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
git bisect unfixed 67c0afb6424fee94238d9a32b97c407d0c97155e
# unfixed: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag
'v6.6-rc7' into core
git bisect unfixed 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# fixed: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove
DMA_FQ type from domain allocation path
git bisect fixed cedc811c76778bdef91d405717acee0de54d8db5
# unfixed: [b0cc5dae1ac0c18748706a4beb636e3b726dd744] iommu/amd:
Rename ats related variables
git bisect unfixed b0cc5dae1ac0c18748706a4beb636e3b726dd744
# fixed: [5a0b11a180a9b82b4437a4be1cf73530053f139b] iommu/amd: Remove
iommu_v2 module
git bisect fixed 5a0b11a180a9b82b4437a4be1cf73530053f139b
# fixed: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6] iommu/amd:
Introduce iommu_dev_data.flags to track device capabilities
git bisect fixed 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
# unfixed: [739eb25514c90aa8ea053ed4d2b971f531e63ded] iommu/amd:
Introduce iommu_dev_data.ppr
git bisect unfixed 739eb25514c90aa8ea053ed4d2b971f531e63ded
# first fixed commit: [92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6]
iommu/amd: Introduce iommu_dev_data.flags to track device capabilities
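
For anyone reproducing this: since the newer kernel is the working one,
the run above used git bisect with swapped terms, roughly (a sketch;
the full log is above):

git bisect start --term-new=fixed --term-old=unfixed
git bisect fixed v6.7-rc4
git bisect unfixed v6.6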

--
Best Regards,
Mike Gavrilov.