2020-11-23 13:47:26

by Will Deacon

[permalink] [raw]
Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

Edgar Merger reports that the AMD Raven GPU does not work reliably on
his system when the IOMMU is enabled:

| [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1, emitted seq=3
| [...]
| amdgpu 0000:0b:00.0: GPU reset begin!
| AMD-Vi: Completion-Wait loop timed out
| iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x38edc0970]

This is indicative of a hardware/platform configuration issue so, since
disabling ATS has been shown to resolve the problem, add a quirk to
match this particular device while Edgar follows up with AMD for more
information.

Cc: Bjorn Helgaas <[email protected]>
Cc: Alex Deucher <[email protected]>
Reported-by: Edgar Merger <[email protected]>
Suggested-by: Joerg Roedel <[email protected]>
Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
Signed-off-by: Will Deacon <[email protected]>
---

Hi all,

Since Joerg is away at the moment, I'm posting this to try to make some
progress with the thread in the Link: tag.

Cheers,

Will

drivers/pci/quirks.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index f70692ac79c5..3911b0ec57ba 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5176,6 +5176,8 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats);
 /* AMD Navi14 dGPU */
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats);
+/* AMD Raven platform iGPU */
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats);
 #endif /* CONFIG_PCI_ATS */

 /* Freescale PCIe doesn't support MSI in RC mode */
--
2.29.2.454.gaff20da3a2-goog
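
For readers less familiar with PCI fixups, the effect of the two added lines can be modelled in user space. This is only a sketch under stated assumptions: `struct fake_pci_dev`, `fake_quirk_no_ats()` and `run_final_fixups()` are hypothetical stand-ins rather than kernel APIs, and the model assumes the quirk disables ATS by clearing a per-device field. What it illustrates is that a vendor/device table entry matches every device with ID 0x15d8, whatever board it sits on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FAKE_VENDOR_ID_ATI 0x1002	/* same value as PCI_VENDOR_ID_ATI */

/* Hypothetical user-space stand-in for struct pci_dev. */
struct fake_pci_dev {
	uint16_t vendor;
	uint16_t device;
	uint16_t ats_cap;	/* non-zero: ATS usable; 0: ATS disabled */
};

/* Hypothetical stand-in for one fixup table entry. */
struct fake_fixup {
	uint16_t vendor;
	uint16_t device;
	void (*hook)(struct fake_pci_dev *dev);
};

/* Assumption: the quirk disables ATS by clearing the capability field. */
static void fake_quirk_no_ats(struct fake_pci_dev *dev)
{
	dev->ats_cap = 0;
}

static const struct fake_fixup fixups[] = {
	{ FAKE_VENDOR_ID_ATI, 0x7340, fake_quirk_no_ats },	/* Navi14 dGPU */
	{ FAKE_VENDOR_ID_ATI, 0x15d8, fake_quirk_no_ats },	/* entry added by the patch */
};

/* Run every hook whose vendor and device IDs match the device. */
static void run_final_fixups(struct fake_pci_dev *dev)
{
	assert(dev != NULL);
	for (size_t i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++) {
		if (fixups[i].vendor == dev->vendor &&
		    fixups[i].device == dev->device)
			fixups[i].hook(dev);
	}
}
```

Calling `run_final_fixups()` on a device with IDs 0x1002:0x15d8 leaves `ats_cap` at 0, while a device with any ID not in the table is left untouched.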


2020-11-23 21:09:06

by Deucher, Alexander

[permalink] [raw]
Subject: RE: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

[AMD Public Use]

> -----Original Message-----
> From: Will Deacon <[email protected]>
> Sent: Monday, November 23, 2020 8:44 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; Will
> Deacon <[email protected]>; Bjorn Helgaas <[email protected]>;
> Deucher, Alexander <[email protected]>; Edgar Merger
> <[email protected]>; Joerg Roedel <[email protected]>
> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
>
> Edgar Merger reports that the AMD Raven GPU does not work reliably on his
> system when the IOMMU is enabled:
>
> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> signaled seq=1, emitted seq=3
> | [...]
> | amdgpu 0000:0b:00.0: GPU reset begin!
> | AMD-Vi: Completion-Wait loop timed out
> | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT
> device=0b:00.0 address=0x38edc0970]
>
> This is indicative of a hardware/platform configuration issue so, since
> disabling ATS has been shown to resolve the problem, add a quirk to match
> this particular device while Edgar follows-up with AMD for more information.
>
> Cc: Bjorn Helgaas <[email protected]>
> Cc: Alex Deucher <[email protected]>
> Reported-by: Edgar Merger <[email protected]>
> Suggested-by: Joerg Roedel <[email protected]>
> Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> Signed-off-by: Will Deacon <[email protected]>
> ---
>
> Hi all,
>
> Since Joerg is away at the moment, I'm posting this to try to make some
> progress with the thread in the Link: tag.

+ Felix

What system is this? Can you provide more details? Does a sbios update fix this? Disabling ATS for all Ravens will break GPU compute for a lot of people. I'd prefer to just black list this particular system (e.g., just SSIDs or revision) if possible.

Alex

>
> Cheers,
>
> Will
>
> drivers/pci/quirks.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index f70692ac79c5..3911b0ec57ba 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -5176,6 +5176,8 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x6900, quirk_amd_harvest_no_ats);
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7312, quirk_amd_harvest_no_ats);
> /* AMD Navi14 dGPU */
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x7340, quirk_amd_harvest_no_ats);
> +/* AMD Raven platform iGPU */
> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_ATI, 0x15d8, quirk_amd_harvest_no_ats);
> #endif /* CONFIG_PCI_ATS */
>
> /* Freescale PCIe doesn't support MSI in RC mode */
> --
> 2.29.2.454.gaff20da3a2-goog

2020-11-23 22:36:00

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander wrote:
> [AMD Public Use]
>
> > -----Original Message-----
> > From: Will Deacon <[email protected]>
> > Sent: Monday, November 23, 2020 8:44 AM
> > To: [email protected]
> > Cc: [email protected]; [email protected]; Will
> > Deacon <[email protected]>; Bjorn Helgaas <[email protected]>;
> > Deucher, Alexander <[email protected]>; Edgar Merger
> > <[email protected]>; Joerg Roedel <[email protected]>
> > Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> >
> > Edgar Merger reports that the AMD Raven GPU does not work reliably on his
> > system when the IOMMU is enabled:
> >
> > | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> > signaled seq=1, emitted seq=3
> > | [...]
> > | amdgpu 0000:0b:00.0: GPU reset begin!
> > | AMD-Vi: Completion-Wait loop timed out
> > | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT
> > device=0b:00.0 address=0x38edc0970]
> >
> > This is indicative of a hardware/platform configuration issue so, since
> > disabling ATS has been shown to resolve the problem, add a quirk to match
> > this particular device while Edgar follows-up with AMD for more information.
> >
> > Cc: Bjorn Helgaas <[email protected]>
> > Cc: Alex Deucher <[email protected]>
> > Reported-by: Edgar Merger <[email protected]>
> > Suggested-by: Joerg Roedel <[email protected]>
> > Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> > Signed-off-by: Will Deacon <[email protected]>
> > ---
> >
> > Hi all,
> >
> > Since Joerg is away at the moment, I'm posting this to try to make some
> > progress with the thread in the Link: tag.
>
> + Felix
>
> What system is this? Can you provide more details? Does a sbios update
> fix this? Disabling ATS for all Ravens will break GPU compute for a lot
> of people. I'd prefer to just black list this particular system (e.g.,
> just SSIDs or revision) if possible.

Cheers, Alex. I'll have to defer to Edgar for the details, as my
understanding from the original thread over at:

https://lore.kernel.org/linux-iommu/MWHPR10MB1310CDB6829DDCF5EA84A14689150@MWHPR10MB1310.namprd10.prod.outlook.com/

is that this is a board developed by his company.

Edgar -- please can you answer Alex's questions?

Will

2020-11-23 22:54:09

by Felix Kuehling

[permalink] [raw]
Subject: Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

On 2020-11-23 5:33 p.m., Will Deacon wrote:
> On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander wrote:
>> [AMD Public Use]
>>
>>> -----Original Message-----
>>> From: Will Deacon <[email protected]>
>>> Sent: Monday, November 23, 2020 8:44 AM
>>> To: [email protected]
>>> Cc: [email protected]; [email protected]; Will
>>> Deacon <[email protected]>; Bjorn Helgaas <[email protected]>;
>>> Deucher, Alexander <[email protected]>; Edgar Merger
>>> <[email protected]>; Joerg Roedel <[email protected]>
>>> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
>>>
>>> Edgar Merger reports that the AMD Raven GPU does not work reliably on his
>>> system when the IOMMU is enabled:
>>>
>>> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
>>> signaled seq=1, emitted seq=3
>>> | [...]
>>> | amdgpu 0000:0b:00.0: GPU reset begin!
>>> | AMD-Vi: Completion-Wait loop timed out
>>> | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT
>>> device=0b:00.0 address=0x38edc0970]
>>>
>>> This is indicative of a hardware/platform configuration issue so, since
>>> disabling ATS has been shown to resolve the problem, add a quirk to match
>>> this particular device while Edgar follows-up with AMD for more information.
>>>
>>> Cc: Bjorn Helgaas <[email protected]>
>>> Cc: Alex Deucher <[email protected]>
>>> Reported-by: Edgar Merger <[email protected]>
>>> Suggested-by: Joerg Roedel <[email protected]>
>>> Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
>>> Signed-off-by: Will Deacon <[email protected]>
>>> ---
>>>
>>> Hi all,
>>>
>>> Since Joerg is away at the moment, I'm posting this to try to make some
>>> progress with the thread in the Link: tag.
>> + Felix
>>
>> What system is this? Can you provide more details? Does a sbios update
>> fix this? Disabling ATS for all Ravens will break GPU compute for a lot
>> of people. I'd prefer to just black list this particular system (e.g.,
>> just SSIDs or revision) if possible.

+Ray

There are already many systems where the IOMMU is disabled in the BIOS,
or the CRAT table reporting the APU compute capabilities is broken. Ray
has been working on a fallback to make APUs behave like dGPUs on such
systems. That should also cover this case where ATS is blacklisted. That
said, it affects the programming model, because we don't support the
unified and coherent memory model on dGPUs like we do on APUs with
IOMMUv2. So it would be good to make the conditions for this workaround
as narrow as possible.
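
The trade-off described above can be sketched as a small decision function. This is an illustrative model only, not the KFD implementation: `struct apu_platform_state`, `enum compute_path` and `select_compute_path()` are hypothetical names, and treating "ATS disabled by a quirk" as a third trigger for the fallback is an assumption based on the remark that the fallback should also cover that case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of platform state; none of these names are KFD's. */
struct apu_platform_state {
	bool iommu_v2_enabled;	/* IOMMUv2 enabled by the firmware */
	bool crat_valid;	/* ACPI CRAT table reports APU compute correctly */
	bool ats_usable;	/* ATS not disabled by a PCI quirk */
};

enum compute_path {
	COMPUTE_PATH_APU,	/* unified, coherent memory model via IOMMUv2 */
	COMPUTE_PATH_DGPU,	/* fallback: APU handled like a discrete GPU */
};

/*
 * Use the APU path only when every prerequisite holds; any missing
 * prerequisite selects the dGPU-style fallback, which changes the
 * programming model seen by compute applications.
 */
static enum compute_path select_compute_path(const struct apu_platform_state *s)
{
	assert(s != NULL);
	if (s->iommu_v2_enabled && s->crat_valid && s->ats_usable)
		return COMPUTE_PATH_APU;
	return COMPUTE_PATH_DGPU;
}
```

In this model, a quirk that disables ATS for all devices with a given ID moves all of them to `COMPUTE_PATH_DGPU`, which is why a narrow match condition matters.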

These are the relevant changes in KFD and Thunk for reference:

### KFD ###

commit 914913ab04dfbcd0226ecb6bc99d276832ea2908
Author: Huang Rui <[email protected]>
Date:   Tue Aug 18 14:54:23 2020 +0800

    drm/amdkfd: implement the dGPU fallback path for apu (v6)

    We still have a few iommu issues which need to address, so force raven
    as "dgpu" path for the moment.

    This is to add the fallback path to bypass IOMMU if IOMMU v2 is disabled
    or ACPI CRAT table not correct.

    v2: Use ignore_crat parameter to decide whether it will go with IOMMUv2.
    v3: Align with existed thunk, don't change the way of raven, only renoir
        will use "dgpu" path by default.
    v4: don't update global ignore_crat in the driver, and revise fallback
        function if CRAT is broken.
    v5: refine acpi crat good but no iommu support case, and rename the
        title.
    v6: fix the issue of dGPU initialized firstly, just modify the report
        value in the node_show().

    Signed-off-by: Huang Rui <[email protected]>
    Reviewed-by: Felix Kuehling <[email protected]>
    Signed-off-by: Alex Deucher <[email protected]>

### Thunk ###

commit e32482fa4b9ca398c8bdc303920abfd672592764
Author: Huang Rui <[email protected]>
Date:   Tue Aug 18 18:54:05 2020 +0800

    libhsakmt: remove is_dgpu flag in the hsa_gfxip_table

    Whether use dgpu path will check the props which exposed from kernel.
    We won't need hard code in the ASIC table.

    Signed-off-by: Huang Rui <[email protected]>
    Change-Id: I0c018a26b219914a41197ff36dbec7a75945d452

commit 7c60f6d912034aa67ed27b47a29221422423f5cc
Author: Huang Rui <[email protected]>
Date:   Thu Jul 30 10:22:23 2020 +0800

    libhsakmt: implement the method that using flag which exposed by kfd to configure is_dgpu

    KFD already implemented the fallback path for APU. Thunk will use flag
    which exposed by kfd to configure is_dgpu instead of hardcode before.

    Signed-off-by: Huang Rui <[email protected]>
    Change-Id: I445f6cf668f9484dd06cd9ae1bb3cfe7428ec7eb

Regards,
  Felix


> Cheers, Alex. I'll have to defer to Edgar for the details, as my
> understanding from the original thread over at:
>
> https://lore.kernel.org/linux-iommu/MWHPR10MB1310CDB6829DDCF5EA84A14689150@MWHPR10MB1310.namprd10.prod.outlook.com/
>
> is that this is a board developed by his company.
>
> Edgar -- please can you answer Alex's questions?
>
> Will

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

This is a board developed by my company.
The Subsystem-ID is ea50:0c19 or ea50:cc10 (depending on which carrier board the compute module is attached to). However, we have not yet managed to program this Subsystem-ID into every PCI device in the system, because our UEFI firmware lacks the means to do so. This might change if we update to the latest AGESA version.
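
A narrower match along the lines requested earlier in the thread could look roughly like the sketch below. It is a user-space model, not a proposed patch: `struct fake_pci_ids` and `ats_quirk_applies()` are hypothetical, and reading "ea50:0c19" and "ea50:cc10" as subsystem vendor:device pairs is an assumption. Such a check only helps once the GPU function itself actually reports one of these subsystem IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the IDs a PCI device exposes. */
struct fake_pci_ids {
	uint16_t device;
	uint16_t subsystem_vendor;
	uint16_t subsystem_device;
};

struct fake_ssid {
	uint16_t vendor;
	uint16_t device;
};

/* Subsystem IDs quoted in this thread (assumed to be vendor:device). */
static const struct fake_ssid affected_boards[] = {
	{ 0xea50, 0x0c19 },
	{ 0xea50, 0xcc10 },
};

/* True only for device 0x15d8 on one of the listed boards. */
static bool ats_quirk_applies(const struct fake_pci_ids *ids)
{
	assert(ids != NULL);
	if (ids->device != 0x15d8)
		return false;
	for (size_t i = 0; i < sizeof(affected_boards) / sizeof(affected_boards[0]); i++) {
		if (ids->subsystem_vendor == affected_boards[i].vendor &&
		    ids->subsystem_device == affected_boards[i].device)
			return true;
	}
	return false;
}
```

With this kind of check, a 0x15d8 device on any other board keeps ATS enabled, so GPU compute on unaffected systems is not changed.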


2020-11-24 19:50:23

by Huang Rui

[permalink] [raw]
Subject: Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

On Tue, Nov 24, 2020 at 06:51:11AM +0800, Kuehling, Felix wrote:
> On 2020-11-23 5:33 p.m., Will Deacon wrote:
> > On Mon, Nov 23, 2020 at 09:04:14PM +0000, Deucher, Alexander wrote:
> >> [AMD Public Use]
> >>
> >>> -----Original Message-----
> >>> From: Will Deacon <[email protected]>
> >>> Sent: Monday, November 23, 2020 8:44 AM
> >>> To: [email protected]
> >>> Cc: [email protected]; [email protected]; Will
> >>> Deacon <[email protected]>; Bjorn Helgaas <[email protected]>;
> >>> Deucher, Alexander <[email protected]>; Edgar Merger
> >>> <[email protected]>; Joerg Roedel <[email protected]>
> >>> Subject: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken
> >>>
> >>> Edgar Merger reports that the AMD Raven GPU does not work reliably on his
> >>> system when the IOMMU is enabled:
> >>>
> >>> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> >>> signaled seq=1, emitted seq=3
> >>> | [...]
> >>> | amdgpu 0000:0b:00.0: GPU reset begin!
> >>> | AMD-Vi: Completion-Wait loop timed out
> >>> | iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT
> >>> device=0b:00.0 address=0x38edc0970]
> >>>
> >>> This is indicative of a hardware/platform configuration issue so, since
> >>> disabling ATS has been shown to resolve the problem, add a quirk to match
> >>> this particular device while Edgar follows-up with AMD for more information.
> >>>
> >>> Cc: Bjorn Helgaas <[email protected]>
> >>> Cc: Alex Deucher <[email protected]>
> >>> Reported-by: Edgar Merger <[email protected]>
> >>> Suggested-by: Joerg Roedel <[email protected]>
> >>> Link: https://lore.kernel.org/linux-iommu/MWHPR10MB1310F042A30661D4158520B589FC0@MWHPR10MB1310.namprd10.prod.outlook.com
> >>> Signed-off-by: Will Deacon <[email protected]>
> >>> ---
> >>>
> >>> Hi all,
> >>>
> >>> Since Joerg is away at the moment, I'm posting this to try to make some
> >>> progress with the thread in the Link: tag.
> >> + Felix
> >>
> >> What system is this? Can you provide more details? Does a sbios update
> >> fix this? Disabling ATS for all Ravens will break GPU compute for a lot
> >> of people. I'd prefer to just black list this particular system (e.g.,
> >> just SSIDs or revision) if possible.
>
> +Ray
>
> There are already many systems where the IOMMU is disabled in the BIOS,
> or the CRAT table reporting the APU compute capabilities is broken. Ray
> has been working on a fallback to make APUs behave like dGPUs on such
> systems. That should also cover this case where ATS is blacklisted. That
> said, it affects the programming model, because we don't support the
> unified and coherent memory model on dGPUs like we do on APUs with
> IOMMUv2. So it would be good to make the conditions for this workaround
> as narrow as possible.

Yes, besides the comments from Alex and Felix, may we get your firmware
version (SMC firmware which is from SBIOS) and device id?

> >>> | [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> >>> signaled seq=1, emitted seq=3

It looks like only the gfx IB test passed and the desktop fails to launch; am I right?

We would like to know whether it is Raven, Raven kicker (new Raven), or
Picasso. On our side, per internal test results, we have not seen a
similar issue on the Raven kicker or Picasso platforms.

Thanks,
Ray

>
> These are the relevant changes in KFD and Thunk for reference:
>
> ### KFD ###
>
> commit 914913ab04dfbcd0226ecb6bc99d276832ea2908
> Author: Huang Rui <[email protected]>
> Date:   Tue Aug 18 14:54:23 2020 +0800
>
>     drm/amdkfd: implement the dGPU fallback path for apu (v6)
>
>     We still have a few iommu issues which need to address, so force raven
>     as "dgpu" path for the moment.
>
>     This is to add the fallback path to bypass IOMMU if IOMMU v2 is disabled
>     or ACPI CRAT table not correct.
>
>     v2: Use ignore_crat parameter to decide whether it will go with IOMMUv2.
>     v3: Align with existed thunk, don't change the way of raven, only renoir
>         will use "dgpu" path by default.
>     v4: don't update global ignore_crat in the driver, and revise fallback
>         function if CRAT is broken.
>     v5: refine acpi crat good but no iommu support case, and rename the
>         title.
>     v6: fix the issue of dGPU initialized firstly, just modify the report
>         value in the node_show().
>
>     Signed-off-by: Huang Rui <[email protected]>
>     Reviewed-by: Felix Kuehling <[email protected]>
>     Signed-off-by: Alex Deucher <[email protected]>
>
> ### Thunk ###
>
> commit e32482fa4b9ca398c8bdc303920abfd672592764
> Author: Huang Rui <[email protected]>
> Date:   Tue Aug 18 18:54:05 2020 +0800
>
>     libhsakmt: remove is_dgpu flag in the hsa_gfxip_table
>
>     Whether use dgpu path will check the props which exposed from kernel.
>     We won't need hard code in the ASIC table.
>
>     Signed-off-by: Huang Rui <[email protected]>
>     Change-Id: I0c018a26b219914a41197ff36dbec7a75945d452
>
> commit 7c60f6d912034aa67ed27b47a29221422423f5cc
> Author: Huang Rui <[email protected]>
> Date:   Thu Jul 30 10:22:23 2020 +0800
>
>     libhsakmt: implement the method that using flag which exposed by kfd to configure is_dgpu
>
>     KFD already implemented the fallback path for APU. Thunk will use flag
>     which exposed by kfd to configure is_dgpu instead of hardcode before.
>
>     Signed-off-by: Huang Rui <[email protected]>
>     Change-Id: I445f6cf668f9484dd06cd9ae1bb3cfe7428ec7eb
>
> Regards,
>   Felix
>
>
> > Cheers, Alex. I'll have to defer to Edgar for the details, as my
> > understanding from the original thread over at:
> >
> > https://lore.kernel.org/linux-iommu/MWHPR10MB1310CDB6829DDCF5EA84A14689150@MWHPR10MB1310.namprd10.prod.outlook.com/
> >
> > is that this is a board developed by his company.
> >
> > Edgar -- please can you answer Alex's questions?
> >
> > Will

Subject: RE: [EXTERNAL] Re: [PATCH] PCI: Mark AMD Raven iGPU ATS as broken

Module Version : PiccasoCpu 10
AGESA Version : PiccasoPI 100A

I did not try to access the system in any way other than via the desktop (for example, via ssh).
