2024-05-16 17:57:55

by Tim Van Patten

[permalink] [raw]
Subject: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

From: Tim Van Patten <[email protected]>

The following commit updated gmc->noretry from 0 to 1 for GC HW IP
9.3.0:

commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")

This causes the device to hang when a page fault occurs, until the
device is rebooted. Instead, revert back to gmc->noretry=0 so the device
is still responsive.

Fixes: 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
Signed-off-by: Tim Van Patten <[email protected]>
---

drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
index be4629cdac049..bff54a20835f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -876,7 +876,6 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
struct amdgpu_gmc *gmc = &adev->gmc;
uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
bool noretry_default = (gc_ver == IP_VERSION(9, 0, 1) ||
- gc_ver == IP_VERSION(9, 3, 0) ||
gc_ver == IP_VERSION(9, 4, 0) ||
gc_ver == IP_VERSION(9, 4, 1) ||
gc_ver == IP_VERSION(9, 4, 2) ||
--
2.45.0.rc1.225.g2a3ae87e7f-goog



2024-05-16 19:49:59

by Alex Deucher

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

Applied. Thanks!

Alex

On Thu, May 16, 2024 at 3:46 PM Tim Van Patten <[email protected]> wrote:
>
> From: Tim Van Patten <[email protected]>
>
> The following commit updated gmc->noretry from 0 to 1 for GC HW IP
> 9.3.0:
>
> commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
>
> This causes the device to hang when a page fault occurs, until the
> device is rebooted. Instead, revert back to gmc->noretry=0 so the device
> is still responsive.
>
> Fixes: 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
> Signed-off-by: Tim Van Patten <[email protected]>
> ---
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index be4629cdac049..bff54a20835f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -876,7 +876,6 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
> struct amdgpu_gmc *gmc = &adev->gmc;
> uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
> bool noretry_default = (gc_ver == IP_VERSION(9, 0, 1) ||
> - gc_ver == IP_VERSION(9, 3, 0) ||
> gc_ver == IP_VERSION(9, 4, 0) ||
> gc_ver == IP_VERSION(9, 4, 1) ||
> gc_ver == IP_VERSION(9, 4, 2) ||
> --
> 2.45.0.rc1.225.g2a3ae87e7f-goog
>

2024-05-17 06:35:23

by Christian König

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

Am 16.05.24 um 19:57 schrieb Tim Van Patten:
> From: Tim Van Patten <[email protected]>
>
> The following commit updated gmc->noretry from 0 to 1 for GC HW IP
> 9.3.0:
>
> commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
>
> This causes the device to hang when a page fault occurs, until the
> device is rebooted. Instead, revert back to gmc->noretry=0 so the device
> is still responsive.

Wait a second. Why does the device hang on a page fault? That shouldn't
happen independent of noretry.

So that strongly sounds like this is just hiding a bug elsewhere.

Regards,
Christian.

>
> Fixes: 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
> Signed-off-by: Tim Van Patten <[email protected]>
> ---
>
> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> index be4629cdac049..bff54a20835f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> @@ -876,7 +876,6 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
> struct amdgpu_gmc *gmc = &adev->gmc;
> uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
> bool noretry_default = (gc_ver == IP_VERSION(9, 0, 1) ||
> - gc_ver == IP_VERSION(9, 3, 0) ||
> gc_ver == IP_VERSION(9, 4, 0) ||
> gc_ver == IP_VERSION(9, 4, 1) ||
> gc_ver == IP_VERSION(9, 4, 2) ||


2024-05-17 15:31:55

by Tim Van Patten

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

On Fri, May 17, 2024 at 12:35 AM Christian König
<[email protected]> wrote:
>
> Am 16.05.24 um 19:57 schrieb Tim Van Patten:
> > From: Tim Van Patten <[email protected]>
> >
> > The following commit updated gmc->noretry from 0 to 1 for GC HW IP
> > 9.3.0:
> >
> > commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
> >
> > This causes the device to hang when a page fault occurs, until the
> > device is rebooted. Instead, revert back to gmc->noretry=0 so the device
> > is still responsive.
>
> Wait a second. Why does the device hang on a page fault? That shouldn't
> happen independent of noretry.

No idea, but hopefully someone within AMD can help answer that.

I'm not an expert in this area, I was just able to bisect to the CL
causing the change in behavior. There are other reports of people
bisecting to the same CL, so this issue appears to extend beyond
ChromeOS:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/9728#note_2063879

> So that strongly sounds like this is just hiding a bug elsewhere.

That's entirely possible, bringing the number of real issues up to (at
least) two:
1. Why the page faults are occurring to begin with.
- Any video of size 66x2158 seems to trigger the issue.
2. Why the page faults result in the device hanging with gmc->noretry=1.

I've asked prathyushi.nangia@amd (chris.kuruts@amd may be helping as
well) to look into the page faults further, since they can't hang the
device if they don't exist. She should be able to provide any further
details if you're interested, but please feel free to reach out to me
as well if you have any other questions.

2024-05-17 15:52:53

by Alex Deucher

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

On Fri, May 17, 2024 at 2:35 AM Christian König
<[email protected]> wrote:
>
> Am 16.05.24 um 19:57 schrieb Tim Van Patten:
> > From: Tim Van Patten <[email protected]>
> >
> > The following commit updated gmc->noretry from 0 to 1 for GC HW IP
> > 9.3.0:
> >
> > commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
> >
> > This causes the device to hang when a page fault occurs, until the
> > device is rebooted. Instead, revert back to gmc->noretry=0 so the device
> > is still responsive.
>
> Wait a second. Why does the device hang on a page fault? That shouldn't
> happen independent of noretry.
>
> So that strongly sounds like this is just hiding a bug elsewhere.

Fair enough, but this is also the only gfx9 APU which defaults to
noretry=1, all of the rest are dGPUs. I'd argue it should align with
the other GFX9 APUs or they should all enable noretry=1.

Alex

>
> Regards,
> Christian.
>
> >
> > Fixes: 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
> > Signed-off-by: Tim Van Patten <[email protected]>
> > ---
> >
> > drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 -
> > 1 file changed, 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > index be4629cdac049..bff54a20835f1 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
> > @@ -876,7 +876,6 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
> > struct amdgpu_gmc *gmc = &adev->gmc;
> > uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
> > bool noretry_default = (gc_ver == IP_VERSION(9, 0, 1) ||
> > - gc_ver == IP_VERSION(9, 3, 0) ||
> > gc_ver == IP_VERSION(9, 4, 0) ||
> > gc_ver == IP_VERSION(9, 4, 1) ||
> > gc_ver == IP_VERSION(9, 4, 2) ||
>

2024-05-17 17:28:52

by Tim Van Patten

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

> Fair enough, but this is also the only gfx9 APU which defaults to
> noretry=1, all of the rest are dGPUs. I'd argue it should align with
> the other GFX9 APUs or they should all enable noretry=1.

Do you mean we should remove all IP_VERSION(9, X, X) entries from
amdgpu_gmc_noretry_set(), leaving just >= IP_VERSION(10, 3, 0)?

2024-05-17 20:15:46

by Alex Deucher

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

On Fri, May 17, 2024 at 1:27 PM Tim Van Patten <[email protected]> wrote:
>
> > Fair enough, but this is also the only gfx9 APU which defaults to
> > noretry=1, all of the rest are dGPUs. I'd argue it should align with
> > the other GFX9 APUs or they should all enable noretry=1.
>
> Do you mean we should remove all IP_VERSION(9, X, X) entries from
> amdgpu_gmc_noretry_set(), leaving just >= IP_VERSION(10, 3, 0)?

No, just take your patch as is. All of the other 9.x IP versions in
that check are dGPUs. 9.3.0 was the only APU in that list.

Alex

2024-05-21 11:05:56

by Christian König

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Remove GC HW IP 9.3.0 from noretry=1

Am 17.05.24 um 17:46 schrieb Alex Deucher:
> On Fri, May 17, 2024 at 2:35 AM Christian König
> <[email protected]> wrote:
>> Am 16.05.24 um 19:57 schrieb Tim Van Patten:
>>> From: Tim Van Patten <[email protected]>
>>>
>>> The following commit updated gmc->noretry from 0 to 1 for GC HW IP
>>> 9.3.0:
>>>
>>> commit 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
>>>
>>> This causes the device to hang when a page fault occurs, until the
>>> device is rebooted. Instead, revert back to gmc->noretry=0 so the device
>>> is still responsive.
>> Wait a second. Why does the device hang on a page fault? That shouldn't
>> happen independent of noretry.
>>
>> So that strongly sounds like this is just hiding a bug elsewhere.
> Fair enough, but this is also the only gfx9 APU which defaults to
> noretry=1, all of the rest are dGPUs. I'd argue it should align with
> the other GFX9 APUs or they should all enable noretry=1.

Completely agree.

It's just that while the hardware should theoretically be able to handle
recoverable page faults it's just that this features is never tested on
APUs because our hw engineering assumes that they don't have to support
the use case. That's also the reason why we physically don't have the
second IH ring on APUs.

I strongly suggest that instead of doing that for each hw generations
individually to just disallow enabling retry on APUs.

Alternatively we could start testing it on hw and sw side and try to fix
all the bugs.

Regards,
Christian.

>
> Alex
>
>> Regards,
>> Christian.
>>
>>> Fixes: 5f3854f1f4e2 ("drm/amdgpu: add more cases to noretry=1")
>>> Signed-off-by: Tim Van Patten <[email protected]>
>>> ---
>>>
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 1 -
>>> 1 file changed, 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> index be4629cdac049..bff54a20835f1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
>>> @@ -876,7 +876,6 @@ void amdgpu_gmc_noretry_set(struct amdgpu_device *adev)
>>> struct amdgpu_gmc *gmc = &adev->gmc;
>>> uint32_t gc_ver = amdgpu_ip_version(adev, GC_HWIP, 0);
>>> bool noretry_default = (gc_ver == IP_VERSION(9, 0, 1) ||
>>> - gc_ver == IP_VERSION(9, 3, 0) ||
>>> gc_ver == IP_VERSION(9, 4, 0) ||
>>> gc_ver == IP_VERSION(9, 4, 1) ||
>>> gc_ver == IP_VERSION(9, 4, 2) ||