2022-11-18 08:01:46

by 李真能

[permalink] [raw]
Subject: [PATCH] drm/amdgpu: add mb for si

During reboot test on arm64 platform, it may failure on boot,
so add this mb in smc.

The error message are as follows:
[ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
late_init of IP block <si_dpm> failed -22
[ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
[ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init

Signed-off-by: Zhenneng Li <[email protected]>
---
drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
index 8f994ffa9cd1..c7656f22278d 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
@@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);

+ mb();
+
if (!(rst & RST_REG) && !(clk & CK_DISABLE))
return true;

--
2.25.1



2022-11-18 08:38:46

by Christian König

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: add mb for si

Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> During reboot test on arm64 platform, it may failure on boot,
> so add this mb in smc.
>
> The error message are as follows:
> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
> late_init of IP block <si_dpm> failed -22
> [ 7.006919][ 7] [ T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
> [ 7.014224][ 7] [ T295] amdgpu 0000:04:00.0: Fatal error during GPU init

Memory barries are not supposed to be sprinkled around like this, you
need to give a detailed explanation why this is necessary.

Regards,
Christian.

>
> Signed-off-by: Zhenneng Li <[email protected]>
> ---
> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> index 8f994ffa9cd1..c7656f22278d 100644
> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>
> + mb();
> +
> if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> return true;
>


2022-11-18 09:41:15

by 李真能

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: add mb for si


在 2022/11/18 17:18, Michel Dänzer 写道:
> On 11/18/22 09:01, Christian König wrote:
>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>> During reboot test on arm64 platform, it may failure on boot,
>>> so add this mb in smc.
>>>
>>> The error message are as follows:
>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
>>>                 late_init of IP block <si_dpm> failed -22
>>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
>>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
>> Memory barries are not supposed to be sprinkled around like this, you need to give a detailed explanation why this is necessary.
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Zhenneng Li <[email protected]>
>>> ---
>>>   drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>   1 file changed, 2 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>> index 8f994ffa9cd1..c7656f22278d 100644
>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
>>>       u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>       u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>   +    mb();
>>> +
>>>       if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>           return true;
> In particular, it makes no sense in this specific place, since it cannot directly affect the values of rst & clk.

I thinks so too.

But when I do reboot test using nine desktop machines,  there maybe
report this error on one or two machines after Hundreds of times or
Thousands of times reboot test, at the beginning, I use msleep() instead
of mb(), these two methods are all works, but I don't know what is the
root case.

I use this method on other verdor's oland card, this error message are
reported again.

What could be the root reason?

test environmen:

graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87

driver: amdgpu

os: ubuntu 2004

platform: arm64

kernel: 5.4.18

>


2022-11-18 09:43:19

by Michel Dänzer

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: add mb for si

On 11/18/22 09:01, Christian König wrote:
> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>> During reboot test on arm64 platform, it may failure on boot,
>> so add this mb in smc.
>>
>> The error message are as follows:
>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init [amdgpu]] *ERROR*
>>                 late_init of IP block <si_dpm> failed -22
>> [    7.006919][ 7] [  T295] amdgpu 0000:04:00.0: amdgpu_device_ip_late_init failed
>> [    7.014224][ 7] [  T295] amdgpu 0000:04:00.0: Fatal error during GPU init
>
> Memory barries are not supposed to be sprinkled around like this, you need to give a detailed explanation why this is necessary.
>
> Regards,
> Christian.
>
>>
>> Signed-off-by: Zhenneng Li <[email protected]>
>> ---
>>   drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>> index 8f994ffa9cd1..c7656f22278d 100644
>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct amdgpu_device *adev)
>>       u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>       u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>   +    mb();
>> +
>>       if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>           return true;

In particular, it makes no sense in this specific place, since it cannot directly affect the values of rst & clk.


--
Earthling Michel Dänzer | https://redhat.com
Libre software enthusiast | Mesa and Xwayland developer


2022-11-24 10:20:36

by Quan, Evan

[permalink] [raw]
Subject: RE: [PATCH] drm/amdgpu: add mb for si

[AMD Official Use Only - General]

Could the attached patch help?

Evan
> -----Original Message-----
> From: amd-gfx <[email protected]> On Behalf Of ???
> Sent: Friday, November 18, 2022 5:25 PM
> To: Michel Dänzer <[email protected]>; Koenig, Christian
> <[email protected]>; Deucher, Alexander
> <[email protected]>
> Cc: [email protected]; Pan, Xinhui <[email protected]>;
> [email protected]; [email protected]
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>
>
> 在 2022/11/18 17:18, Michel Dänzer 写道:
> > On 11/18/22 09:01, Christian König wrote:
> >> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> >>> During reboot test on arm64 platform, it may failure on boot, so add
> >>> this mb in smc.
> >>>
> >>> The error message are as follows:
> >>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
> >>> [amdgpu]] *ERROR*
> >>>                 late_init of IP block <si_dpm> failed -22 [
> >>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> >>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
> >>> 0000:04:00.0: Fatal error during GPU init
> >> Memory barries are not supposed to be sprinkled around like this, you
> need to give a detailed explanation why this is necessary.
> >>
> >> Regards,
> >> Christian.
> >>
> >>> Signed-off-by: Zhenneng Li <[email protected]>
> >>> ---
> >>>   drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>   1 file changed, 2 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> index 8f994ffa9cd1..c7656f22278d 100644
> >>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>> amdgpu_device *adev)
> >>>       u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>       u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>   +    mb();
> >>> +
> >>>       if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>           return true;
> > In particular, it makes no sense in this specific place, since it cannot directly
> affect the values of rst & clk.
>
> I thinks so too.
>
> But when I do reboot test using nine desktop machines,  there maybe report
> this error on one or two machines after Hundreds of times or Thousands of
> times reboot test, at the beginning, I use msleep() instead of mb(), these
> two methods are all works, but I don't know what is the root case.
>
> I use this method on other verdor's oland card, this error message are
> reported again.
>
> What could be the root reason?
>
> test environmen:
>
> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>
> driver: amdgpu
>
> os: ubuntu 2004
>
> platform: arm64
>
> kernel: 5.4.18
>
> >


Attachments:
winmail.dat (25.85 kB)

2022-11-24 10:29:59

by Christian König

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: add mb for si

That's not a patch but some binary file?

Christian.

Am 24.11.22 um 11:04 schrieb Quan, Evan:
> [AMD Official Use Only - General]
>
> Could the attached patch help?
>
> Evan
>> -----Original Message-----
>> From: amd-gfx <[email protected]> On Behalf Of ???
>> Sent: Friday, November 18, 2022 5:25 PM
>> To: Michel Dänzer <[email protected]>; Koenig, Christian
>> <[email protected]>; Deucher, Alexander
>> <[email protected]>
>> Cc: [email protected]; Pan, Xinhui <[email protected]>;
>> [email protected]; [email protected]
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>> On 11/18/22 09:01, Christian König wrote:
>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>> During reboot test on arm64 platform, it may failure on boot, so add
>>>>> this mb in smc.
>>>>>
>>>>> The error message are as follows:
>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
>>>>> [amdgpu]] *ERROR*
>>>>>                 late_init of IP block <si_dpm> failed -22 [
>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
>>>>> 0000:04:00.0: Fatal error during GPU init
>>>> Memory barries are not supposed to be sprinkled around like this, you
>> need to give a detailed explanation why this is necessary.
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Zhenneng Li <[email protected]>
>>>>> ---
>>>>>   drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>   1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>> amdgpu_device *adev)
>>>>>       u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>       u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>   +    mb();
>>>>> +
>>>>>       if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>           return true;
>>> In particular, it makes no sense in this specific place, since it cannot directly
>> affect the values of rst & clk.
>>
>> I thinks so too.
>>
>> But when I do reboot test using nine desktop machines,  there maybe report
>> this error on one or two machines after Hundreds of times or Thousands of
>> times reboot test, at the beginning, I use msleep() instead of mb(), these
>> two methods are all works, but I don't know what is the root case.
>>
>> I use this method on other verdor's oland card, this error message are
>> reported again.
>>
>> What could be the root reason?
>>
>> test environmen:
>>
>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>
>> driver: amdgpu
>>
>> os: ubuntu 2004
>>
>> platform: arm64
>>
>> kernel: 5.4.18
>>

2022-11-24 11:13:53

by Lazar, Lijo

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: add mb for si



On 11/24/2022 3:34 PM, Quan, Evan wrote:
> [AMD Official Use Only - General]
>
> Could the attached patch help?
>
> Evan
>> -----Original Message-----
>> From: amd-gfx <[email protected]> On Behalf Of ???
>> Sent: Friday, November 18, 2022 5:25 PM
>> To: Michel Dänzer <[email protected]>; Koenig, Christian
>> <[email protected]>; Deucher, Alexander
>> <[email protected]>
>> Cc: [email protected]; Pan, Xinhui <[email protected]>;
>> [email protected]; [email protected]
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>> On 11/18/22 09:01, Christian König wrote:
>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>> During reboot test on arm64 platform, it may failure on boot, so add
>>>>> this mb in smc.
>>>>>
>>>>> The error message are as follows:
>>>>> [ 6.996395][ 7] [ T295] [drm:amdgpu_device_ip_late_init
>>>>> [amdgpu]] *ERROR*
>>>>> late_init of IP block <si_dpm> failed -22 [
>>>>> 7.006919][ 7] [ T295] amdgpu 0000:04:00.0:

The issue is happening in late_init() which eventually does

ret = si_thermal_enable_alert(adev, false);

Just before this, si_thermal_start_thermal_controller is called in
hw_init and that enables thermal alert.

Maybe the issue is with enable/disable of thermal alerts in quick
succession. Adding a delay inside si_thermal_start_thermal_controller
might help.

Thanks,
Lijo

>>>>> amdgpu_device_ip_late_init failed [ 7.014224][ 7] [ T295] amdgpu
>>>>> 0000:04:00.0: Fatal error during GPU init
>>>> Memory barries are not supposed to be sprinkled around like this, you
>> need to give a detailed explanation why this is necessary.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>>> Signed-off-by: Zhenneng Li <[email protected]>
>>>>> ---
>>>>> drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>> 1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>> amdgpu_device *adev)
>>>>> u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>> u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>> + mb();
>>>>> +
>>>>> if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>> return true;
>>> In particular, it makes no sense in this specific place, since it cannot directly
>> affect the values of rst & clk.
>>
>> I thinks so too.
>>
>> But when I do reboot test using nine desktop machines, there maybe report
>> this error on one or two machines after Hundreds of times or Thousands of
>> times reboot test, at the beginning, I use msleep() instead of mb(), these
>> two methods are all works, but I don't know what is the root case.
>>
>> I use this method on other verdor's oland card, this error message are
>> reported again.
>>
>> What could be the root reason?
>>
>> test environmen:
>>
>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>
>> driver: amdgpu
>>
>> os: ubuntu 2004
>>
>> platform: arm64
>>
>> kernel: 5.4.18
>>
>>>

2022-11-24 11:49:51

by Lazar, Lijo

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: add mb for si



On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
>
>
> On 11/24/2022 3:34 PM, Quan, Evan wrote:
>> [AMD Official Use Only - General]
>>
>> Could the attached patch help?
>>
>> Evan
>>> -----Original Message-----
>>> From: amd-gfx <[email protected]> On Behalf Of ???
>>> Sent: Friday, November 18, 2022 5:25 PM
>>> To: Michel Dänzer <[email protected]>; Koenig, Christian
>>> <[email protected]>; Deucher, Alexander
>>> <[email protected]>
>>> Cc: [email protected]; Pan, Xinhui <[email protected]>;
>>> [email protected]; [email protected]
>>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>>
>>>
>>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>>> On 11/18/22 09:01, Christian König wrote:
>>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>>> During reboot test on arm64 platform, it may failure on boot, so add
>>>>>> this mb in smc.
>>>>>>
>>>>>> The error message are as follows:
>>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
>>>>>> [amdgpu]] *ERROR*
>>>>>>                   late_init of IP block <si_dpm> failed -22 [
>>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
>
> The issue is happening in late_init() which eventually does
>
>     ret = si_thermal_enable_alert(adev, false);
>
> Just before this, si_thermal_start_thermal_controller is called in
> hw_init and that enables thermal alert.
>
> Maybe the issue is with enable/disable of thermal alerts in quick
> succession. Adding a delay inside si_thermal_start_thermal_controller
> might help.
>

On a second look, temperature range is already set as part of
si_thermal_start_thermal_controller in hw_init
https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780

There is no need to set it again here -

https://elixir.bootlin.com/linux/v6.1-rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635

I think it is safe to remove the call from late_init altogether. Alex/Evan?

Thanks,
Lijo

> Thanks,
> Lijo
>
>>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
>>>>>> 0000:04:00.0: Fatal error during GPU init
>>>>> Memory barries are not supposed to be sprinkled around like this, you
>>> need to give a detailed explanation why this is necessary.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>>
>>>>>> Signed-off-by: Zhenneng Li <[email protected]>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>>     1 file changed, 2 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>>> amdgpu_device *adev)
>>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>>     +    mb();
>>>>>> +
>>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>>             return true;
>>>> In particular, it makes no sense in this specific place, since it
>>>> cannot directly
>>> affect the values of rst & clk.
>>>
>>> I thinks so too.
>>>
>>> But when I do reboot test using nine desktop machines,  there maybe
>>> report
>>> this error on one or two machines after Hundreds of times or
>>> Thousands of
>>> times reboot test, at the beginning, I use msleep() instead of mb(),
>>> these
>>> two methods are all works, but I don't know what is the root case.
>>>
>>> I use this method on other verdor's oland card, this error message are
>>> reported again.
>>>
>>> What could be the root reason?
>>>
>>> test environmen:
>>>
>>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>>
>>> driver: amdgpu
>>>
>>> os: ubuntu 2004
>>>
>>> platform: arm64
>>>
>>> kernel: 5.4.18
>>>
>>>>

2022-11-25 02:12:49

by Quan, Evan

[permalink] [raw]
Subject: RE: [PATCH] drm/amdgpu: add mb for si

[AMD Official Use Only - General]

Did you see that? It's a patch which I created by git-format-patch.
Anyway I will paste the changes below. I was suspecting maybe we need some waits for smu running.

diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
index 49c398ec0aaf..9f308a021b2d 100644
--- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
+++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c
@@ -6814,6 +6814,7 @@ static int si_dpm_enable(struct amdgpu_device *adev)
struct si_power_info *si_pi = si_get_pi(adev);
struct amdgpu_ps *boot_ps = adev->pm.dpm.boot_ps;
int ret;
+ int i;

if (amdgpu_si_is_smc_running(adev))
return -EINVAL;
@@ -6909,6 +6910,17 @@ static int si_dpm_enable(struct amdgpu_device *adev)
si_program_response_times(adev);
si_program_ds_registers(adev);
si_dpm_start_smc(adev);
+ /* Waiting for smc alive */
+ for (i = 0; i < adev->usec_timeout; i++) {
+ if (amdgpu_si_is_smc_running(adev))
+ break;
+ udelay(1);
+ }
+ if (i >= adev->usec_timeout) {
+ DRM_ERROR("Timedout on waiting for smu running\n");
+ return -EINVAL;
+ }
+
ret = si_notify_smc_display_change(adev, false);
if (ret) {
DRM_ERROR("si_notify_smc_display_change failed\n");


BR
Evan
> -----Original Message-----
> From: Christian König <[email protected]>
> Sent: Thursday, November 24, 2022 6:06 PM
> To: Quan, Evan <[email protected]>; 李真能 <[email protected]>;
> Michel Dänzer <[email protected]>; Koenig, Christian
> <[email protected]>; Deucher, Alexander
> <[email protected]>
> Cc: [email protected]; Pan, Xinhui <[email protected]>;
> [email protected]; [email protected]
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>
> That's not a patch but some binary file?
>
> Christian.
>
> Am 24.11.22 um 11:04 schrieb Quan, Evan:
> > [AMD Official Use Only - General]
> >
> > Could the attached patch help?
> >
> > Evan
> >> -----Original Message-----
> >> From: amd-gfx <[email protected]> On Behalf
> Of ???
> >> Sent: Friday, November 18, 2022 5:25 PM
> >> To: Michel Dänzer <[email protected]>; Koenig, Christian
> >> <[email protected]>; Deucher, Alexander
> >> <[email protected]>
> >> Cc: [email protected]; Pan, Xinhui <[email protected]>;
> >> [email protected]; [email protected]
> >> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> >>
> >>
> >> 在 2022/11/18 17:18, Michel Dänzer 写道:
> >>> On 11/18/22 09:01, Christian König wrote:
> >>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> >>>>> During reboot test on arm64 platform, it may failure on boot, so
> >>>>> add this mb in smc.
> >>>>>
> >>>>> The error message are as follows:
> >>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
> >>>>> [amdgpu]] *ERROR*
> >>>>>                 late_init of IP block <si_dpm> failed -22 [
> >>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> >>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295]
> >>>>> amdgpu
> >>>>> 0000:04:00.0: Fatal error during GPU init
> >>>> Memory barries are not supposed to be sprinkled around like this,
> >>>> you
> >> need to give a detailed explanation why this is necessary.
> >>>> Regards,
> >>>> Christian.
> >>>>
> >>>>> Signed-off-by: Zhenneng Li <[email protected]>
> >>>>> ---
> >>>>>   drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>>>   1 file changed, 2 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> index 8f994ffa9cd1..c7656f22278d 100644
> >>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>>>> amdgpu_device *adev)
> >>>>>       u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>>>       u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>>>   +    mb();
> >>>>> +
> >>>>>       if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>>>           return true;
> >>> In particular, it makes no sense in this specific place, since it
> >>> cannot directly
> >> affect the values of rst & clk.
> >>
> >> I thinks so too.
> >>
> >> But when I do reboot test using nine desktop machines,  there maybe
> >> report this error on one or two machines after Hundreds of times or
> >> Thousands of times reboot test, at the beginning, I use msleep()
> >> instead of mb(), these two methods are all works, but I don't know what
> is the root case.
> >>
> >> I use this method on other verdor's oland card, this error message
> >> are reported again.
> >>
> >> What could be the root reason?
> >>
> >> test environmen:
> >>
> >> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
> >>
> >> driver: amdgpu
> >>
> >> os: ubuntu 2004
> >>
> >> platform: arm64
> >>
> >> kernel: 5.4.18
> >>


Attachments:
winmail.dat (17.79 kB)

2022-11-25 02:41:52

by Quan, Evan

[permalink] [raw]
Subject: RE: [PATCH] drm/amdgpu: add mb for si

[AMD Official Use Only - General]



> -----Original Message-----
> From: Lazar, Lijo <[email protected]>
> Sent: Thursday, November 24, 2022 6:49 PM
> To: Quan, Evan <[email protected]>; 李真能 <[email protected]>;
> Michel Dänzer <[email protected]>; Koenig, Christian
> <[email protected]>; Deucher, Alexander
> <[email protected]>
> Cc: [email protected]; Pan, Xinhui <[email protected]>;
> [email protected]; [email protected]
> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>
>
>
> On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
> >
> >
> > On 11/24/2022 3:34 PM, Quan, Evan wrote:
> >> [AMD Official Use Only - General]
> >>
> >> Could the attached patch help?
> >>
> >> Evan
> >>> -----Original Message-----
> >>> From: amd-gfx <[email protected]> On Behalf
> Of ???
> >>> Sent: Friday, November 18, 2022 5:25 PM
> >>> To: Michel Dänzer <[email protected]>; Koenig, Christian
> >>> <[email protected]>; Deucher, Alexander
> >>> <[email protected]>
> >>> Cc: [email protected]; Pan, Xinhui <[email protected]>;
> >>> [email protected]; [email protected]
> >>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
> >>>
> >>>
> >>> 在 2022/11/18 17:18, Michel Dänzer 写道:
> >>>> On 11/18/22 09:01, Christian König wrote:
> >>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
> >>>>>> During reboot test on arm64 platform, it may failure on boot, so
> >>>>>> add this mb in smc.
> >>>>>>
> >>>>>> The error message are as follows:
> >>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
> >>>>>> [amdgpu]] *ERROR*
> >>>>>>                   late_init of IP block <si_dpm> failed -22 [
> >>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
> >
> > The issue is happening in late_init() which eventually does
> >
> >     ret = si_thermal_enable_alert(adev, false);
> >
> > Just before this, si_thermal_start_thermal_controller is called in
> > hw_init and that enables thermal alert.
> >
> > Maybe the issue is with enable/disable of thermal alerts in quick
> > succession. Adding a delay inside si_thermal_start_thermal_controller
> > might help.
> >
>
> On a second look, temperature range is already set as part of
> si_thermal_start_thermal_controller in hw_init
> https://elixir.bootlin.com/linux/v6.1-
> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780
>
> There is no need to set it again here -
>
> https://elixir.bootlin.com/linux/v6.1-
> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635
>
> I think it is safe to remove the call from late_init altogether. Alex/Evan?
>
[Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that’s related with the issue here.
Since per my understandings, if the issue is caused by double calling of thermal_alert enablement, it will fail every time.
That cannot explain why adding some delays or a mb() calling can help.

BR
Evan
> Thanks,
> Lijo
>
> > Thanks,
> > Lijo
> >
> >>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
> >>>>>> 0000:04:00.0: Fatal error during GPU init
> >>>>> Memory barries are not supposed to be sprinkled around like this,
> you
> >>> need to give a detailed explanation why this is necessary.
> >>>>>
> >>>>> Regards,
> >>>>> Christian.
> >>>>>
> >>>>>> Signed-off-by: Zhenneng Li <[email protected]>
> >>>>>> ---
> >>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
> >>>>>>     1 file changed, 2 insertions(+)
> >>>>>>
> >>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> index 8f994ffa9cd1..c7656f22278d 100644
> >>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
> >>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
> >>>>>> amdgpu_device *adev)
> >>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
> >>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
> >>>>>>     +    mb();
> >>>>>> +
> >>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
> >>>>>>             return true;
> >>>> In particular, it makes no sense in this specific place, since it
> >>>> cannot directly
> >>> affect the values of rst & clk.
> >>>
> >>> I thinks so too.
> >>>
> >>> But when I do reboot test using nine desktop machines,  there maybe
> >>> report
> >>> this error on one or two machines after Hundreds of times or
> >>> Thousands of
> >>> times reboot test, at the beginning, I use msleep() instead of mb(),
> >>> these
> >>> two methods are all works, but I don't know what is the root case.
> >>>
> >>> I use this method on other verdor's oland card, this error message are
> >>> reported again.
> >>>
> >>> What could be the root reason?
> >>>
> >>> test environmen:
> >>>
> >>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
> >>>
> >>> driver: amdgpu
> >>>
> >>> os: ubuntu 2004
> >>>
> >>> platform: arm64
> >>>
> >>> kernel: 5.4.18
> >>>
> >>>>


Attachments:
winmail.dat (17.99 kB)

2022-11-25 06:55:49

by Lazar, Lijo

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: add mb for si


On 11/25/2022 7:43 AM, Quan, Evan wrote:
> [AMD Official Use Only - General]
>
>
>
>> -----Original Message-----
>> From: Lazar, Lijo <[email protected]>
>> Sent: Thursday, November 24, 2022 6:49 PM
>> To: Quan, Evan <[email protected]>; 李真能 <[email protected]>;
>> Michel Dänzer <[email protected]>; Koenig, Christian
>> <[email protected]>; Deucher, Alexander
>> <[email protected]>
>> Cc: [email protected]; Pan, Xinhui <[email protected]>;
>> [email protected]; [email protected]
>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>
>>
>>
>> On 11/24/2022 4:11 PM, Lazar, Lijo wrote:
>>>
>>> On 11/24/2022 3:34 PM, Quan, Evan wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Could the attached patch help?
>>>>
>>>> Evan
>>>>> -----Original Message-----
>>>>> From: amd-gfx <[email protected]> On Behalf
>> Of ???
>>>>> Sent: Friday, November 18, 2022 5:25 PM
>>>>> To: Michel Dänzer <[email protected]>; Koenig, Christian
>>>>> <[email protected]>; Deucher, Alexander
>>>>> <[email protected]>
>>>>> Cc: [email protected]; Pan, Xinhui <[email protected]>;
>>>>> [email protected]; [email protected]
>>>>> Subject: Re: [PATCH] drm/amdgpu: add mb for si
>>>>>
>>>>>
>>>>> 在 2022/11/18 17:18, Michel Dänzer 写道:
>>>>>> On 11/18/22 09:01, Christian König wrote:
>>>>>>> Am 18.11.22 um 08:48 schrieb Zhenneng Li:
>>>>>>>> During reboot test on arm64 platform, it may failure on boot, so
>>>>>>>> add this mb in smc.
>>>>>>>>
>>>>>>>> The error message are as follows:
>>>>>>>> [    6.996395][ 7] [  T295] [drm:amdgpu_device_ip_late_init
>>>>>>>> [amdgpu]] *ERROR*
>>>>>>>>                   late_init of IP block <si_dpm> failed -22 [
>>>>>>>> 7.006919][ 7] [  T295] amdgpu 0000:04:00.0:
>>> The issue is happening in late_init() which eventually does
>>>
>>>     ret = si_thermal_enable_alert(adev, false);
>>>
>>> Just before this, si_thermal_start_thermal_controller is called in
>>> hw_init and that enables thermal alert.
>>>
>>> Maybe the issue is with enable/disable of thermal alerts in quick
>>> succession. Adding a delay inside si_thermal_start_thermal_controller
>>> might help.
>>>
>> On a second look, temperature range is already set as part of
>> si_thermal_start_thermal_controller in hw_init
>> https://elixir.bootlin.com/linux/v6.1-
>> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L6780
>>
>> There is no need to set it again here -
>>
>> https://elixir.bootlin.com/linux/v6.1-
>> rc6/source/drivers/gpu/drm/amd/pm/legacy-dpm/si_dpm.c#L7635
>>
>> I think it is safe to remove the call from late_init altogether. Alex/Evan?
>>
> [Quan, Evan] Yes, it makes sense to me. But I'm not sure whether that’s related with the issue here.
> Since per my understandings, if the issue is caused by double calling of thermal_alert enablement, it will fail every time.
> That cannot explain why adding some delays or a mb() calling can help.

The side effect of the patch is just some random delay introduced for
every SMC message

The issue happens in late_init(). Between late_init() and dpm
enablement, there are many smc messages sent which don't have this
issue. So I think the issue is not with FW not running.

Thus the only case I see is enable/disable of thermal alert in random
succession.

Thanks,

Lijo

> BR
> Evan
>> Thanks,
>> Lijo
>>
>>> Thanks,
>>> Lijo
>>>
>>>>>>>> amdgpu_device_ip_late_init failed [    7.014224][ 7] [  T295] amdgpu
>>>>>>>> 0000:04:00.0: Fatal error during GPU init
>>>>>>> Memory barries are not supposed to be sprinkled around like this,
>> you
>>>>> need to give a detailed explanation why this is necessary.
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>> Signed-off-by: Zhenneng Li <[email protected]>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c | 2 ++
>>>>>>>>     1 file changed, 2 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> index 8f994ffa9cd1..c7656f22278d 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/pm/legacy-dpm/si_smc.c
>>>>>>>> @@ -155,6 +155,8 @@ bool amdgpu_si_is_smc_running(struct
>>>>>>>> amdgpu_device *adev)
>>>>>>>>         u32 rst = RREG32_SMC(SMC_SYSCON_RESET_CNTL);
>>>>>>>>         u32 clk = RREG32_SMC(SMC_SYSCON_CLOCK_CNTL_0);
>>>>>>>>     +    mb();
>>>>>>>> +
>>>>>>>>         if (!(rst & RST_REG) && !(clk & CK_DISABLE))
>>>>>>>>             return true;
>>>>>> In particular, it makes no sense in this specific place, since it
>>>>>> cannot directly
>>>>> affect the values of rst & clk.
>>>>>
>>>>> I thinks so too.
>>>>>
>>>>> But when I do reboot test using nine desktop machines,  there maybe
>>>>> report
>>>>> this error on one or two machines after Hundreds of times or
>>>>> Thousands of
>>>>> times reboot test, at the beginning, I use msleep() instead of mb(),
>>>>> these
>>>>> two methods are all works, but I don't know what is the root case.
>>>>>
>>>>> I use this method on other verdor's oland card, this error message are
>>>>> reported again.
>>>>>
>>>>> What could be the root reason?
>>>>>
>>>>> test environmen:
>>>>>
>>>>> graphics card: OLAND 0x1002:0x6611 0x1642:0x1869 0x87
>>>>>
>>>>> driver: amdgpu
>>>>>
>>>>> os: ubuntu 2004
>>>>>
>>>>> platform: arm64
>>>>>
>>>>> kernel: 5.4.18
>>>>>