2024-01-16 11:59:52

by Daniel Lezcano

Subject: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

Running the Antutu benchmark causes the board to hard reboot.

Cause: the gpu-bottom and gpu-top temperature sensors appear to report
excessively high temperatures, above 115°C.

Out-of-tree configurations show the GPU thermal zone is configured to
be mitigated at 85°C with devfreq.

Add the DT snippet to enable thermal mitigation on sdm845-based
boards.

Fixes: c79800103eb18 ("arm64: dts: sdm845: Add gpu and gmu device nodes")
Cc: Amit Pundir <[email protected]>
Signed-off-by: Daniel Lezcano <[email protected]>
---
arch/arm64/boot/dts/qcom/sdm845.dtsi | 32 ++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
index c2244824355a..20fefd6af0f8 100644
--- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
+++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
@@ -4764,6 +4764,8 @@ gpu: gpu@5000000 {
interconnects = <&mem_noc MASTER_GFX3D 0 &mem_noc SLAVE_EBI1 0>;
interconnect-names = "gfx-mem";

+ #cooling-cells = <2>;
+
status = "disabled";

gpu_opp_table: opp-table {
@@ -5603,12 +5605,25 @@ gpu-top-thermal {
thermal-sensors = <&tsens0 11>;

trips {
- gpu1_alert0: trip-point0 {
+ gpu1_alert0: trip-point0 {
+ temperature = <85000>;
+ hysteresis = <2000>;
+ type = "passive";
+ };
+
+ gpu1_alert1: trip-point1 {
temperature = <90000>;
hysteresis = <2000>;
type = "hot";
};
};
+
+ cooling-maps {
+ map0 {
+ trip = <&gpu1_alert0>;
+ cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
+ };
+ };
};

gpu-bottom-thermal {
@@ -5618,12 +5633,25 @@ gpu-bottom-thermal {
thermal-sensors = <&tsens0 12>;

trips {
- gpu2_alert0: trip-point0 {
+ gpu2_alert0: trip-point0 {
+ temperature = <85000>;
+ hysteresis = <2000>;
+ type = "passive";
+ };
+
+ gpu2_alert1: trip-point1 {
temperature = <90000>;
hysteresis = <2000>;
type = "hot";
};
};
+
+ cooling-maps {
+ map0 {
+ trip = <&gpu2_alert0>;
+ cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
+ };
+ };
};

aoss1-thermal {
--
2.34.1



2024-01-16 12:19:30

by Amit Pundir

Subject: Re: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

On Tue, 16 Jan 2024 at 17:29, Daniel Lezcano <[email protected]> wrote:
>
> Running the Antutu benchmark causes the board to hard reboot.
>
> Cause: the gpu-bottom and gpu-top temperature sensors appear to report
> excessively high temperatures, above 115°C.
>
> Out-of-tree configurations show the GPU thermal zone is configured to
> be mitigated at 85°C with devfreq.
>
> Add the DT snippet to enable thermal mitigation on sdm845-based
> boards.

Smoke tested on Dragonboard 845c running AOSP with a custom/generic thermal HAL.

Tested-by: Amit Pundir <[email protected]>

PS: It would probably be good to mention the OS (AOSP) and board
(DB845c) on which the Antutu crash was reported in the commit message
as well.

>
> Fixes: c79800103eb18 ("arm64: dts: sdm845: Add gpu and gmu device nodes")
> Cc: Amit Pundir <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>
> ---
> arch/arm64/boot/dts/qcom/sdm845.dtsi | 32 ++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> index c2244824355a..20fefd6af0f8 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> @@ -4764,6 +4764,8 @@ gpu: gpu@5000000 {
> interconnects = <&mem_noc MASTER_GFX3D 0 &mem_noc SLAVE_EBI1 0>;
> interconnect-names = "gfx-mem";
>
> + #cooling-cells = <2>;
> +
> status = "disabled";
>
> gpu_opp_table: opp-table {
> @@ -5603,12 +5605,25 @@ gpu-top-thermal {
> thermal-sensors = <&tsens0 11>;
>
> trips {
> - gpu1_alert0: trip-point0 {
> + gpu1_alert0: trip-point0 {
> + temperature = <85000>;
> + hysteresis = <2000>;
> + type = "passive";
> + };
> +
> + gpu1_alert1: trip-point1 {
> temperature = <90000>;
> hysteresis = <2000>;
> type = "hot";
> };
> };
> +
> + cooling-maps {
> + map0 {
> + trip = <&gpu1_alert0>;
> + cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> + };
> + };
> };
>
> gpu-bottom-thermal {
> @@ -5618,12 +5633,25 @@ gpu-bottom-thermal {
> thermal-sensors = <&tsens0 12>;
>
> trips {
> - gpu2_alert0: trip-point0 {
> + gpu2_alert0: trip-point0 {
> + temperature = <85000>;
> + hysteresis = <2000>;
> + type = "passive";
> + };
> +
> + gpu2_alert1: trip-point1 {
> temperature = <90000>;
> hysteresis = <2000>;
> type = "hot";
> };
> };
> +
> + cooling-maps {
> + map0 {
> + trip = <&gpu2_alert0>;
> + cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> + };
> + };
> };
>
> aoss1-thermal {
> --
> 2.34.1
>

2024-01-16 12:37:42

by Luca Weiss

Subject: Re: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

On Tue Jan 16, 2024 at 12:59 PM CET, Daniel Lezcano wrote:
> Running the Antutu benchmark causes the board to hard reboot.
>
> Cause: the gpu-bottom and gpu-top temperature sensors appear to report
> excessively high temperatures, above 115°C.
>
> Out-of-tree configurations show the GPU thermal zone is configured to
> be mitigated at 85°C with devfreq.
>
> Add the DT snippet to enable thermal mitigation on sdm845-based
> boards.
>
> Fixes: c79800103eb18 ("arm64: dts: sdm845: Add gpu and gmu device nodes")
> Cc: Amit Pundir <[email protected]>
> Signed-off-by: Daniel Lezcano <[email protected]>

A part of this is already included with this patch:
https://lore.kernel.org/linux-arm-msm/[email protected]/

Maybe rebase on top of that one and add the 85degC trip point or
something?

> ---
> arch/arm64/boot/dts/qcom/sdm845.dtsi | 32 ++++++++++++++++++++++++++--
> 1 file changed, 30 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sdm845.dtsi b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> index c2244824355a..20fefd6af0f8 100644
> --- a/arch/arm64/boot/dts/qcom/sdm845.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sdm845.dtsi
> @@ -4764,6 +4764,8 @@ gpu: gpu@5000000 {
> interconnects = <&mem_noc MASTER_GFX3D 0 &mem_noc SLAVE_EBI1 0>;
> interconnect-names = "gfx-mem";
>
> + #cooling-cells = <2>;
> +
> status = "disabled";
>
> gpu_opp_table: opp-table {
> @@ -5603,12 +5605,25 @@ gpu-top-thermal {
> thermal-sensors = <&tsens0 11>;
>
> trips {
> - gpu1_alert0: trip-point0 {
> + gpu1_alert0: trip-point0 {
> + temperature = <85000>;
> + hysteresis = <2000>;
> + type = "passive";
> + };
> +

The indentation here should use tabs not spaces.

> + gpu1_alert1: trip-point1 {
> temperature = <90000>;
> hysteresis = <2000>;
> type = "hot";
> };
> };
> +
> + cooling-maps {
> + map0 {
> + trip = <&gpu1_alert0>;
> + cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> + };
> + };
> };
>
> gpu-bottom-thermal {
> @@ -5618,12 +5633,25 @@ gpu-bottom-thermal {
> thermal-sensors = <&tsens0 12>;
>
> trips {
> - gpu2_alert0: trip-point0 {
> + gpu2_alert0: trip-point0 {
> + temperature = <85000>;
> + hysteresis = <2000>;
> + type = "passive";
> + };
> +
> + gpu2_alert1: trip-point1 {
> temperature = <90000>;
> hysteresis = <2000>;
> type = "hot";
> };
> };
> +
> + cooling-maps {
> + map0 {
> + trip = <&gpu2_alert0>;
> + cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
> + };
> + };
> };
>
> aoss1-thermal {


2024-01-16 12:51:51

by Daniel Lezcano

Subject: Re: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

On 16/01/2024 13:37, Luca Weiss wrote:
> On Tue Jan 16, 2024 at 12:59 PM CET, Daniel Lezcano wrote:
>> Running the Antutu benchmark causes the board to hard reboot.
>>
>> Cause: the gpu-bottom and gpu-top temperature sensors appear to report
>> excessively high temperatures, above 115°C.
>>
>> Out-of-tree configurations show the GPU thermal zone is configured to
>> be mitigated at 85°C with devfreq.
>>
>> Add the DT snippet to enable thermal mitigation on sdm845-based
>> boards.
>>
>> Fixes: c79800103eb18 ("arm64: dts: sdm845: Add gpu and gmu device nodes")
>> Cc: Amit Pundir <[email protected]>
>> Signed-off-by: Daniel Lezcano <[email protected]>
>
> A part of this is already included with this patch:
> https://lore.kernel.org/linux-arm-msm/[email protected]/
>
> Maybe rebase on top of that one and add the 85degC trip point or
> something?

Actually, I think the patch is wrong.

Cooling devices are not driven by a 'hot' trip point, as that type is
treated as a critical trip point. The governor is not invoked, so no
mitigation happens. A 'hot' trip point only results in a notification
to userspace, giving it a last chance to do something before 'critical'
is reached and the system is shut down.

I suggest reverting it and picking the one I proposed.
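
To illustrate the distinction, here is a minimal sketch of a thermal zone
in which mitigation can actually take place. It assumes the sensor index,
labels and trip temperatures from the patch under discussion; the zone's
other properties (polling delays and so on) are left out, so this is a
fragment rather than a complete node:

```dts
#include <dt-bindings/thermal/thermal.h>	/* provides THERMAL_NO_LIMIT */

/*
 * Fragment only: assumes a node labelled "gpu" that declares
 * #cooling-cells = <2>, as added by the patch.
 */
gpu-top-thermal {
	thermal-sensors = <&tsens0 11>;

	trips {
		/* "passive": the governor runs and throttles the bound cooling device. */
		gpu1_alert0: trip-point0 {
			temperature = <85000>;	/* millidegrees Celsius, i.e. 85°C */
			hysteresis = <2000>;
			type = "passive";
		};

		/* "hot": notification only, nothing is throttled from this trip. */
		gpu1_alert1: trip-point1 {
			temperature = <90000>;
			hysteresis = <2000>;
			type = "hot";
		};
	};

	cooling-maps {
		map0 {
			/* Bind the GPU cooling device to the passive trip, not the hot one. */
			trip = <&gpu1_alert0>;
			cooling-device = <&gpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
		};
	};
};
```

The point of the sketch is the pairing: the cooling map references the
'passive' trip, so the governor has something to act on before the 'hot'
notification at 90°C is ever reached.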

--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2024-01-16 14:03:47

by Luca Weiss

Subject: Re: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

On Tue Jan 16, 2024 at 1:51 PM CET, Daniel Lezcano wrote:
> On 16/01/2024 13:37, Luca Weiss wrote:
> > On Tue Jan 16, 2024 at 12:59 PM CET, Daniel Lezcano wrote:
> >> Running the Antutu benchmark causes the board to hard reboot.
> >>
> >> Cause: the gpu-bottom and gpu-top temperature sensors appear to report
> >> excessively high temperatures, above 115°C.
> >>
> >> Out-of-tree configurations show the GPU thermal zone is configured to
> >> be mitigated at 85°C with devfreq.
> >>
> >> Add the DT snippet to enable thermal mitigation on sdm845-based
> >> boards.
> >>
> >> Fixes: c79800103eb18 ("arm64: dts: sdm845: Add gpu and gmu device nodes")
> >> Cc: Amit Pundir <[email protected]>
> >> Signed-off-by: Daniel Lezcano <[email protected]>
> >
> > A part of this is already included with this patch:
> > https://lore.kernel.org/linux-arm-msm/[email protected]/
> >
> > Maybe rebase on top of that one and add the 85degC trip point or
> > something?
>
> Actually, I think the patch is wrong.

I recommend telling Konrad in that patch then, not me :)

>
> Cooling devices are not driven by a 'hot' trip point, as that type is
> treated as a critical trip point. The governor is not invoked, so no
> mitigation happens. A 'hot' trip point only results in a notification
> to userspace, giving it a last chance to do something before 'critical'
> is reached and the system is shut down.
>
> I suggest reverting it and picking the one I proposed.

It hasn't been applied yet so it can be fixed in v2 there.

Regards
Luca

2024-01-16 15:38:49

by Daniel Lezcano

Subject: Re: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

On 16/01/2024 15:03, Luca Weiss wrote:
> On Tue Jan 16, 2024 at 1:51 PM CET, Daniel Lezcano wrote:
>> On 16/01/2024 13:37, Luca Weiss wrote:
>>> On Tue Jan 16, 2024 at 12:59 PM CET, Daniel Lezcano wrote:
>>>> Running the Antutu benchmark causes the board to hard reboot.
>>>>
>>>> Cause: the gpu-bottom and gpu-top temperature sensors appear to report
>>>> excessively high temperatures, above 115°C.
>>>>
>>>> Out-of-tree configurations show the GPU thermal zone is configured to
>>>> be mitigated at 85°C with devfreq.
>>>>
>>>> Add the DT snippet to enable thermal mitigation on sdm845-based
>>>> boards.
>>>>
>>>> Fixes: c79800103eb18 ("arm64: dts: sdm845: Add gpu and gmu device nodes")
>>>> Cc: Amit Pundir <[email protected]>
>>>> Signed-off-by: Daniel Lezcano <[email protected]>
>>>
>>> A part of this is already included with this patch:
>>> https://lore.kernel.org/linux-arm-msm/[email protected]/
>>>
>>> Maybe rebase on top of that one and add the 85degC trip point or
>>> something?
>>
>> Actually, I think the patch is wrong.
>
> I recommend telling Konrad in that patch then, not me :)

It's good that Konrad is in the recipient list :)

>> Cooling devices are not driven by a 'hot' trip point, as that type is
>> treated as a critical trip point. The governor is not invoked, so no
>> mitigation happens. A 'hot' trip point only results in a notification
>> to userspace, giving it a last chance to do something before 'critical'
>> is reached and the system is shut down.
>>
>> I suggest reverting it and picking the one I proposed.
>
> It hasn't been applied yet so it can be fixed in v2 there.

The patch was submitted without testing AFAICT. So it is preferable to
pick the one I sent which was tested by Amit and me.





2024-01-16 18:18:56

by Steev Klimaszewski

Subject: Re: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

Hi

Shouldn't the patch subject line be changed? When reading the git log,
"fix wild reboot during antutu test" doesn't tell me much; I would
think something like "Enable thermal mitigation for sdm845 gpu" might
be better for someone reading through the logs later.

--steev

2024-01-28 18:00:38

by Bjorn Andersson

Subject: Re: Re: [PATCH] arm64: dts: qcom: sdm845: Fix wild reboot during Antutu test

On Tue, Jan 16, 2024 at 04:38:33PM +0100, Daniel Lezcano wrote:
> On 16/01/2024 15:03, Luca Weiss wrote:
> > On Tue Jan 16, 2024 at 1:51 PM CET, Daniel Lezcano wrote:
> > > On 16/01/2024 13:37, Luca Weiss wrote:
> > > > On Tue Jan 16, 2024 at 12:59 PM CET, Daniel Lezcano wrote:
> > > > > Running the Antutu benchmark causes the board to hard reboot.
> > > > >
> > > > > Cause: the gpu-bottom and gpu-top temperature sensors appear to report
> > > > > excessively high temperatures, above 115°C.
> > > > >
> > > > > Out-of-tree configurations show the GPU thermal zone is configured to
> > > > > be mitigated at 85°C with devfreq.
> > > > >
> > > > > Add the DT snippet to enable thermal mitigation on sdm845-based
> > > > > boards.
> > > > >
> > > > > Fixes: c79800103eb18 ("arm64: dts: sdm845: Add gpu and gmu device nodes")
> > > > > Cc: Amit Pundir <[email protected]>
> > > > > Signed-off-by: Daniel Lezcano <[email protected]>
> > > >
> > > > A part of this is already included with this patch:
> > > > https://lore.kernel.org/linux-arm-msm/[email protected]/
> > > >
> > > > Maybe rebase on top of that one and add the 85degC trip point or
> > > > something?
> > >
> > > Actually, I think the patch is wrong.
> >
> > I recommend telling Konrad in that patch then, not me :)
>
> It's good that Konrad is in the recipient list :)
>
> > > Cooling devices are not driven by a 'hot' trip point, as that type is
> > > treated as a critical trip point. The governor is not invoked, so no
> > > mitigation happens. A 'hot' trip point only results in a notification
> > > to userspace, giving it a last chance to do something before 'critical'
> > > is reached and the system is shut down.
> > >
> > > I suggest reverting it and picking the one I proposed.
> >
> > It hasn't been applied yet so it can be fixed in v2 there.
>
> The patch was submitted without testing AFAICT. So it is preferable to pick
> the one I sent which was tested by Amit and me.
>

I would have loved to have that feedback in the thread of the patch
that is wrong!

Due to my lack of understanding of this detail, and with only positive
reviews, I merged said series. Please fix your patch and rebase it on
top of linux-next.

Thanks,
Bjorn