2020-10-29 10:07:06

by Peter Ujfalusi

Subject: [PATCH] thermal: ti-soc-thermal: Disable the CPU PM notifier for OMAP4430

It has been observed that on OMAP4430 (ES2.0, ES2.1 and ES2.3) the enabled
CPU PM notifier causes errors in the DTEMP readout values:

ti-soc-thermal 4a002260.bandgap: in range ADC val: 52
ti-soc-thermal 4a002260.bandgap: in range ADC val: 64
ti-soc-thermal 4a002260.bandgap: in range ADC val: 64
ti-soc-thermal 4a002260.bandgap: out of range ADC val: 0
thermal thermal_zone0: failed to read out thermal zone (-5)
ti-soc-thermal 4a002260.bandgap: out of range ADC val: 0
thermal thermal_zone0: failed to read out thermal zone (-5)
ti-soc-thermal 4a002260.bandgap: out of range ADC val: 4
thermal thermal_zone0: failed to read out thermal zone (-5)
ti-soc-thermal 4a002260.bandgap: in range ADC val: 100

A raw value of 100 translates to 133 degrees Celsius on omap4-sdp, which
triggers a shutdown due to critical temperature.

When the notifier is disabled for OMAP4430, the DTEMP values are stable:
ti-soc-thermal 4a002260.bandgap: in range ADC val: 56
ti-soc-thermal 4a002260.bandgap: in range ADC val: 56
ti-soc-thermal 4a002260.bandgap: in range ADC val: 57
ti-soc-thermal 4a002260.bandgap: in range ADC val: 57
ti-soc-thermal 4a002260.bandgap: in range ADC val: 56

Fixes: 5093402e5b44 ("thermal: ti-soc-thermal: Enable addition power management")
Signed-off-by: Peter Ujfalusi <[email protected]>
---
Hi,

my omap4-sdp (Blaze) was shutting down randomly due to critical temperature with
5.10-rc1 and I have bisected it back to 5093402e5b44.

Disabling the notifier fixes the random shutdowns on OMAP4430 (ES2.0 and ES2.1).
The notifier does not cause any issues on OMAP4460 (PandaES) or OMAP3630
(BeagleXM), so it is left enabled there.
Tony's duovero with OMAP4430 ES2.3 did not ninja-shutdown, but he also has a
constant and steady stream of:
thermal thermal_zone0: failed to read out thermal zone (-5)

pointing to a similar issue.

Regards,
Peter

drivers/thermal/ti-soc-thermal/ti-bandgap.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
index 5e596168ba73..dcac99f327b0 100644
--- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c
+++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
@@ -20,6 +20,7 @@
#include <linux/err.h>
#include <linux/types.h>
#include <linux/spinlock.h>
+#include <linux/sys_soc.h>
#include <linux/reboot.h>
#include <linux/of_device.h>
#include <linux/of_platform.h>
@@ -864,6 +865,17 @@ static struct ti_bandgap *ti_bandgap_build(struct platform_device *pdev)
return bgp;
}

+/*
+ * List of SoCs on which the CPU PM notifier can cause errors on the DTEMP
+ * readout.
+ * Enabling the notifier on these machines results in erroneous, random values
+ * which could trigger an unexpected thermal shutdown.
+ */
+static const struct soc_device_attribute soc_no_cpu_notifier[] = {
+ { .machine = "OMAP4430" },
+ { /* sentinel */ },
+};
+
/*** Device driver call backs ***/

static
@@ -1020,7 +1032,8 @@ int ti_bandgap_probe(struct platform_device *pdev)

#ifdef CONFIG_PM_SLEEP
bgp->nb.notifier_call = bandgap_omap_cpu_notifier;
- cpu_pm_register_notifier(&bgp->nb);
+ if (!soc_device_match(soc_no_cpu_notifier))
+ cpu_pm_register_notifier(&bgp->nb);
#endif

return 0;
@@ -1056,7 +1069,8 @@ int ti_bandgap_remove(struct platform_device *pdev)
struct ti_bandgap *bgp = platform_get_drvdata(pdev);
int i;

- cpu_pm_unregister_notifier(&bgp->nb);
+ if (!soc_device_match(soc_no_cpu_notifier))
+ cpu_pm_unregister_notifier(&bgp->nb);

/* Remove sensor interfaces */
for (i = 0; i < bgp->conf->sensor_count; i++) {
--
Peter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki
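
Note on the mechanism: the patch gates both the registration and the removal
of the CPU PM notifier on a soc_device_match() lookup against a
sentinel-terminated table. The following is a minimal userspace sketch of that
pattern, not the driver code. struct soc_attr, soc_match() and
wants_cpu_pm_notifier() are hypothetical stand-ins introduced for
illustration, and the lookup here is a plain string compare, whereas the
kernel's soc_device_match() does glob matching and can also compare family,
revision and soc_id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the kernel's
 * struct soc_device_attribute; only the field used by the patch is modeled.
 */
struct soc_attr {
	const char *machine;
};

/* Mirrors soc_no_cpu_notifier[] from the patch: a NULL .machine ends the list. */
static const struct soc_attr soc_no_cpu_notifier[] = {
	{ .machine = "OMAP4430" },
	{ .machine = NULL }, /* sentinel */
};

/*
 * Simplified lookup (assumption: exact string compare is enough for the
 * illustration). Returns the matching entry, or NULL if none matches.
 */
static const struct soc_attr *soc_match(const struct soc_attr *table,
					const char *machine)
{
	assert(table != NULL && machine != NULL);

	for (; table->machine != NULL; table++) {
		if (strcmp(table->machine, machine) == 0)
			return table;
	}
	return NULL;
}

/*
 * The same decision has to be taken in probe and in remove: the notifier is
 * only registered (and later unregistered) when the SoC is NOT on the list.
 */
static bool wants_cpu_pm_notifier(const char *machine)
{
	return soc_match(soc_no_cpu_notifier, machine) == NULL;
}
```

With this table, wants_cpu_pm_notifier("OMAP4430") is false and any machine
string that is not listed yields true. Using one table for both paths keeps
register and unregister symmetric, which is why the patch repeats the same
check in ti_bandgap_remove().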


2020-10-29 10:53:26

by Tony Lindgren

Subject: Re: [PATCH] thermal: ti-soc-thermal: Disable the CPU PM notifier for OMAP4430

* Peter Ujfalusi <[email protected]> [201029 10:03]:
> Disabling the notifier fixes the random shutdowns on OMAP4430 (ES2.0 and ES2.1)
> but it does not cause any issues on OMAP4460 (PandaES) or OMAP3630 (BeagleXM).
> Tony's duovero with OMAP4430 ES2.3 did not ninja-shutdown, but he also has a
> constant and steady stream of:
> thermal thermal_zone0: failed to read out thermal zone (-5)

Works for me and I've verified duovero still keeps hitting core ret idle:

Tested-by: Tony Lindgren <[email protected]>

Regards,

Tony

2020-11-03 06:43:27

by Peter Ujfalusi

Subject: Re: [PATCH] thermal: ti-soc-thermal: Disable the CPU PM notifier for OMAP4430

Eduardo, Keerthy,

On 29/10/2020 12.51, Tony Lindgren wrote:
> * Peter Ujfalusi <[email protected]> [201029 10:03]:
>> Disabling the notifier fixes the random shutdowns on OMAP4430 (ES2.0 and ES2.1)
>> but it does not cause any issues on OMAP4460 (PandaES) or OMAP3630 (BeagleXM).
>> Tony's duovero with OMAP4430 ES2.3 did not ninja-shutdown, but he also has a
>> constant and steady stream of:
>> thermal thermal_zone0: failed to read out thermal zone (-5)
>
> Works for me and I've verified duovero still keeps hitting core ret idle:

Can you pick this one up for 5.10, so that omap4430-sdp becomes usable again
(it should not shut down randomly)?
The regression was introduced in 5.10-rc1.

> Tested-by: Tony Lindgren <[email protected]>
>
> Regards,
>
> Tony
>

- Péter


2020-11-03 06:55:06

by Keerthy

Subject: Re: [PATCH] thermal: ti-soc-thermal: Disable the CPU PM notifier for OMAP4430



On 11/3/2020 12:12 PM, Peter Ujfalusi wrote:
> Eduardo, Keerthy,
>
> On 29/10/2020 12.51, Tony Lindgren wrote:
>> * Peter Ujfalusi <[email protected]> [201029 10:03]:
>>> Disabling the notifier fixes the random shutdowns on OMAP4430 (ES2.0 and ES2.1)
>>> but it does not cause any issues on OMAP4460 (PandaES) or OMAP3630 (BeagleXM).
>>> Tony's duovero with OMAP4430 ES2.3 did not ninja-shutdown, but he also has a
>>> constant and steady stream of:
>>> thermal thermal_zone0: failed to read out thermal zone (-5)
>>
>> Works for me and I've verified duovero still keeps hitting core ret idle:
>
> Can you pick this one up for 5.10 to make omap4430-sdp to be usable (to
> not shut down randomly).
> The regression was introduced in 5.10-rc1.

Peter,

Thanks for the fix.

Acked-by: Keerthy <[email protected]>

Best Regards,
Keerthy

>
>> Tested-by: Tony Lindgren <[email protected]>
>>
>> Regards,
>>
>> Tony
>>
>
> - Péter
>
>

2020-11-12 11:34:14

by Daniel Lezcano

Subject: Re: [PATCH] thermal: ti-soc-thermal: Disable the CPU PM notifier for OMAP4430

On 03/11/2020 07:42, Peter Ujfalusi wrote:
> Eduardo, Keerthy,
>
> On 29/10/2020 12.51, Tony Lindgren wrote:
>> * Peter Ujfalusi <[email protected]> [201029 10:03]:
>>> Disabling the notifier fixes the random shutdowns on OMAP4430 (ES2.0 and ES2.1)
>>> but it does not cause any issues on OMAP4460 (PandaES) or OMAP3630 (BeagleXM).
>>> Tony's duovero with OMAP4430 ES2.3 did not ninja-shutdown, but he also has a
>>> constant and steady stream of:
>>> thermal thermal_zone0: failed to read out thermal zone (-5)
>>
>> Works for me and I've verified duovero still keeps hitting core ret idle:
>
> Can you pick this one up for 5.10 to make omap4430-sdp to be usable (to
> not shut down randomly).
> The regression was introduced in 5.10-rc1.
>
>> Tested-by: Tony Lindgren <[email protected]>

Applied as a fix for v5.10-rc



--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
