2020-12-30 08:46:25

by Tony Lindgren

[permalink] [raw]
Subject: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

At least for 4430, trying to use the single conversion mode eventually
hangs the thermal sensor. This can be quite easily seen with errors:

thermal thermal_zone0: failed to read out thermal zone (-5)

Also, trying to read the temperature shows a stuck value with:

$ while true; do cat /sys/class/thermal/thermal_zone0/temp; done

Where the temperature is not rising at all with the busy loop.

Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in
single conversion mode while it works fine in continuous conversion mode.
It is also possible that the hung temperature sensor can affect the
thermal shutdown alert too.

Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and
use it for 4430.

Note that we also need to add udelay to for the EOCZ (end of conversion)
bit polling as otherwise we have it time out too early on 4430. We'll be
changing the loop to use iopoll in the following clean-up patch.

Cc: Adam Ford <[email protected]>
Cc: Carl Philipp Klemm <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Merlijn Wajer <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Peter Ujfalusi <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
---
drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++-
drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++--
drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++
3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
--- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
+++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
@@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = {
const struct ti_bandgap_data omap4430_data = {
.features = TI_BANDGAP_FEATURE_MODE_CONFIG |
TI_BANDGAP_FEATURE_CLK_CTRL |
- TI_BANDGAP_FEATURE_POWER_SWITCH,
+ TI_BANDGAP_FEATURE_POWER_SWITCH |
+ TI_BANDGAP_FEATURE_CONT_MODE_ONLY,
.fclock_name = "bandgap_fclk",
.div_ck_name = "bandgap_fclk",
.conv_table = omap4430_adc_to_temp,
diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
--- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c
+++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
@@ -15,6 +15,7 @@
#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/clk.h>
+#include <linux/delay.h>
#include <linux/gpio/consumer.h>
#include <linux/platform_device.h>
#include <linux/err.h>
@@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
u32 counter = 1000;
struct temp_sensor_registers *tsr;

- /* Select single conversion mode */
- if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
+ /* Select continuous or single conversion mode */
+ if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
+ RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1);
+ else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0);

/* Start of Conversion = 1 */
@@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
tsr->bgap_eocz_mask)
break;
+ udelay(1);
}

/* Start of Conversion = 0 */
@@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
tsr->bgap_eocz_mask))
break;
+ udelay(1);
}

return 0;
diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
--- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h
+++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
@@ -280,6 +280,7 @@ struct ti_temp_sensor {
* has Errata 814
* TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too
* inaccurate.
+ * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor
* TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a
* specific feature (above) or not. Return non-zero, if yes.
*/
@@ -295,6 +296,7 @@ struct ti_temp_sensor {
#define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9)
#define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10)
#define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11)
+#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12)
#define TI_BANDGAP_HAS(b, f) \
((b)->conf->features & TI_BANDGAP_FEATURE_ ## f)

--
2.29.2


2020-12-30 08:46:38

by Tony Lindgren

[permalink] [raw]
Subject: [PATCH 3/3] thermal: ti-soc-thermal: Use non-inverted define for omap4

When we set bit 10 high we use continuous mode and not single
mode. Let's correct this to avoid confusion. No functional
changes here, the code does the right thing with bit 10.

Cc: Adam Ford <[email protected]>
Cc: Carl Philipp Klemm <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Merlijn Wajer <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Peter Ujfalusi <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
---
drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 4 ++--
drivers/thermal/ti-soc-thermal/omap4xxx-bandgap.h | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
--- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
+++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
@@ -24,7 +24,7 @@ omap4430_mpu_temp_sensor_registers = {
.bgap_dtemp_mask = OMAP4430_BGAP_TEMP_SENSOR_DTEMP_MASK,

.bgap_mode_ctrl = OMAP4430_TEMP_SENSOR_CTRL_OFFSET,
- .mode_ctrl_mask = OMAP4430_SINGLE_MODE_MASK,
+ .mode_ctrl_mask = OMAP4430_CONTINUOUS_MODE_MASK,

.bgap_efuse = OMAP4430_FUSE_OPP_BGAP,
};
@@ -97,7 +97,7 @@ omap4460_mpu_temp_sensor_registers = {
.mask_cold_mask = OMAP4460_MASK_COLD_MASK,

.bgap_mode_ctrl = OMAP4460_BGAP_CTRL_OFFSET,
- .mode_ctrl_mask = OMAP4460_SINGLE_MODE_MASK,
+ .mode_ctrl_mask = OMAP4460_CONTINUOUS_MODE_MASK,

.bgap_counter = OMAP4460_BGAP_COUNTER_OFFSET,
.counter_mask = OMAP4460_COUNTER_MASK,
diff --git a/drivers/thermal/ti-soc-thermal/omap4xxx-bandgap.h b/drivers/thermal/ti-soc-thermal/omap4xxx-bandgap.h
--- a/drivers/thermal/ti-soc-thermal/omap4xxx-bandgap.h
+++ b/drivers/thermal/ti-soc-thermal/omap4xxx-bandgap.h
@@ -40,7 +40,7 @@
/* OMAP4430.TEMP_SENSOR bits */
#define OMAP4430_BGAP_TEMPSOFF_MASK BIT(12)
#define OMAP4430_BGAP_TSHUT_MASK BIT(11)
-#define OMAP4430_SINGLE_MODE_MASK BIT(10)
+#define OMAP4430_CONTINUOUS_MODE_MASK BIT(10)
#define OMAP4430_BGAP_TEMP_SENSOR_SOC_MASK BIT(9)
#define OMAP4430_BGAP_TEMP_SENSOR_EOCZ_MASK BIT(8)
#define OMAP4430_BGAP_TEMP_SENSOR_DTEMP_MASK (0xff << 0)
@@ -113,7 +113,7 @@
#define OMAP4460_BGAP_TEMP_SENSOR_DTEMP_MASK (0x3ff << 0)

/* OMAP4460.BANDGAP_CTRL bits */
-#define OMAP4460_SINGLE_MODE_MASK BIT(31)
+#define OMAP4460_CONTINUOUS_MODE_MASK BIT(31)
#define OMAP4460_MASK_HOT_MASK BIT(1)
#define OMAP4460_MASK_COLD_MASK BIT(0)

--
2.29.2

2020-12-30 08:48:11

by Tony Lindgren

[permalink] [raw]
Subject: [PATCH 2/3] thermal: ti-soc-thermal: Simplify polling with iopoll

We can use iopoll for checking the EOCZ (end of conversion) bit.

Cc: Adam Ford <[email protected]>
Cc: Carl Philipp Klemm <[email protected]>
Cc: Eduardo Valentin <[email protected]>
Cc: Merlijn Wajer <[email protected]>
Cc: Pavel Machek <[email protected]>
Cc: Peter Ujfalusi <[email protected]>
Cc: Sebastian Reichel <[email protected]>
Signed-off-by: Tony Lindgren <[email protected]>
---
drivers/thermal/ti-soc-thermal/ti-bandgap.c | 30 ++++++++++-----------
1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
--- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c
+++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
@@ -15,7 +15,6 @@
#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/clk.h>
-#include <linux/delay.h>
#include <linux/gpio/consumer.h>
#include <linux/platform_device.h>
#include <linux/err.h>
@@ -27,6 +26,7 @@
#include <linux/of_platform.h>
#include <linux/of_irq.h>
#include <linux/io.h>
+#include <linux/iopoll.h>
#include <linux/cpu_pm.h>
#include <linux/device.h>
#include <linux/pm_runtime.h>
@@ -603,8 +603,9 @@ void *ti_bandgap_get_sensor_data(struct ti_bandgap *bgp, int id)
static int
ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
{
- u32 counter = 1000;
struct temp_sensor_registers *tsr;
+ int error;
+ u32 val;

/* Select continuous or single conversion mode */
if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
@@ -615,27 +616,24 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
/* Start of Conversion = 1 */
RMW_BITS(bgp, id, temp_sensor_ctrl, bgap_soc_mask, 1);

- /* Wait for EOCZ going up */
tsr = bgp->conf->sensors[id].registers;

- while (--counter) {
- if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
- tsr->bgap_eocz_mask)
- break;
- udelay(1);
- }
+ /* Wait for EOCZ going up */
+ error = readl_poll_timeout_atomic(bgp->base + tsr->temp_sensor_ctrl,
+ val, val & tsr->bgap_eocz_mask,
+ 1, 1000);
+ if (error)
+ dev_warn(bgp->dev, "eocz timed out waiting high\n");

/* Start of Conversion = 0 */
RMW_BITS(bgp, id, temp_sensor_ctrl, bgap_soc_mask, 0);

/* Wait for EOCZ going down */
- counter = 1000;
- while (--counter) {
- if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
- tsr->bgap_eocz_mask))
- break;
- udelay(1);
- }
+ error = readl_poll_timeout_atomic(bgp->base + tsr->temp_sensor_ctrl,
+ val, !(val & tsr->bgap_eocz_mask),
+ 1, 1000);
+ if (error)
+ dev_warn(bgp->dev, "eocz timed out waiting low\n");

return 0;
}
--
2.29.2

2020-12-30 12:57:54

by Adam Ford

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <[email protected]> wrote:
>
> At least for 4430, trying to use the single conversion mode eventually
> hangs the thermal sensor. This can be quite easily seen with errors:
>
> thermal thermal_zone0: failed to read out thermal zone (-5)
>
> Also, trying to read the temperature shows a stuck value with:
>
> $ while true; do cat /sys/class/thermal/thermal_zone0/temp; done
>
> Where the temperature is not rising at all with the busy loop.
>
> Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in
> single conversion mode while it works fine in continuous conversion mode.
> It is also possible that the hung temperature sensor can affect the
> thermal shutdown alert too.
>
> Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and
> use it for 4430.
>
> Note that we also need to add udelay to for the EOCZ (end of conversion)
> bit polling as otherwise we have it time out too early on 4430. We'll be
> changing the loop to use iopoll in the following clean-up patch.
>
> Cc: Adam Ford <[email protected]>

I don't have an OMAP4, but if you want, I can test a DM3730.

adam

> Cc: Carl Philipp Klemm <[email protected]>
> Cc: Eduardo Valentin <[email protected]>
> Cc: Merlijn Wajer <[email protected]>
> Cc: Pavel Machek <[email protected]>
> Cc: Peter Ujfalusi <[email protected]>
> Cc: Sebastian Reichel <[email protected]>
> Signed-off-by: Tony Lindgren <[email protected]>
> ---
> drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++-
> drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++--
> drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++
> 3 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
> --- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
> +++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
> @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = {
> const struct ti_bandgap_data omap4430_data = {
> .features = TI_BANDGAP_FEATURE_MODE_CONFIG |
> TI_BANDGAP_FEATURE_CLK_CTRL |
> - TI_BANDGAP_FEATURE_POWER_SWITCH,
> + TI_BANDGAP_FEATURE_POWER_SWITCH |
> + TI_BANDGAP_FEATURE_CONT_MODE_ONLY,
> .fclock_name = "bandgap_fclk",
> .div_ck_name = "bandgap_fclk",
> .conv_table = omap4430_adc_to_temp,
> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c
> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
> @@ -15,6 +15,7 @@
> #include <linux/kernel.h>
> #include <linux/interrupt.h>
> #include <linux/clk.h>
> +#include <linux/delay.h>
> #include <linux/gpio/consumer.h>
> #include <linux/platform_device.h>
> #include <linux/err.h>
> @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
> u32 counter = 1000;
> struct temp_sensor_registers *tsr;
>
> - /* Select single conversion mode */
> - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
> + /* Select continuous or single conversion mode */
> + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
> + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1);
> + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
> RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0);
>
> /* Start of Conversion = 1 */
> @@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
> if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
> tsr->bgap_eocz_mask)
> break;
> + udelay(1);
> }
>
> /* Start of Conversion = 0 */
> @@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
> if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
> tsr->bgap_eocz_mask))
> break;
> + udelay(1);
> }
>
> return 0;
> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h
> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
> @@ -280,6 +280,7 @@ struct ti_temp_sensor {
> * has Errata 814
> * TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too
> * inaccurate.
> + * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor
> * TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a
> * specific feature (above) or not. Return non-zero, if yes.
> */
> @@ -295,6 +296,7 @@ struct ti_temp_sensor {
> #define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9)
> #define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10)
> #define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11)
> +#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12)
> #define TI_BANDGAP_HAS(b, f) \
> ((b)->conf->features & TI_BANDGAP_FEATURE_ ## f)
>
> --
> 2.29.2

2020-12-30 13:30:25

by H. Nikolaus Schaller

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

Hi Adam and Tony,

> Am 30.12.2020 um 13:55 schrieb Adam Ford <[email protected]>:
>
> On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <[email protected]> wrote:
>>
>> At least for 4430, trying to use the single conversion mode eventually
>> hangs the thermal sensor. This can be quite easily seen with errors:
>>
>> thermal thermal_zone0: failed to read out thermal zone (-5)
>>
>> Also, trying to read the temperature shows a stuck value with:
>>
>> $ while true; do cat /sys/class/thermal/thermal_zone0/temp; done
>>
>> Where the temperature is not rising at all with the busy loop.
>>
>> Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in
>> single conversion mode while it works fine in continuous conversion mode.
>> It is also possible that the hung temperature sensor can affect the
>> thermal shutdown alert too.
>>
>> Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and
>> use it for 4430.
>>
>> Note that we also need to add udelay to for the EOCZ (end of conversion)
>> bit polling as otherwise we have it time out too early on 4430. We'll be
>> changing the loop to use iopoll in the following clean-up patch.
>>
>> Cc: Adam Ford <[email protected]>
>
> I don't have an OMAP4, but if you want, I can test a DM3730.

Indeed I remember a similar discussion from the DM3730 [1]. temp values were
always those from the last measurement. E.g. the first one was done
during (cold) boot and the first request after 10 minutes did show a
quite cold system... The next one did show a hot system independent
of what had been between (suspend or high activity).

It seems as if it was even reproducible with a very old kernel on a BeagleBoard.
So it is quite fundamental.

We tried to fix it but did not come to a solution [2]. So we opened an issue
in our tracker [3] and decided to stay with continuous conversion although this
raises idle mode processor load.

BR,
Nikolaus

[1]: https://lists.goldelico.com/pipermail/letux-kernel/2019-September/003958.html
[2]: https://lists.goldelico.com/pipermail/letux-kernel/2019-September/003975.html
[3]: https://projects.goldelico.com/p/gta04-kernel/issues/928/

> adam
>
>> Cc: Carl Philipp Klemm <[email protected]>
>> Cc: Eduardo Valentin <[email protected]>
>> Cc: Merlijn Wajer <[email protected]>
>> Cc: Pavel Machek <[email protected]>
>> Cc: Peter Ujfalusi <[email protected]>
>> Cc: Sebastian Reichel <[email protected]>
>> Signed-off-by: Tony Lindgren <[email protected]>
>> ---
>> drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++-
>> drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++--
>> drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++
>> 3 files changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
>> --- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
>> +++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
>> @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = {
>> const struct ti_bandgap_data omap4430_data = {
>> .features = TI_BANDGAP_FEATURE_MODE_CONFIG |
>> TI_BANDGAP_FEATURE_CLK_CTRL |
>> - TI_BANDGAP_FEATURE_POWER_SWITCH,
>> + TI_BANDGAP_FEATURE_POWER_SWITCH |
>> + TI_BANDGAP_FEATURE_CONT_MODE_ONLY,
>> .fclock_name = "bandgap_fclk",
>> .div_ck_name = "bandgap_fclk",
>> .conv_table = omap4430_adc_to_temp,
>> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
>> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c
>> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
>> @@ -15,6 +15,7 @@
>> #include <linux/kernel.h>
>> #include <linux/interrupt.h>
>> #include <linux/clk.h>
>> +#include <linux/delay.h>
>> #include <linux/gpio/consumer.h>
>> #include <linux/platform_device.h>
>> #include <linux/err.h>
>> @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
>> u32 counter = 1000;
>> struct temp_sensor_registers *tsr;
>>
>> - /* Select single conversion mode */
>> - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
>> + /* Select continuous or single conversion mode */
>> + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
>> + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1);
>> + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
>> RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0);
>>
>> /* Start of Conversion = 1 */
>> @@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
>> if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
>> tsr->bgap_eocz_mask)
>> break;
>> + udelay(1);
>> }
>>
>> /* Start of Conversion = 0 */
>> @@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
>> if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
>> tsr->bgap_eocz_mask))
>> break;
>> + udelay(1);
>> }
>>
>> return 0;
>> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
>> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h
>> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
>> @@ -280,6 +280,7 @@ struct ti_temp_sensor {
>> * has Errata 814
>> * TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too
>> * inaccurate.
>> + * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor
>> * TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a
>> * specific feature (above) or not. Return non-zero, if yes.
>> */
>> @@ -295,6 +296,7 @@ struct ti_temp_sensor {
>> #define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9)
>> #define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10)
>> #define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11)
>> +#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12)
>> #define TI_BANDGAP_HAS(b, f) \
>> ((b)->conf->features & TI_BANDGAP_FEATURE_ ## f)
>>
>> --
>> 2.29.2

2020-12-31 12:58:17

by Péter Ujfalusi

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

Hi Tony,

On 12/30/20 10:43 AM, Tony Lindgren wrote:
> At least for 4430, trying to use the single conversion mode eventually
> hangs the thermal sensor. This can be quite easily seen with errors:
>
> thermal thermal_zone0: failed to read out thermal zone (-5)
>
> Also, trying to read the temperature shows a stuck value with:
>
> $ while true; do cat /sys/class/thermal/thermal_zone0/temp; done
>
> Where the temperature is not rising at all with the busy loop.
>
> Additionally, the EOCZ (end of conversion) bit is not rising on 4430 in
> single conversion mode while it works fine in continuous conversion mode.
> It is also possible that the hung temperature sensor can affect the
> thermal shutdown alert too.
>
> Let's fix the issue by adding TI_BANDGAP_FEATURE_CONT_MODE_ONLY flag and
> use it for 4430.
>
> Note that we also need to add udelay to for the EOCZ (end of conversion)
> bit polling as otherwise we have it time out too early on 4430. We'll be
> changing the loop to use iopoll in the following clean-up patch.

I don't yet have my setup in working condition, so I can not test these.

> Cc: Adam Ford <[email protected]>
> Cc: Carl Philipp Klemm <[email protected]>
> Cc: Eduardo Valentin <[email protected]>
> Cc: Merlijn Wajer <[email protected]>
> Cc: Pavel Machek <[email protected]>
> Cc: Peter Ujfalusi <[email protected]>
> Cc: Sebastian Reichel <[email protected]>
> Signed-off-by: Tony Lindgren <[email protected]>
> ---
> drivers/thermal/ti-soc-thermal/omap4-thermal-data.c | 3 ++-
> drivers/thermal/ti-soc-thermal/ti-bandgap.c | 9 +++++++--
> drivers/thermal/ti-soc-thermal/ti-bandgap.h | 2 ++
> 3 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
> --- a/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
> +++ b/drivers/thermal/ti-soc-thermal/omap4-thermal-data.c
> @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = {
> const struct ti_bandgap_data omap4430_data = {
> .features = TI_BANDGAP_FEATURE_MODE_CONFIG |
> TI_BANDGAP_FEATURE_CLK_CTRL |
> - TI_BANDGAP_FEATURE_POWER_SWITCH,
> + TI_BANDGAP_FEATURE_POWER_SWITCH |
> + TI_BANDGAP_FEATURE_CONT_MODE_ONLY,

Can we add a comment with the observations?

> .fclock_name = "bandgap_fclk",
> .div_ck_name = "bandgap_fclk",
> .conv_table = omap4430_adc_to_temp,
> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.c b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.c
> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.c
> @@ -15,6 +15,7 @@
> #include <linux/kernel.h>
> #include <linux/interrupt.h>
> #include <linux/clk.h>
> +#include <linux/delay.h>
> #include <linux/gpio/consumer.h>
> #include <linux/platform_device.h>
> #include <linux/err.h>
> @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
> u32 counter = 1000;
> struct temp_sensor_registers *tsr;
>
> - /* Select single conversion mode */
> - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
> + /* Select continuous or single conversion mode */
> + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
> + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1);
> + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
> RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0);

Would not be better to:
if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) {
if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1);
else
RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0);
}

One can only switch to cont/single mode if the mode config is possible.

>
> /* Start of Conversion = 1 */
> @@ -619,6 +622,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
> if (ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
> tsr->bgap_eocz_mask)
> break;
> + udelay(1);
> }
>
> /* Start of Conversion = 0 */
> @@ -630,6 +634,7 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
> if (!(ti_bandgap_readl(bgp, tsr->temp_sensor_ctrl) &
> tsr->bgap_eocz_mask))
> break;
> + udelay(1);
> }
>
> return 0;
> diff --git a/drivers/thermal/ti-soc-thermal/ti-bandgap.h b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
> --- a/drivers/thermal/ti-soc-thermal/ti-bandgap.h
> +++ b/drivers/thermal/ti-soc-thermal/ti-bandgap.h
> @@ -280,6 +280,7 @@ struct ti_temp_sensor {
> * has Errata 814
> * TI_BANDGAP_FEATURE_UNRELIABLE - used when the sensor readings are too
> * inaccurate.
> + * TI_BANDGAP_FEATURE_CONT_MODE_ONLY - used when single mode hangs the sensor
> * TI_BANDGAP_HAS(b, f) - macro to check if a bandgap device is capable of a
> * specific feature (above) or not. Return non-zero, if yes.
> */
> @@ -295,6 +296,7 @@ struct ti_temp_sensor {
> #define TI_BANDGAP_FEATURE_HISTORY_BUFFER BIT(9)
> #define TI_BANDGAP_FEATURE_ERRATA_814 BIT(10)
> #define TI_BANDGAP_FEATURE_UNRELIABLE BIT(11)
> +#define TI_BANDGAP_FEATURE_CONT_MODE_ONLY BIT(12)
> #define TI_BANDGAP_HAS(b, f) \
> ((b)->conf->features & TI_BANDGAP_FEATURE_ ## f)
>
>

--
Péter

2021-01-08 07:21:27

by Tony Lindgren

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

* Péter Ujfalusi <[email protected]> [201231 12:55]:
> On 12/30/20 10:43 AM, Tony Lindgren wrote:
> > @@ -58,7 +58,8 @@ omap4430_adc_to_temp[OMAP4430_ADC_END_VALUE - OMAP4430_ADC_START_VALUE + 1] = {
> > const struct ti_bandgap_data omap4430_data = {
> > .features = TI_BANDGAP_FEATURE_MODE_CONFIG |
> > TI_BANDGAP_FEATURE_CLK_CTRL |
> > - TI_BANDGAP_FEATURE_POWER_SWITCH,
> > + TI_BANDGAP_FEATURE_POWER_SWITCH |
> > + TI_BANDGAP_FEATURE_CONT_MODE_ONLY,
>
> Can we add a comment with the observations?

Sure, and I also noticed that the timeout triggers also on dra7
too. I need to recheck what all are affected.. At least we now
see warnings on the SoCs affected.

> > @@ -605,8 +606,10 @@ ti_bandgap_force_single_read(struct ti_bandgap *bgp, int id)
> > u32 counter = 1000;
> > struct temp_sensor_registers *tsr;
> >
> > - /* Select single conversion mode */
> > - if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
> > + /* Select continuous or single conversion mode */
> > + if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
> > + RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1);
> > + else if (TI_BANDGAP_HAS(bgp, MODE_CONFIG))
> > RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0);
>
> Would not be better to:
> if (TI_BANDGAP_HAS(bgp, MODE_CONFIG)) {
> if (TI_BANDGAP_HAS(bgp, CONT_MODE_ONLY))
> RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 1);
> else
> RMW_BITS(bgp, id, bgap_mode_ctrl, mode_ctrl_mask, 0);
> }
>
> One can only switch to cont/single mode if the mode config is possible.

Yup makes sense thanks for spotting that.

Regards,

Tony

2021-01-08 07:24:59

by Tony Lindgren

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

* H. Nikolaus Schaller <[email protected]> [201230 13:29]:
> > Am 30.12.2020 um 13:55 schrieb Adam Ford <[email protected]>:
> > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <[email protected]> wrote:
> >>
> >> At least for 4430, trying to use the single conversion mode eventually
> >> hangs the thermal sensor. This can be quite easily seen with errors:
> >>
> >> thermal thermal_zone0: failed to read out thermal zone (-5)
...

> > I don't have an OMAP4, but if you want, I can test a DM3730.
>
> Indeed I remember a similar discussion from the DM3730 [1]. temp values were
> always those from the last measurement. E.g. the first one was done
> during (cold) boot and the first request after 10 minutes did show a
> quite cold system... The next one did show a hot system independent
> of what had been between (suspend or high activity).
>
> It seems as if it was even reproducible with a very old kernel on a BeagleBoard.
> So it is quite fundamental.
>
> We tried to fix it but did not come to a solution [2]. So we opened an issue
> in our tracker [3] and decided to stay with continuous conversion although this
> raises idle mode processor load.

Hmm so maybe eocz high always times out in single mode since it also
triggers at least on dra7?

Yes it would be great if you guys can the $subject patch a try at
least on your omap36xx and omap5 boards and see if you see eocz
time out warnings in dmesg.

Regards,

Tony

2021-01-08 13:47:50

by Adam Ford

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

On Fri, Jan 8, 2021 at 1:22 AM Tony Lindgren <[email protected]> wrote:
>
> * H. Nikolaus Schaller <[email protected]> [201230 13:29]:
> > > Am 30.12.2020 um 13:55 schrieb Adam Ford <[email protected]>:
> > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <[email protected]> wrote:
> > >>
> > >> At least for 4430, trying to use the single conversion mode eventually
> > >> hangs the thermal sensor. This can be quite easily seen with errors:
> > >>
> > >> thermal thermal_zone0: failed to read out thermal zone (-5)
> ...
>
> > > I don't have an OMAP4, but if you want, I can test a DM3730.
> >
> > Indeed I remember a similar discussion from the DM3730 [1]. temp values were
> > always those from the last measurement. E.g. the first one was done
> > during (cold) boot and the first request after 10 minutes did show a
> > quite cold system... The next one did show a hot system independent
> > of what had been between (suspend or high activity).
> >
> > It seems as if it was even reproducible with a very old kernel on a BeagleBoard.
> > So it is quite fundamental.
> >
> > We tried to fix it but did not come to a solution [2]. So we opened an issue
> > in our tracker [3] and decided to stay with continuous conversion although this
> > raises idle mode processor load.
>
> Hmm so maybe eocz high always times out in single mode since it also
> triggers at least on dra7?
>
> Yes it would be great if you guys can the $subject patch a try at
> least on your omap36xx and omap5 boards and see if you see eocz
> time out warnings in dmesg.

I should be able to try it on the dm3730 logicpd-torpedo kit this weekend.

adam
>
> Regards,
>
> Tony

2021-01-08 18:33:52

by Adam Ford

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

On Fri, Jan 8, 2021 at 7:45 AM Adam Ford <[email protected]> wrote:
>
> On Fri, Jan 8, 2021 at 1:22 AM Tony Lindgren <[email protected]> wrote:
> >
> > * H. Nikolaus Schaller <[email protected]> [201230 13:29]:
> > > > Am 30.12.2020 um 13:55 schrieb Adam Ford <[email protected]>:
> > > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <[email protected]> wrote:
> > > >>
> > > >> At least for 4430, trying to use the single conversion mode eventually
> > > >> hangs the thermal sensor. This can be quite easily seen with errors:
> > > >>
> > > >> thermal thermal_zone0: failed to read out thermal zone (-5)
> > ...
> >
> > > > I don't have an OMAP4, but if you want, I can test a DM3730.
> > >
> > > Indeed I remember a similar discussion from the DM3730 [1]. temp values were
> > > always those from the last measurement. E.g. the first one was done
> > > during (cold) boot and the first request after 10 minutes did show a
> > > quite cold system... The next one did show a hot system independent
> > > of what had been between (suspend or high activity).
> > >
> > > It seems as if it was even reproducible with a very old kernel on a BeagleBoard.
> > > So it is quite fundamental.
> > >
> > > We tried to fix it but did not come to a solution [2]. So we opened an issue
> > > in our tracker [3] and decided to stay with continuous conversion although this
> > > raises idle mode processor load.
> >
> > Hmm so maybe eocz high always times out in single mode since it also
> > triggers at least on dra7?
> >
> > Yes it would be great if you guys can the $subject patch a try at
> > least on your omap36xx and omap5 boards and see if you see eocz
> > time out warnings in dmesg.
>
> I should be able to try it on the dm3730 logicpd-torpedo kit this weekend.

I am going to be a bit delayed testing this. I cannot boot omap2plus
using Linux version 5.11.0-rc2.

[ 2.666748] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xbc
[ 2.673309] nand: Micron MT29F4G16ABBDA3W
[ 2.677368] nand: 512 MiB, SLC, erase size: 128 KiB, page size:
2048, OOB size: 64
[ 2.685119] nand: using OMAP_ECC_BCH8_CODE_HW_DETECTION_SW
[ 2.693237] Invalid ECC layout
[ 2.696350] omap2-nand 30000000.nand: unable to use BCH library
[ 2.702575] omap2-nand: probe of 30000000.nand failed with error -22
[ 2.716094] 8<--- cut here ---
[ 2.719207] Unable to handle kernel NULL pointer dereference at
virtual address 00000018
[ 2.727600] pgd = (ptrval)
...
[ 3.050933] ---[ end trace 59640c7399a80a07 ]---
[ 3.055603] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0000000b
[ 3.063323] ---[ end Kernel panic - not syncing: Attempted to kill
init! exitcode=0x0000000b ]---

Once I get past this, I'll try to test the thermal stuff.

adam

>
> adam
> >
> > Regards,
> >
> > Tony

2021-01-08 19:45:58

by Adam Ford

[permalink] [raw]
Subject: Re: [PATCH 1/3] thermal: ti-soc-thermal: Fix stuck sensor with continuous mode for 4430

On Fri, Jan 8, 2021 at 12:31 PM Adam Ford <[email protected]> wrote:
>
> On Fri, Jan 8, 2021 at 7:45 AM Adam Ford <[email protected]> wrote:
> >
> > On Fri, Jan 8, 2021 at 1:22 AM Tony Lindgren <[email protected]> wrote:
> > >
> > > * H. Nikolaus Schaller <[email protected]> [201230 13:29]:
> > > > > Am 30.12.2020 um 13:55 schrieb Adam Ford <[email protected]>:
> > > > > On Wed, Dec 30, 2020 at 2:43 AM Tony Lindgren <[email protected]> wrote:
> > > > >>
> > > > >> At least for 4430, trying to use the single conversion mode eventually
> > > > >> hangs the thermal sensor. This can be quite easily seen with errors:
> > > > >>
> > > > >> thermal thermal_zone0: failed to read out thermal zone (-5)
> > > ...
> > >
> > > > > I don't have an OMAP4, but if you want, I can test a DM3730.
> > > >
> > > > Indeed I remember a similar discussion from the DM3730 [1]. temp values were
> > > > always those from the last measurement. E.g. the first one was done
> > > > during (cold) boot and the first request after 10 minutes did show a
> > > > quite cold system... The next one did show a hot system independent
> > > > of what had been between (suspend or high activity).
> > > >
> > > > It seems as if it was even reproducible with a very old kernel on a BeagleBoard.
> > > > So it is quite fundamental.
> > > >
> > > > We tried to fix it but did not come to a solution [2]. So we opened an issue
> > > > in our tracker [3] and decided to stay with continuous conversion although this
> > > > raises idle mode processor load.
> > >
> > > Hmm so maybe eocz high always times out in single mode since it also
> > > triggers at least on dra7?
> > >
> > > Yes it would be great if you guys can the $subject patch a try at
> > > least on your omap36xx and omap5 boards and see if you see eocz
> > > time out warnings in dmesg.


I do see chatter.

[ 15.531005] ti-soc-thermal 48002524.bandgap: eocz timed out waiting low
[ 16.571075] ti-soc-thermal 48002524.bandgap: eocz timed out waiting low
[ 17.610961] ti-soc-thermal 48002524.bandgap: eocz timed out waiting low

and it repeats quite often.

I would say this patch series would cause a regression on the DM3730.

adam


> >
> > I should be able to try it on the dm3730 logicpd-torpedo kit this weekend.
>
> I am going to be a bit delayed testing this. I cannot boot omap2plus
> using Linux version 5.11.0-rc2.
>
> [ 2.666748] nand: device found, Manufacturer ID: 0x2c, Chip ID: 0xbc
> [ 2.673309] nand: Micron MT29F4G16ABBDA3W
> [ 2.677368] nand: 512 MiB, SLC, erase size: 128 KiB, page size:
> 2048, OOB size: 64
> [ 2.685119] nand: using OMAP_ECC_BCH8_CODE_HW_DETECTION_SW
> [ 2.693237] Invalid ECC layout
> [ 2.696350] omap2-nand 30000000.nand: unable to use BCH library
> [ 2.702575] omap2-nand: probe of 30000000.nand failed with error -22
> [ 2.716094] 8<--- cut here ---
> [ 2.719207] Unable to handle kernel NULL pointer dereference at
> virtual address 00000018
> [ 2.727600] pgd = (ptrval)
> ...
> [ 3.050933] ---[ end trace 59640c7399a80a07 ]---
> [ 3.055603] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> [ 3.063323] ---[ end Kernel panic - not syncing: Attempted to kill
> init! exitcode=0x0000000b ]---
>
> Once I get past this, I'll try to test the thermal stuff.
>
> adam
>
> >
> > adam
> > >
> > > Regards,
> > >
> > > Tony