The AUXADC thermal driver unfortunately has issues with a fixed wait
at probe, as this is not only SoC dependent, but actually depends on
the board (and even aging...): for example, that works fine on the
Chromebook that I have here in my hands but not for the ones in our lab.
Some machines are working fine with that 30ms delay at probe, but some
others are not, hence I started digging in downstream sources here and
there, and found that there actually is a valid temperature range for
at least auxadc-thermal *v1* and can be actually found in multiple
downstream kernels for MT8173 and MT6795.
As for v2 and v3 thermal IP, I'm sure that the v1 range works fine but
I've "left room" for adding specific ranges for them later: this fix
is urgent, as many MT8173 and MT8183 Chromebooks are failing tests in
KernelCI due to thermal shutdown during boot.
For the KernelCI logs, you can look at [1] for 8173, [2] for 8183.
[1]: https://storage.kernelci.org/next/master/next-20230405/arm64/defconfig+arm64-chromebook/gcc-10/lab-collabora/igt-kms-mediatek-mt8173-elm-hana.html
[2]: https://storage.kernelci.org/next/master/next-20230405/arm64/defconfig+arm64-chromebook/gcc-10/lab-collabora/cros-ec-mt8183-kukui-jacuzzi-juniper-sku16.html
AngeloGioacchino Del Regno (2):
Revert "thermal/drivers/mediatek: Add delay after thermal banks
initialization"
thermal/drivers/mediatek: Add temperature constraints to validate read
drivers/thermal/mediatek/auxadc_thermal.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)
--
2.40.0
The AUXADC thermal v1 allows reading temperature range between -20°C to
150°C and any value out of this range is invalid.
Add new definitions for MT8173_TEMP_{MIN_MAX} and a new small helper
mtk_thermal_temp_is_valid() to check if new readings are in range: if
not, we tell to the API that the reading is invalid by returning
THERMAL_TEMP_INVALID.
It was chosen to introduce the helper function because, even though this
temperature range is realistically ok for all, it comes from a downstream
kernel driver for version 1, but here we also support v2 and v3 which may
may have wider constraints.
Signed-off-by: AngeloGioacchino Del Regno <[email protected]>
---
drivers/thermal/mediatek/auxadc_thermal.c | 24 +++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)
diff --git a/drivers/thermal/mediatek/auxadc_thermal.c b/drivers/thermal/mediatek/auxadc_thermal.c
index 3c959a827451..e908c8e9d558 100644
--- a/drivers/thermal/mediatek/auxadc_thermal.c
+++ b/drivers/thermal/mediatek/auxadc_thermal.c
@@ -116,6 +116,10 @@
/* The calibration coefficient of sensor */
#define MT8173_CALIBRATION 165
+/* Valid temperatures range */
+#define MT8173_TEMP_MIN -20000
+#define MT8173_TEMP_MAX 150000
+
/*
* Layout of the fuses providing the calibration data
* These macros could be used for MT8183, MT8173, MT2701, and MT2712.
@@ -689,6 +693,11 @@ static const struct mtk_thermal_data mt7986_thermal_data = {
.version = MTK_THERMAL_V3,
};
+static bool mtk_thermal_temp_is_valid(int temp)
+{
+ return (temp >= MT8173_TEMP_MIN) && (temp <= MT8173_TEMP_MAX);
+}
+
/**
* raw_to_mcelsius_v1 - convert a raw ADC value to mcelsius
* @mt: The thermal controller
@@ -815,14 +824,17 @@ static int mtk_thermal_bank_temperature(struct mtk_thermal_bank *bank)
temp = mt->raw_to_mcelsius(
mt, conf->bank_data[bank->id].sensors[i], raw);
-
/*
- * The first read of a sensor often contains very high bogus
- * temperature value. Filter these out so that the system does
- * not immediately shut down.
+ * Depending on the filt/sen intervals and ADC polling time,
+ * we may need up to 60 milliseconds after initialization: this
+ * will result in the first reading containing an out of range
+ * temperature value.
+ * Validate the reading to both address the aforementioned issue
+ * and to eventually avoid bogus readings during runtime in the
+ * event that the AUXADC gets unstable due to high EMI, etc.
*/
- if (temp > 200000)
- temp = 0;
+ if (!mtk_thermal_temp_is_valid(temp))
+ temp = THERMAL_TEMP_INVALID;
if (temp > max)
max = temp;
--
2.40.0
Some more testing revealed that this commit introduces a regression on some
MT8173 Chromebooks and at least on one MT6795 Sony Xperia M5 smartphone due
to the delay being apparently variable and machine specific.
Another solution would be to delay for a bit more (~70ms) but this is not
feasible for two reasons: first of all, we're adding an even bigger delay
in a probe function; second, some machines need less, some may need even
more, making the msleep at probe solution highly suboptimal.
This reverts commit 10debf8c2da8011c8009dd4b3f6d0ab85891c81b.
Fixes: 10debf8c2da8 ("thermal/drivers/mediatek: Add delay after thermal banks initialization")
Reported-by: "kernelci.org bot" <[email protected]>
Signed-off-by: AngeloGioacchino Del Regno <[email protected]>
---
drivers/thermal/mediatek/auxadc_thermal.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/thermal/mediatek/auxadc_thermal.c b/drivers/thermal/mediatek/auxadc_thermal.c
index b6bb9eaafb74..3c959a827451 100644
--- a/drivers/thermal/mediatek/auxadc_thermal.c
+++ b/drivers/thermal/mediatek/auxadc_thermal.c
@@ -816,6 +816,14 @@ static int mtk_thermal_bank_temperature(struct mtk_thermal_bank *bank)
mt, conf->bank_data[bank->id].sensors[i], raw);
+ /*
+ * The first read of a sensor often contains very high bogus
+ * temperature value. Filter these out so that the system does
+ * not immediately shut down.
+ */
+ if (temp > 200000)
+ temp = 0;
+
if (temp > max)
max = temp;
}
@@ -1273,9 +1281,6 @@ static int mtk_thermal_probe(struct platform_device *pdev)
platform_set_drvdata(pdev, mt);
- /* Delay for thermal banks to be ready */
- msleep(30);
-
tzdev = devm_thermal_of_zone_register(&pdev->dev, 0, mt,
&mtk_thermal_ops);
if (IS_ERR(tzdev)) {
--
2.40.0
On 19/04/2023 08:11, AngeloGioacchino Del Regno wrote:
> The AUXADC thermal driver unfortunately has issues with a fixed wait
> at probe, as this is not only SoC dependent, but actually depends on
> the board (and even aging...): for example, that works fine on the
> Chromebook that I have here in my hands but not for the ones in our lab.
>
> Some machines are working fine with that 30ms delay at probe, but some
> others are not, hence I started digging in downstream sources here and
> there, and found that there actually is a valid temperature range for
> at least auxadc-thermal *v1* and can be actually found in multiple
> downstream kernels for MT8173 and MT6795.
>
> As for v2 and v3 thermal IP, I'm sure that the v1 range works fine but
> I've "left room" for adding specific ranges for them later: this fix
> is urgent, as many MT8173 and MT8183 Chromebooks are failing tests in
> KernelCI due to thermal shutdown during boot.
>
> For the KernelCI logs, you can look at [1] for 8173, [2] for 8183.
>
> [1]: https://storage.kernelci.org/next/master/next-20230405/arm64/defconfig+arm64-chromebook/gcc-10/lab-collabora/igt-kms-mediatek-mt8173-elm-hana.html
> [2]: https://storage.kernelci.org/next/master/next-20230405/arm64/defconfig+arm64-chromebook/gcc-10/lab-collabora/cros-ec-mt8183-kukui-jacuzzi-juniper-sku16.html
>
Applied, thanks
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog