Subject: [PATCH v2] thermal: Fix a NULL pointer dereference

of_parse_thermal_zones() parses the thermal-zones node and registers a
thermal_zone device for each subnode. However, if a thermal zone is
consuming a thermal sensor and that thermal sensor device hasn't probed
yet, an attempt to set trip_point_*_temp for that thermal zone device
can cause a NULL pointer dereference. Fix it.

console:/sys/class/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp
...
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
...
Call trace:
of_thermal_set_trip_temp+0x40/0xc4
trip_point_temp_store+0xc0/0x1dc
dev_attr_store+0x38/0x88
sysfs_kf_write+0x64/0xc0
kernfs_fop_write_iter+0x108/0x1d0
vfs_write+0x2f4/0x368
ksys_write+0x7c/0xec
__arm64_sys_write+0x20/0x30
el0_svc_common.llvm.7279915941325364641+0xbc/0x1bc
do_el0_svc+0x28/0xa0
el0_svc+0x14/0x24
el0_sync_handler+0x88/0xec
el0_sync+0x1c0/0x200

While at it, fix the possible NULL pointer dereference in other
functions as well: of_thermal_get_temp(), of_thermal_set_emul_temp(),
of_thermal_get_trend().

Cc: [email protected]
Suggested-by: David Collins <[email protected]>
Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
---
Changes for v2:
- Added checks in of_thermal_get_temp(), of_thermal_set_emul_temp(), of_thermal_get_trend().

drivers/thermal/thermal_of.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
index 6379f26..9233f7e 100644
--- a/drivers/thermal/thermal_of.c
+++ b/drivers/thermal/thermal_of.c
@@ -89,7 +89,7 @@ static int of_thermal_get_temp(struct thermal_zone_device *tz,
{
struct __thermal_zone *data = tz->devdata;

- if (!data->ops->get_temp)
+ if (!data->ops || !data->ops->get_temp)
return -EINVAL;

return data->ops->get_temp(data->sensor_data, temp);
@@ -186,6 +186,9 @@ static int of_thermal_set_emul_temp(struct thermal_zone_device *tz,
{
struct __thermal_zone *data = tz->devdata;

+ if (!data->ops || !data->ops->set_emul_temp)
+ return -EINVAL;
+
return data->ops->set_emul_temp(data->sensor_data, temp);
}

@@ -194,7 +197,7 @@ static int of_thermal_get_trend(struct thermal_zone_device *tz, int trip,
{
struct __thermal_zone *data = tz->devdata;

- if (!data->ops->get_trend)
+ if (!data->ops || !data->ops->get_trend)
return -EINVAL;

return data->ops->get_trend(data->sensor_data, trip, trend);
@@ -301,7 +304,7 @@ static int of_thermal_set_trip_temp(struct thermal_zone_device *tz, int trip,
if (trip >= data->ntrips || trip < 0)
return -EDOM;

- if (data->ops->set_trip_temp) {
+ if (data->ops && data->ops->set_trip_temp) {
int ret;

ret = data->ops->set_trip_temp(data->sensor_data, trip, temp);
--
2.7.4


2021-09-17 13:35:32

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 07/09/2021 21:01, Subbaraman Narayanamurthy wrote:
> of_parse_thermal_zones() parses the thermal-zones node and registers a
> thermal_zone device for each subnode. However, if a thermal zone is
> consuming a thermal sensor and that thermal sensor device hasn't probed
> yet, an attempt to set trip_point_*_temp for that thermal zone device
> can cause a NULL pointer dereference. Fix it.
>
> console:/sys/class/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp
> ...
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
> ...
> Call trace:
> of_thermal_set_trip_temp+0x40/0xc4
> trip_point_temp_store+0xc0/0x1dc
> dev_attr_store+0x38/0x88
> sysfs_kf_write+0x64/0xc0
> kernfs_fop_write_iter+0x108/0x1d0
> vfs_write+0x2f4/0x368
> ksys_write+0x7c/0xec
> __arm64_sys_write+0x20/0x30
> el0_svc_common.llvm.7279915941325364641+0xbc/0x1bc
> do_el0_svc+0x28/0xa0
> el0_svc+0x14/0x24
> el0_sync_handler+0x88/0xec
> el0_sync+0x1c0/0x200
>
> While at it, fix the possible NULL pointer dereference in other
> functions as well: of_thermal_get_temp(), of_thermal_set_emul_temp(),
> of_thermal_get_trend().
>
> Cc: [email protected]
> Suggested-by: David Collins <[email protected]>
> Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
> ---
> Changes for v2:
> - Added checks in of_thermal_get_temp(), of_thermal_set_emul_temp(), of_thermal_get_trend().
>
> drivers/thermal/thermal_of.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
> index 6379f26..9233f7e 100644
> --- a/drivers/thermal/thermal_of.c
> +++ b/drivers/thermal/thermal_of.c
> @@ -89,7 +89,7 @@ static int of_thermal_get_temp(struct thermal_zone_device *tz,
> {
> struct __thermal_zone *data = tz->devdata;
>
> - if (!data->ops->get_temp)
> + if (!data->ops || !data->ops->get_temp)

comment (1)

AFAICT, if data->ops != NULL then data->ops->get_temp is also != NULL
because of the code allocating/freeing the ops structure.

The tests can be replaced by (!data->ops), no ?

> return -EINVAL;
>
> return data->ops->get_temp(data->sensor_data, temp);
> @@ -186,6 +186,9 @@ static int of_thermal_set_emul_temp(struct thermal_zone_device *tz,
> {
> struct __thermal_zone *data = tz->devdata;
>
> + if (!data->ops || !data->ops->set_emul_temp)
> + return -EINVAL;
> +

comment (2)

The test looks pointless here (I mean both of them).

If of_thermal_set_emul_temp() is called it is because the callback was
set in thermal_zone_of_add_sensor().

This one does:

tz->ops = ops;

and
if (ops->set_emul_temp)
tzd->ops->set_emul_temp = of_thermal_set_emul_temp;

If I'm not wrong if we are called, then data->ops &&
data->ops->set_emul_temp is always true, right ?


> return data->ops->set_emul_temp(data->sensor_data, temp);
> }
>
> @@ -194,7 +197,7 @@ static int of_thermal_get_trend(struct thermal_zone_device *tz, int trip,
> {
> struct __thermal_zone *data = tz->devdata;
>
> - if (!data->ops->get_trend)
> + if (!data->ops || !data->ops->get_trend)
> return -EINVAL;

Same comment as (1)

> return data->ops->get_trend(data->sensor_data, trip, trend);
> @@ -301,7 +304,7 @@ static int of_thermal_set_trip_temp(struct thermal_zone_device *tz, int trip,
> if (trip >= data->ntrips || trip < 0)
> return -EDOM;
>
> - if (data->ops->set_trip_temp) {
> + if (data->ops && data->ops->set_trip_temp) {

Same comment as (2)

> int ret;
>
> ret = data->ops->set_trip_temp(data->sensor_data, trip, temp);
>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 9/17/21 2:31 AM, Daniel Lezcano wrote:
> On 07/09/2021 21:01, Subbaraman Narayanamurthy wrote:
>> of_parse_thermal_zones() parses the thermal-zones node and registers a
>> thermal_zone device for each subnode. However, if a thermal zone is
>> consuming a thermal sensor and that thermal sensor device hasn't probed
>> yet, an attempt to set trip_point_*_temp for that thermal zone device
>> can cause a NULL pointer dereference. Fix it.
>>
>> console:/sys/class/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp
>> ...
>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
>> ...
>> Call trace:
>> of_thermal_set_trip_temp+0x40/0xc4
>> trip_point_temp_store+0xc0/0x1dc
>> dev_attr_store+0x38/0x88
>> sysfs_kf_write+0x64/0xc0
>> kernfs_fop_write_iter+0x108/0x1d0
>> vfs_write+0x2f4/0x368
>> ksys_write+0x7c/0xec
>> __arm64_sys_write+0x20/0x30
>> el0_svc_common.llvm.7279915941325364641+0xbc/0x1bc
>> do_el0_svc+0x28/0xa0
>> el0_svc+0x14/0x24
>> el0_sync_handler+0x88/0xec
>> el0_sync+0x1c0/0x200
>>
>> While at it, fix the possible NULL pointer dereference in other
>> functions as well: of_thermal_get_temp(), of_thermal_set_emul_temp(),
>> of_thermal_get_trend().
>>
>> Cc: [email protected]
>> Suggested-by: David Collins <[email protected]>
>> Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
>> ---
>> Changes for v2:
>> - Added checks in of_thermal_get_temp(), of_thermal_set_emul_temp(), of_thermal_get_trend().
>>
>> drivers/thermal/thermal_of.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
>> index 6379f26..9233f7e 100644
>> --- a/drivers/thermal/thermal_of.c
>> +++ b/drivers/thermal/thermal_of.c
>> @@ -89,7 +89,7 @@ static int of_thermal_get_temp(struct thermal_zone_device *tz,
>> {
>> struct __thermal_zone *data = tz->devdata;
>>
>> - if (!data->ops->get_temp)
>> + if (!data->ops || !data->ops->get_temp)
> comment (1)
>
> AFAICT, if data->ops != NULL then data->ops->get_temp is also != NULL
> because of the code allocating/freeing the ops structure.
>
> The tests can be replaced by (!data->ops), no ?

Thanks Daniel for reviewing the patch.

I agree that even if a sensor module is unregistered, that would call
"thermal_zone_of_sensor_unregister" which would eventually set NULL on
get_temp() and get_trend() and "tzd->ops" as well.

However, of_thermal_get_temp() is trying to call "data->ops->get_temp"
which comes from a sensor driver when it registers. There is no
guarantee that it would be non-NULL right?

Thinking of which, I think having both checks would be valid.

>
>> return -EINVAL;
>>
>> return data->ops->get_temp(data->sensor_data, temp);
>> @@ -186,6 +186,9 @@ static int of_thermal_set_emul_temp(struct thermal_zone_device *tz,
>> {
>> struct __thermal_zone *data = tz->devdata;
>>
>> + if (!data->ops || !data->ops->set_emul_temp)
>> + return -EINVAL;
>> +
> comment (2)
>
> The test looks pointless here (I mean both of them).
>
> If of_thermal_set_emul_temp() is called it is because the callback was
> set in thermal_zone_of_add_sensor().
>
> This one does:
>
> tz->ops = ops;
>
> and
> if (ops->set_emul_temp)
> tzd->ops->set_emul_temp = of_thermal_set_emul_temp;
>
> If I'm not wrong if we are called, then data->ops &&
> data->ops->set_emul_temp is always true, right ?
>

I've not exercised this condition yet. However, the original problem
we've observed was when thermal HAL was trying to set trip thresholds
on a thermal zone device for which the sensor device is not probed yet.
This had happened randomly because of vendor modules taking time to be
loaded and probed. I don't know if there would be any userspace entity
that can try to set emulated temperature for a thermal zone even before
a sensor device is probed.

Without a sensor driver probed, "tz->ops" would not have a valid pointer
right? So, I think checking for "data->ops" should be good.

Another possibility is, a sensor might not have "set_emul_temp" callback.
So checking for "ops->set_emul_temp" should be still valid.

>> return data->ops->set_emul_temp(data->sensor_data, temp);
>> }
>>
>> @@ -194,7 +197,7 @@ static int of_thermal_get_trend(struct thermal_zone_device *tz, int trip,
>> {
>> struct __thermal_zone *data = tz->devdata;
>>
>> - if (!data->ops->get_trend)
>> + if (!data->ops || !data->ops->get_trend)
>> return -EINVAL;
> Same comment as (1)

of_thermal_get_trend() is trying to call "data->ops->get_trend" which
comes from a sensor driver when it registers. From what I can see,
there are lot of drivers which don't pass "get_trend" in their ops.
So there is no guarantee that it would be non-NULL right?

Thinking of which, I think having both checks would be valid.

>
>> return data->ops->get_trend(data->sensor_data, trip, trend);
>> @@ -301,7 +304,7 @@ static int of_thermal_set_trip_temp(struct thermal_zone_device *tz, int trip,
>> if (trip >= data->ntrips || trip < 0)
>> return -EDOM;
>>
>> - if (data->ops->set_trip_temp) {
>> + if (data->ops && data->ops->set_trip_temp) {
> Same comment as (2)

Without a sensor driver probed, "tz->ops" would not have a valid pointer right?
So, I think checking for "data->ops" should be good. Another possibility is, a
sensor device might not have "set_trip_temp" callback but just "set_trips".

So checking for "data->ops->set_trip_temp" might be still valid.

>
>> int ret;
>>
>> ret = data->ops->set_trip_temp(data->sensor_data, trip, temp);
>>
>

Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 9/17/21 1:06 PM, Subbaraman Narayanamurthy wrote:
> On 9/17/21 2:31 AM, Daniel Lezcano wrote:
>> On 07/09/2021 21:01, Subbaraman Narayanamurthy wrote:
>>> of_parse_thermal_zones() parses the thermal-zones node and registers a
>>> thermal_zone device for each subnode. However, if a thermal zone is
>>> consuming a thermal sensor and that thermal sensor device hasn't probed
>>> yet, an attempt to set trip_point_*_temp for that thermal zone device
>>> can cause a NULL pointer dereference. Fix it.
>>>
>>> console:/sys/class/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp
>>> ...
>>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
>>> ...
>>> Call trace:
>>> of_thermal_set_trip_temp+0x40/0xc4
>>> trip_point_temp_store+0xc0/0x1dc
>>> dev_attr_store+0x38/0x88
>>> sysfs_kf_write+0x64/0xc0
>>> kernfs_fop_write_iter+0x108/0x1d0
>>> vfs_write+0x2f4/0x368
>>> ksys_write+0x7c/0xec
>>> __arm64_sys_write+0x20/0x30
>>> el0_svc_common.llvm.7279915941325364641+0xbc/0x1bc
>>> do_el0_svc+0x28/0xa0
>>> el0_svc+0x14/0x24
>>> el0_sync_handler+0x88/0xec
>>> el0_sync+0x1c0/0x200
>>>
>>> While at it, fix the possible NULL pointer dereference in other
>>> functions as well: of_thermal_get_temp(), of_thermal_set_emul_temp(),
>>> of_thermal_get_trend().
>>>
>>> Cc: [email protected]
>>> Suggested-by: David Collins <[email protected]>
>>> Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
>>> ---
>>> Changes for v2:
>>> - Added checks in of_thermal_get_temp(), of_thermal_set_emul_temp(), of_thermal_get_trend().
>>>
>>> drivers/thermal/thermal_of.c | 9 ++++++---
>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
>>> index 6379f26..9233f7e 100644
>>> --- a/drivers/thermal/thermal_of.c
>>> +++ b/drivers/thermal/thermal_of.c
>>> @@ -89,7 +89,7 @@ static int of_thermal_get_temp(struct thermal_zone_device *tz,
>>> {
>>> struct __thermal_zone *data = tz->devdata;
>>>
>>> - if (!data->ops->get_temp)
>>> + if (!data->ops || !data->ops->get_temp)
>> comment (1)
>>
>> AFAICT, if data->ops != NULL then data->ops->get_temp is also != NULL
>> because of the code allocating/freeing the ops structure.
>>
>> The tests can be replaced by (!data->ops), no ?
> Thanks Daniel for reviewing the patch.
>
> I agree that even if a sensor module is unregistered, that would call
> "thermal_zone_of_sensor_unregister" which would eventually set NULL on
> get_temp() and get_trend() and "tzd->ops" as well.
>
> However, of_thermal_get_temp() is trying to call "data->ops->get_temp"
> which comes from a sensor driver when it registers. There is no
> guarantee that it would be non-NULL right?
>
> Thinking of which, I think having both checks would be valid.

Hi Daniel,
Do you still have any concerns with this change?

Thanks,
Subbaraman

>>> return -EINVAL;
>>>
>>> return data->ops->get_temp(data->sensor_data, temp);
>>> @@ -186,6 +186,9 @@ static int of_thermal_set_emul_temp(struct thermal_zone_device *tz,
>>> {
>>> struct __thermal_zone *data = tz->devdata;
>>>
>>> + if (!data->ops || !data->ops->set_emul_temp)
>>> + return -EINVAL;
>>> +
>> comment (2)
>>
>> The test looks pointless here (I mean both of them).
>>
>> If of_thermal_set_emul_temp() is called it is because the callback was
>> set in thermal_zone_of_add_sensor().
>>
>> This one does:
>>
>> tz->ops = ops;
>>
>> and
>> if (ops->set_emul_temp)
>> tzd->ops->set_emul_temp = of_thermal_set_emul_temp;
>>
>> If I'm not wrong if we are called, then data->ops &&
>> data->ops->set_emul_temp is always true, right ?
>>
> I've not exercised this condition yet. However, the original problem
> we've observed was when thermal HAL was trying to set trip thresholds
> on a thermal zone device for which the sensor device is not probed yet.
> This had happened randomly because of vendor modules taking time to be
> loaded and probed. I don't know if there would be any userspace entity
> that can try to set emulated temperature for a thermal zone even before
> a sensor device is probed.
>
> Without a sensor driver probed, "tz->ops" would not have a valid pointer
> right? So, I think checking for "data->ops" should be good.
>
> Another possibility is, a sensor might not have "set_emul_temp" callback.
> So checking for "ops->set_emul_temp" should be still valid.
>
>>> return data->ops->set_emul_temp(data->sensor_data, temp);
>>> }
>>>
>>> @@ -194,7 +197,7 @@ static int of_thermal_get_trend(struct thermal_zone_device *tz, int trip,
>>> {
>>> struct __thermal_zone *data = tz->devdata;
>>>
>>> - if (!data->ops->get_trend)
>>> + if (!data->ops || !data->ops->get_trend)
>>> return -EINVAL;
>> Same comment as (1)
> of_thermal_get_trend() is trying to call "data->ops->get_trend" which
> comes from a sensor driver when it registers. From what I can see,
> there are lot of drivers which don't pass "get_trend" in their ops.
> So there is no guarantee that it would be non-NULL right?
>
> Thinking of which, I think having both checks would be valid.
>
>>> return data->ops->get_trend(data->sensor_data, trip, trend);
>>> @@ -301,7 +304,7 @@ static int of_thermal_set_trip_temp(struct thermal_zone_device *tz, int trip,
>>> if (trip >= data->ntrips || trip < 0)
>>> return -EDOM;
>>>
>>> - if (data->ops->set_trip_temp) {
>>> + if (data->ops && data->ops->set_trip_temp) {
>> Same comment as (2)
> Without a sensor driver probed, "tz->ops" would not have a valid pointer right?
> So, I think checking for "data->ops" should be good. Another possibility is, a
> sensor device might not have "set_trip_temp" callback but just "set_trips".
>
> So checking for "data->ops->set_trip_temp" might be still valid.
>
>>> int ret;
>>>
>>> ret = data->ops->set_trip_temp(data->sensor_data, trip, temp);
>>>

2021-10-06 09:42:01

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 06/10/2021 00:09, Subbaraman Narayanamurthy wrote:

[ ... ]

>>> The tests can be replaced by (!data->ops), no ?
>> Thanks Daniel for reviewing the patch.
>>
>> I agree that even if a sensor module is unregistered, that would call
>> "thermal_zone_of_sensor_unregister" which would eventually set NULL on
>> get_temp() and get_trend() and "tzd->ops" as well.
>>
>> However, of_thermal_get_temp() is trying to call "data->ops->get_temp"
>> which comes from a sensor driver when it registers. There is no
>> guarantee that it would be non-NULL right?
>>
>> Thinking of which, I think having both checks would be valid.
>
> Hi Daniel,
> Do you still have any concerns with this change?

Yes, let me answer to the initial patch.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

2021-10-06 11:11:17

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 07/09/2021 21:01, Subbaraman Narayanamurthy wrote:
> of_parse_thermal_zones() parses the thermal-zones node and registers a
> thermal_zone device for each subnode. However, if a thermal zone is
> consuming a thermal sensor and that thermal sensor device hasn't probed
> yet, an attempt to set trip_point_*_temp for that thermal zone device
> can cause a NULL pointer dereference. Fix it.
>
> console:/sys/class/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp
> ...
> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020

I'm still not convinced by the changes.

Could please tell the commit-id where this is happening and give the
procedure to reproduce the bug ?


> ...
> Call trace:
> of_thermal_set_trip_temp+0x40/0xc4
> trip_point_temp_store+0xc0/0x1dc
> dev_attr_store+0x38/0x88
> sysfs_kf_write+0x64/0xc0
> kernfs_fop_write_iter+0x108/0x1d0
> vfs_write+0x2f4/0x368
> ksys_write+0x7c/0xec
> __arm64_sys_write+0x20/0x30
> el0_svc_common.llvm.7279915941325364641+0xbc/0x1bc
> do_el0_svc+0x28/0xa0
> el0_svc+0x14/0x24
> el0_sync_handler+0x88/0xec
> el0_sync+0x1c0/0x200
>
> While at it, fix the possible NULL pointer dereference in other
> functions as well: of_thermal_get_temp(), of_thermal_set_emul_temp(),
> of_thermal_get_trend().
>
> Cc: [email protected]
> Suggested-by: David Collins <[email protected]>
> Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
> ---
> Changes for v2:
> - Added checks in of_thermal_get_temp(), of_thermal_set_emul_temp(), of_thermal_get_trend().
>
> drivers/thermal/thermal_of.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
> index 6379f26..9233f7e 100644
> --- a/drivers/thermal/thermal_of.c
> +++ b/drivers/thermal/thermal_of.c
> @@ -89,7 +89,7 @@ static int of_thermal_get_temp(struct thermal_zone_device *tz,
> {
> struct __thermal_zone *data = tz->devdata;
>
> - if (!data->ops->get_temp)
> + if (!data->ops || !data->ops->get_temp)
> return -EINVAL;
>
> return data->ops->get_temp(data->sensor_data, temp);
> @@ -186,6 +186,9 @@ static int of_thermal_set_emul_temp(struct thermal_zone_device *tz,
> {
> struct __thermal_zone *data = tz->devdata;
>
> + if (!data->ops || !data->ops->set_emul_temp)
> + return -EINVAL;
> +
> return data->ops->set_emul_temp(data->sensor_data, temp);
> }
>
> @@ -194,7 +197,7 @@ static int of_thermal_get_trend(struct thermal_zone_device *tz, int trip,
> {
> struct __thermal_zone *data = tz->devdata;
>
> - if (!data->ops->get_trend)
> + if (!data->ops || !data->ops->get_trend)
> return -EINVAL;
>
> return data->ops->get_trend(data->sensor_data, trip, trend);
> @@ -301,7 +304,7 @@ static int of_thermal_set_trip_temp(struct thermal_zone_device *tz, int trip,
> if (trip >= data->ntrips || trip < 0)
> return -EDOM;
>
> - if (data->ops->set_trip_temp) {
> + if (data->ops && data->ops->set_trip_temp) {
> int ret;
>
> ret = data->ops->set_trip_temp(data->sensor_data, trip, temp);
>


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 10/6/21 4:08 AM, Daniel Lezcano wrote:
> On 07/09/2021 21:01, Subbaraman Narayanamurthy wrote:
>> of_parse_thermal_zones() parses the thermal-zones node and registers a
>> thermal_zone device for each subnode. However, if a thermal zone is
>> consuming a thermal sensor and that thermal sensor device hasn't probed
>> yet, an attempt to set trip_point_*_temp for that thermal zone device
>> can cause a NULL pointer dereference. Fix it.
>>
>> console:/sys/class/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp
>> ...
>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
> I'm still not convinced by the changes.
>
> Could please tell the commit-id where this is happening and give the
> procedure to reproduce the bug ?
>

Here is the commit id where this problem was reported.
https://android.googlesource.com/kernel/common/+/997d24a932a9b6e2040f39a8dd76e873e6519a1c

BTW, this problem was not 100% reproducible but seems to be a race condition when vendor modules are loaded and thermal HAL or userspace thermal SW is attempting to set trip points on one of the thermal zones and that sensor driver supplying that thermal zone have not completed probing.

I was able to reproduce the problem manually by disabling the pmk8350_adc_tm device in device tree which supplies to some thermal zone devices (e.g. xo-therm below).

pmk8350_adc_tm: adc_tm@3400 {                                  
    compatible = "qcom,adc-tm7";                           
    reg = <0x3400>;                                        
    interrupts = <0x0 0x34 0x0 IRQ_TYPE_EDGE_RISING>;         
    interrupt-names = "threshold";                         
    #address-cells = <1>;                                  
    #size-cells = <0>;                                     
    #thermal-sensor-cells = <1>;                           
    status = "disabled"; /* This is what I've added to simulate the problem */                                  
};

&thermal_zones {
...
        xo-therm {                                                             
                polling-delay-passive = <0>;                                   
                polling-delay = <0>;                                           
                thermal-sensors = <&pmk8350_adc_tm PMK8350_ADC7_AMUX_THM1_100K_PU>;
                trips {
                    ...
                };
};

With this and reverting my change (which got picked up in internal tree), I can see this.

/sys/class/thermal # cat thermal_zone87/type                                   
xo-therm
                                                                       
/sys/class/thermal # cd thermal_zone87                                         
/sys/devices/virtual/thermal/thermal_zone87 # ls                               
available_policies  cdev5_trip_point  mode               trip_point_4_hyst        
cdev0               cdev5_weight      offset             trip_point_4_temp        
cdev0_trip_point    cdev6             policy             trip_point_4_type        
cdev0_weight        cdev6_trip_point  power              trip_point_5_hyst        
cdev1               cdev6_weight      slope              trip_point_5_temp        
cdev10              cdev7             subsystem          trip_point_5_type        
cdev10_trip_point   cdev7_trip_point  sustainable_power  trip_point_6_hyst        
cdev10_weight       cdev7_weight      temp               trip_point_6_temp        
cdev1_trip_point    cdev8             trip_point_0_hyst  trip_point_6_type        
cdev1_weight        cdev8_trip_point  trip_point_0_temp  trip_point_7_hyst        
cdev2               cdev8_weight      trip_point_0_type  trip_point_7_temp        
cdev2_trip_point    cdev9             trip_point_1_hyst  trip_point_7_type        
cdev2_weight        cdev9_trip_point  trip_point_1_temp  trip_point_8_hyst        
cdev3               cdev9_weight      trip_point_1_type  trip_point_8_temp        
cdev3_trip_point    emul_temp         trip_point_2_hyst  trip_point_8_type        
cdev3_weight        integral_cutoff   trip_point_2_temp  type                  
cdev4               k_d               trip_point_2_type  uevent                
cdev4_trip_point    k_i               trip_point_3_hyst                        
cdev4_weight        k_po              trip_point_3_temp                        
cdev5               k_pu              trip_point_3_type
                        
/sys/devices/virtual/thermal/thermal_zone87 # cat trip_point_0_temp            
125000                                                                         

/sys/devices/virtual/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp  
[  184.290964][  T211] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
[  184.300896][  T211] Mem abort info:                                         
[  184.304486][  T211]   ESR = 0x96000006                                      
[  184.308348][  T211]   EC = 0x25: DABT (current EL), IL = 32 bits            
[  184.314531][  T211]   SET = 0, FnV = 0                                      
[  184.318384][  T211]   EA = 0, S1PTW = 0                                     
[  184.322323][  T211] Data abort info:                                        
[  184.325993][  T211]   ISV = 0, ISS = 0x00000006                             
[  184.330655][  T211]   CM = 0, WnR = 0                                       
[  184.334425][  T211] user pgtable: 4k pages, 39-bit VAs, pgdp=000000081a7a2000
[  184.341750][  T211] [0000000000000020] pgd=000000081a7a7003, p4d=000000081a7a7003, pud=000000081a7a7003, pmd=0000000000000000
[  184.353359][  T211] Internal error: Oops: 96000006 [#1] PREEMPT SMP         
[  184.359797][  T211] Dumping ftrace buffer:                                  
[  184.364001][  T211]    (ftrace buffer empty)

Hope this helps.

>> ...
>> Call trace:
>> of_thermal_set_trip_temp+0x40/0xc4
>> trip_point_temp_store+0xc0/0x1dc
>> dev_attr_store+0x38/0x88
>> sysfs_kf_write+0x64/0xc0
>> kernfs_fop_write_iter+0x108/0x1d0
>> vfs_write+0x2f4/0x368
>> ksys_write+0x7c/0xec
>> __arm64_sys_write+0x20/0x30
>> el0_svc_common.llvm.7279915941325364641+0xbc/0x1bc
>> do_el0_svc+0x28/0xa0
>> el0_svc+0x14/0x24
>> el0_sync_handler+0x88/0xec
>> el0_sync+0x1c0/0x200
>>
>> While at it, fix the possible NULL pointer dereference in other
>> functions as well: of_thermal_get_temp(), of_thermal_set_emul_temp(),
>> of_thermal_get_trend().
>>
>> Cc: [email protected]
>> Suggested-by: David Collins <[email protected]>
>> Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
>> ---
>> Changes for v2:
>> - Added checks in of_thermal_get_temp(), of_thermal_set_emul_temp(), of_thermal_get_trend().
>>
>> drivers/thermal/thermal_of.c | 9 ++++++---
>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
>> index 6379f26..9233f7e 100644
>> --- a/drivers/thermal/thermal_of.c
>> +++ b/drivers/thermal/thermal_of.c
>> @@ -89,7 +89,7 @@ static int of_thermal_get_temp(struct thermal_zone_device *tz,
>> {
>> struct __thermal_zone *data = tz->devdata;
>>
>> - if (!data->ops->get_temp)
>> + if (!data->ops || !data->ops->get_temp)
>> return -EINVAL;
>>
>> return data->ops->get_temp(data->sensor_data, temp);
>> @@ -186,6 +186,9 @@ static int of_thermal_set_emul_temp(struct thermal_zone_device *tz,
>> {
>> struct __thermal_zone *data = tz->devdata;
>>
>> + if (!data->ops || !data->ops->set_emul_temp)
>> + return -EINVAL;
>> +
>> return data->ops->set_emul_temp(data->sensor_data, temp);
>> }
>>
>> @@ -194,7 +197,7 @@ static int of_thermal_get_trend(struct thermal_zone_device *tz, int trip,
>> {
>> struct __thermal_zone *data = tz->devdata;
>>
>> - if (!data->ops->get_trend)
>> + if (!data->ops || !data->ops->get_trend)
>> return -EINVAL;
>>
>> return data->ops->get_trend(data->sensor_data, trip, trend);
>> @@ -301,7 +304,7 @@ static int of_thermal_set_trip_temp(struct thermal_zone_device *tz, int trip,
>> if (trip >= data->ntrips || trip < 0)
>> return -EDOM;
>>
>> - if (data->ops->set_trip_temp) {
>> + if (data->ops && data->ops->set_trip_temp) {
>> int ret;
>>
>> ret = data->ops->set_trip_temp(data->sensor_data, trip, temp);
>>
>

Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 10/8/21 12:50 PM, Subbaraman Narayanamurthy wrote:
> On 10/6/21 4:08 AM, Daniel Lezcano wrote:
>> On 07/09/2021 21:01, Subbaraman Narayanamurthy wrote:
>>> of_parse_thermal_zones() parses the thermal-zones node and registers a
>>> thermal_zone device for each subnode. However, if a thermal zone is
>>> consuming a thermal sensor and that thermal sensor device hasn't probed
>>> yet, an attempt to set trip_point_*_temp for that thermal zone device
>>> can cause a NULL pointer dereference. Fix it.
>>>
>>> console:/sys/class/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp
>>> ...
>>> Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
>> I'm still not convinced by the changes.
>>
>> Could please tell the commit-id where this is happening and give the
>> procedure to reproduce the bug ?
>>
> Here is the commit id where this problem was reported.
> https://android.googlesource.com/kernel/common/+/997d24a932a9b6e2040f39a8dd76e873e6519a1c
>
> BTW, this problem was not 100% reproducible but seems to be a race condition when vendor modules are loaded and thermal HAL or userspace thermal SW is attempting to set trip points on one of the thermal zones and that sensor driver supplying that thermal zone have not completed probing.
>
> I was able to reproduce the problem manually by disabling the pmk8350_adc_tm device in device tree which supplies to some thermal zone devices (e.g. xo-therm below).
>
> pmk8350_adc_tm: adc_tm@3400 {                                  
>     compatible = "qcom,adc-tm7";                           
>     reg = <0x3400>;                                        
>     interrupts = <0x0 0x34 0x0 IRQ_TYPE_EDGE_RISING>;         
>     interrupt-names = "threshold";                         
>     #address-cells = <1>;                                  
>     #size-cells = <0>;                                     
>     #thermal-sensor-cells = <1>;                           
>     status = "disabled"; /* This is what I've added to simulate the problem */                                  
> };
>
> &thermal_zones {
> ...
>         xo-therm {                                                             
>                 polling-delay-passive = <0>;                                   
>                 polling-delay = <0>;                                           
>                 thermal-sensors = <&pmk8350_adc_tm PMK8350_ADC7_AMUX_THM1_100K_PU>;
>                 trips {
>                     ...
>                 };
> };
>
> With this and reverting my change (which got picked up in internal tree), I can see this.
>
> /sys/class/thermal # cat thermal_zone87/type                                   
> xo-therm
>                                                                        
> /sys/class/thermal # cd thermal_zone87                                         
> /sys/devices/virtual/thermal/thermal_zone87 # ls                               
> available_policies  cdev5_trip_point  mode               trip_point_4_hyst        
> cdev0               cdev5_weight      offset             trip_point_4_temp        
> cdev0_trip_point    cdev6             policy             trip_point_4_type        
> cdev0_weight        cdev6_trip_point  power              trip_point_5_hyst        
> cdev1               cdev6_weight      slope              trip_point_5_temp        
> cdev10              cdev7             subsystem          trip_point_5_type        
> cdev10_trip_point   cdev7_trip_point  sustainable_power  trip_point_6_hyst        
> cdev10_weight       cdev7_weight      temp               trip_point_6_temp        
> cdev1_trip_point    cdev8             trip_point_0_hyst  trip_point_6_type        
> cdev1_weight        cdev8_trip_point  trip_point_0_temp  trip_point_7_hyst        
> cdev2               cdev8_weight      trip_point_0_type  trip_point_7_temp        
> cdev2_trip_point    cdev9             trip_point_1_hyst  trip_point_7_type        
> cdev2_weight        cdev9_trip_point  trip_point_1_temp  trip_point_8_hyst        
> cdev3               cdev9_weight      trip_point_1_type  trip_point_8_temp        
> cdev3_trip_point    emul_temp         trip_point_2_hyst  trip_point_8_type        
> cdev3_weight        integral_cutoff   trip_point_2_temp  type                  
> cdev4               k_d               trip_point_2_type  uevent                
> cdev4_trip_point    k_i               trip_point_3_hyst                        
> cdev4_weight        k_po              trip_point_3_temp                        
> cdev5               k_pu              trip_point_3_type
>                         
> /sys/devices/virtual/thermal/thermal_zone87 # cat trip_point_0_temp            
> 125000                                                                         
>
> /sys/devices/virtual/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp  
> [  184.290964][  T211] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
> [  184.300896][  T211] Mem abort info:                                         
> [  184.304486][  T211]   ESR = 0x96000006                                      
> [  184.308348][  T211]   EC = 0x25: DABT (current EL), IL = 32 bits            
> [  184.314531][  T211]   SET = 0, FnV = 0                                      
> [  184.318384][  T211]   EA = 0, S1PTW = 0                                     
> [  184.322323][  T211] Data abort info:                                        
> [  184.325993][  T211]   ISV = 0, ISS = 0x00000006                             
> [  184.330655][  T211]   CM = 0, WnR = 0                                       
> [  184.334425][  T211] user pgtable: 4k pages, 39-bit VAs, pgdp=000000081a7a2000
> [  184.341750][  T211] [0000000000000020] pgd=000000081a7a7003, p4d=000000081a7a7003, pud=000000081a7a7003, pmd=0000000000000000
> [  184.353359][  T211] Internal error: Oops: 96000006 [#1] PREEMPT SMP         
> [  184.359797][  T211] Dumping ftrace buffer:                                  
> [  184.364001][  T211]    (ftrace buffer empty)
>
> Hope this helps.

Hi Daniel,
Have you got a chance to look at this?

Thanks,
Subbaraman

>
>>> ...
>>> Call trace:
>>> of_thermal_set_trip_temp+0x40/0xc4
>>> trip_point_temp_store+0xc0/0x1dc
>>> dev_attr_store+0x38/0x88
>>> sysfs_kf_write+0x64/0xc0
>>> kernfs_fop_write_iter+0x108/0x1d0
>>> vfs_write+0x2f4/0x368
>>> ksys_write+0x7c/0xec
>>> __arm64_sys_write+0x20/0x30
>>> el0_svc_common.llvm.7279915941325364641+0xbc/0x1bc
>>> do_el0_svc+0x28/0xa0
>>> el0_svc+0x14/0x24
>>> el0_sync_handler+0x88/0xec
>>> el0_sync+0x1c0/0x200
>>>
>>> While at it, fix the possible NULL pointer dereference in other
>>> functions as well: of_thermal_get_temp(), of_thermal_set_emul_temp(),
>>> of_thermal_get_trend().
>>>
>>> Cc: [email protected]
>>> Suggested-by: David Collins <[email protected]>
>>> Signed-off-by: Subbaraman Narayanamurthy <[email protected]>
>>> ---
>>> Changes for v2:
>>> - Added checks in of_thermal_get_temp(), of_thermal_set_emul_temp(), of_thermal_get_trend().
>>>
>>> drivers/thermal/thermal_of.c | 9 ++++++---
>>> 1 file changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/thermal/thermal_of.c b/drivers/thermal/thermal_of.c
>>> index 6379f26..9233f7e 100644
>>> --- a/drivers/thermal/thermal_of.c
>>> +++ b/drivers/thermal/thermal_of.c
>>> @@ -89,7 +89,7 @@ static int of_thermal_get_temp(struct thermal_zone_device *tz,
>>> {
>>> struct __thermal_zone *data = tz->devdata;
>>>
>>> - if (!data->ops->get_temp)
>>> + if (!data->ops || !data->ops->get_temp)
>>> return -EINVAL;
>>>
>>> return data->ops->get_temp(data->sensor_data, temp);
>>> @@ -186,6 +186,9 @@ static int of_thermal_set_emul_temp(struct thermal_zone_device *tz,
>>> {
>>> struct __thermal_zone *data = tz->devdata;
>>>
>>> + if (!data->ops || !data->ops->set_emul_temp)
>>> + return -EINVAL;
>>> +
>>> return data->ops->set_emul_temp(data->sensor_data, temp);
>>> }
>>>
>>> @@ -194,7 +197,7 @@ static int of_thermal_get_trend(struct thermal_zone_device *tz, int trip,
>>> {
>>> struct __thermal_zone *data = tz->devdata;
>>>
>>> - if (!data->ops->get_trend)
>>> + if (!data->ops || !data->ops->get_trend)
>>> return -EINVAL;
>>>
>>> return data->ops->get_trend(data->sensor_data, trip, trend);
>>> @@ -301,7 +304,7 @@ static int of_thermal_set_trip_temp(struct thermal_zone_device *tz, int trip,
>>> if (trip >= data->ntrips || trip < 0)
>>> return -EDOM;
>>>
>>> - if (data->ops->set_trip_temp) {
>>> + if (data->ops && data->ops->set_trip_temp) {
>>> int ret;
>>>
>>> ret = data->ops->set_trip_temp(data->sensor_data, trip, temp);
>>>

2021-10-19 07:37:38

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 19/10/2021 03:21, Subbaraman Narayanamurthy wrote:
> On 10/8/21 12:50 PM, Subbaraman Narayanamurthy wrote:
>> On 10/6/21 4:08 AM, Daniel Lezcano wrote:

[ ... ]

>> /sys/devices/virtual/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp  
>> [  184.290964][  T211] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
>> [  184.300896][  T211] Mem abort info:                                         
>> [  184.304486][  T211]   ESR = 0x96000006                                      
>> [  184.308348][  T211]   EC = 0x25: DABT (current EL), IL = 32 bits            
>> [  184.314531][  T211]   SET = 0, FnV = 0                                      
>> [  184.318384][  T211]   EA = 0, S1PTW = 0                                     
>> [  184.322323][  T211] Data abort info:                                        
>> [  184.325993][  T211]   ISV = 0, ISS = 0x00000006                             
>> [  184.330655][  T211]   CM = 0, WnR = 0                                       
>> [  184.334425][  T211] user pgtable: 4k pages, 39-bit VAs, pgdp=000000081a7a2000
>> [  184.341750][  T211] [0000000000000020] pgd=000000081a7a7003, p4d=000000081a7a7003, pud=000000081a7a7003, pmd=0000000000000000
>> [  184.353359][  T211] Internal error: Oops: 96000006 [#1] PREEMPT SMP         
>> [  184.359797][  T211] Dumping ftrace buffer:                                  
>> [  184.364001][  T211]    (ftrace buffer empty)
>>
>> Hope this helps.
>
> Hi Daniel,
> Have you got a chance to look at this?

Hi Subbaraman,

Actually, I think the root problem is the thermal zone is showing up
while there is no sensor associated with it. You can read the
temperature and get a kernel warning also.

That's what should be fixed IMO.


--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

Subject: Re: [PATCH v2] thermal: Fix a NULL pointer dereference

On 10/19/21 12:35 AM, Daniel Lezcano wrote:
> On 19/10/2021 03:21, Subbaraman Narayanamurthy wrote:
>> On 10/8/21 12:50 PM, Subbaraman Narayanamurthy wrote:
>>> On 10/6/21 4:08 AM, Daniel Lezcano wrote:
> [ ... ]
>
>>> /sys/devices/virtual/thermal/thermal_zone87 # echo 120000 > trip_point_0_temp  
>>> [  184.290964][  T211] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
>>> [  184.300896][  T211] Mem abort info:                                         
>>> [  184.304486][  T211]   ESR = 0x96000006                                      
>>> [  184.308348][  T211]   EC = 0x25: DABT (current EL), IL = 32 bits            
>>> [  184.314531][  T211]   SET = 0, FnV = 0                                      
>>> [  184.318384][  T211]   EA = 0, S1PTW = 0                                     
>>> [  184.322323][  T211] Data abort info:                                        
>>> [  184.325993][  T211]   ISV = 0, ISS = 0x00000006                             
>>> [  184.330655][  T211]   CM = 0, WnR = 0                                       
>>> [  184.334425][  T211] user pgtable: 4k pages, 39-bit VAs, pgdp=000000081a7a2000
>>> [  184.341750][  T211] [0000000000000020] pgd=000000081a7a7003, p4d=000000081a7a7003, pud=000000081a7a7003, pmd=0000000000000000
>>> [  184.353359][  T211] Internal error: Oops: 96000006 [#1] PREEMPT SMP         
>>> [  184.359797][  T211] Dumping ftrace buffer:                                  
>>> [  184.364001][  T211]    (ftrace buffer empty)
>>>
>>> Hope this helps.
>> Hi Daniel,
>> Have you got a chance to look at this?
> Hi Subbaraman,
>
> Actually, I think the root problem is the thermal zone is showing up
> while there is no sensor associated with it. You can read the
> temperature and get a kernel warning also.
>
> That's what should be fixed IMO.
>

Hi Daniel,

If I understand your statement correctly, are you saying that a thermal_zone device should be created only after a thermal sensor driver supplying it probes?

From what I can see,

thermal_init()
    --> of_parse_thermal_zones()
            --> thermal_zone_device_register()
                    --> thermal_zone_create_device_groups()
                            <followed by>
                        device_register() with thermal_zone%d


This happens way before a thermal sensor driver probes. So, creating a thermal_zone device only after its thermal sensor probes would require more changes to the framework.

Also, I see a similar NULL check exists in the framework code already.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/thermal/thermal_of.c?h=v5.15-rc6#n103

So, extending the logic for other callsites (below) makes sense to me.

- of_thermal_get_temp()
- of_thermal_set_emul_temp()
- of_thermal_get_trend()
- of_thermal_set_trip_temp()

Thanks,
Subbaraman