2024-05-10 14:21:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: [PATCH v1 1/6] thermal: sysfs: Trigger zone temperature updates on sysfs reads

From: Rafael J. Wysocki <[email protected]>

Reading the zone temperature via sysfs causes the driver callback to
be invoked, but it does not cause the thermal zone object to be updated.

This is problematic if the zone temperature read via sysfs differs from
the temperature value stored in the thermal zone object as it may cause
the kernel and user space to act against each other in some cases.

For this reason, make temp_show() trigger a zone temperature update if
the temperature returned by thermal_zone_get_temp() is different from
the temperature value stored in the thermal zone object.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/thermal/thermal_core.c | 2 +-
drivers/thermal/thermal_sysfs.c | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/thermal/thermal_sysfs.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_sysfs.c
+++ linux-pm/drivers/thermal/thermal_sysfs.c
@@ -42,6 +42,9 @@ temp_show(struct device *dev, struct dev
if (ret)
return ret;

+ if (temperature != READ_ONCE(tz->temperature))
+ thermal_zone_device_update(tz, THERMAL_EVENT_TEMP_SAMPLE);
+
return sprintf(buf, "%d\n", temperature);
}

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -429,7 +429,7 @@ static void update_temperature(struct th
}

tz->last_temperature = tz->temperature;
- tz->temperature = temp;
+ WRITE_ONCE(tz->temperature, temp);

trace_thermal_temperature(tz);






2024-05-13 07:11:27

by Lukasz Luba

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] thermal: sysfs: Trigger zone temperature updates on sysfs reads

Hi Rafael,

On 5/10/24 15:13, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> Reading the zone temperature via sysfs causes the driver callback to
> be invoked, but it does not cause the thermal zone object to be updated.
>
> This is problematic if the zone temperature read via sysfs differs from
> the temperature value stored in the thermal zone object as it may cause
> the kernel and user space to act against each other in some cases.
>
> For this reason, make temp_show() trigger a zone temperature update if
> the temperature returned by thermal_zone_get_temp() is different from
> the temperature value stored in the thermal zone object.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
> drivers/thermal/thermal_core.c | 2 +-
> drivers/thermal/thermal_sysfs.c | 3 +++
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/thermal/thermal_sysfs.c
> ===================================================================
> --- linux-pm.orig/drivers/thermal/thermal_sysfs.c
> +++ linux-pm/drivers/thermal/thermal_sysfs.c
> @@ -42,6 +42,9 @@ temp_show(struct device *dev, struct dev
> if (ret)
> return ret;
>
> + if (temperature != READ_ONCE(tz->temperature))
> + thermal_zone_device_update(tz, THERMAL_EVENT_TEMP_SAMPLE);

That's a bit problematic because it will trigger
governor->manage()

In case of IPA governor we relay on constant polling
period. We estimate the past power usage and current
thermal budget, to derive the next period power budget
for devices. I don't know if the internal PID algorithm
will be resilient enough to compensate this asynchronous
trigger caused from user-space.

We choose the period to be at least 1 frame (e.g. ~16ms)
to have good avg usage of CPUs and GPU. TBH I don't know
what would happen if someone reads the temp after e.g. 1ms
of last IPA trigger, but some devices (e.g. GPU) wasn't
loaded in that last 1ms delta...
I'm a bit more relaxed about CPUs because we use utilization
signal from runqueues (like the TEO util gov). That's a moving
avg signal which should keep some history, like low-pass
filter, so information is more resilient in that case.

Could we think about that IPA constant period usage?
I think I understand the proposal of your patch.
We might add a filter inside IPA to ignore such async
triggers in the .manage() callback.
What do you think?

Regards,
Lukasz

2024-05-16 09:04:49

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] thermal: sysfs: Trigger zone temperature updates on sysfs reads

Hi Lukasz,

On Mon, May 13, 2024 at 9:11 AM Lukasz Luba <[email protected]> wrote:
>
> Hi Rafael,
>
> On 5/10/24 15:13, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >
> > Reading the zone temperature via sysfs causes the driver callback to
> > be invoked, but it does not cause the thermal zone object to be updated.
> >
> > This is problematic if the zone temperature read via sysfs differs from
> > the temperature value stored in the thermal zone object as it may cause
> > the kernel and user space to act against each other in some cases.
> >
> > For this reason, make temp_show() trigger a zone temperature update if
> > the temperature returned by thermal_zone_get_temp() is different from
> > the temperature value stored in the thermal zone object.
> >
> > Signed-off-by: Rafael J. Wysocki <[email protected]>
> > ---
> > drivers/thermal/thermal_core.c | 2 +-
> > drivers/thermal/thermal_sysfs.c | 3 +++
> > 2 files changed, 4 insertions(+), 1 deletion(-)
> >
> > Index: linux-pm/drivers/thermal/thermal_sysfs.c
> > ===================================================================
> > --- linux-pm.orig/drivers/thermal/thermal_sysfs.c
> > +++ linux-pm/drivers/thermal/thermal_sysfs.c
> > @@ -42,6 +42,9 @@ temp_show(struct device *dev, struct dev
> > if (ret)
> > return ret;
> >
> > + if (temperature != READ_ONCE(tz->temperature))
> > + thermal_zone_device_update(tz, THERMAL_EVENT_TEMP_SAMPLE);
>
> That's a bit problematic because it will trigger
> governor->manage()
>
> In case of IPA governor we relay on constant polling
> period. We estimate the past power usage and current
> thermal budget, to derive the next period power budget
> for devices. I don't know if the internal PID algorithm
> will be resilient enough to compensate this asynchronous
> trigger caused from user-space.
>
> We choose the period to be at least 1 frame (e.g. ~16ms)
> to have good avg usage of CPUs and GPU. TBH I don't know
> what would happen if someone reads the temp after e.g. 1ms
> of last IPA trigger, but some devices (e.g. GPU) wasn't
> loaded in that last 1ms delta...
> I'm a bit more relaxed about CPUs because we use utilization
> signal from runqueues (like the TEO util gov). That's a moving
> avg signal which should keep some history, like low-pass
> filter, so information is more resilient in that case.
>
> Could we think about that IPA constant period usage?
> I think I understand the proposal of your patch.
> We might add a filter inside IPA to ignore such async
> triggers in the .manage() callback.
> What do you think?

Thanks for pointing this out.

Actually, the target audience for this change are thermal zones that
are not updated on a regular basis, so what about storing the last
zone temperature update time in the zone object and only running an
update from temp_show() if sufficient time has elapsed since the last
update (say 1 sec)?

2024-05-16 09:47:33

by Daniel Lezcano

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] thermal: sysfs: Trigger zone temperature updates on sysfs reads


Hi Rafael,

On 16/05/2024 11:04, Rafael J. Wysocki wrote:
> Hi Lukasz,
>
> On Mon, May 13, 2024 at 9:11 AM Lukasz Luba <[email protected]> wrote:
>>
>> Hi Rafael,
>>
>> On 5/10/24 15:13, Rafael J. Wysocki wrote:
>>> From: Rafael J. Wysocki <[email protected]>
>>>
>>> Reading the zone temperature via sysfs causes the driver callback to
>>> be invoked, but it does not cause the thermal zone object to be updated.
>>>
>>> This is problematic if the zone temperature read via sysfs differs from
>>> the temperature value stored in the thermal zone object as it may cause
>>> the kernel and user space to act against each other in some cases.
>>>
>>> For this reason, make temp_show() trigger a zone temperature update if
>>> the temperature returned by thermal_zone_get_temp() is different from
>>> the temperature value stored in the thermal zone object.


The hwmon system is doing something similar and I'm not sure we want to
mimic the same behavior.

Just to summarize:

1. There is a polling delay set

This polling delay gives the sampling rate the thermal zone is
monitored. The temperature is updated at each 'delay' tick

2. There is no polling delay set

The system relies on the interrupts to tell when a temperature reaches a
threshold.


On the other side, if the governor is in-kernel, then we should not read
the temperature of the thermal zones because it is the job of the kernel
to do that.

Actually we can assume the temperature information exported to the
userspace is a courtesy of the kernel when this one is managing the
thermal zone.

If there is no governor associated to the thermal zone because there is
no cooling device associated to the defined trip points, then we can
assume it is up to the userspace to monitor the thermal zone.

Furthermore, the hwmon gives the temperature information with the
caching and because of that it is not possible for a thermal daemon to
correctly handle a thermal zone.

That said, I would say we don't want the userspace to influence the
thermal zone monitoring in any manner.

From my POV, we should keep the code as it is.

The description of the change says "it may cause the kernel and user
space to act against each other in some cases". Is it possible to give
the cases when that can happen ?




--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


2024-05-16 10:02:35

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] thermal: sysfs: Trigger zone temperature updates on sysfs reads

Hi Daniel,

On Thu, May 16, 2024 at 11:46 AM Daniel Lezcano
<[email protected]> wrote:
>
>
> Hi Rafael,
>
> On 16/05/2024 11:04, Rafael J. Wysocki wrote:
> > Hi Lukasz,
> >
> > On Mon, May 13, 2024 at 9:11 AM Lukasz Luba <[email protected]> wrote:
> >>
> >> Hi Rafael,
> >>
> >> On 5/10/24 15:13, Rafael J. Wysocki wrote:
> >>> From: Rafael J. Wysocki <[email protected]>
> >>>
> >>> Reading the zone temperature via sysfs causes the driver callback to
> >>> be invoked, but it does not cause the thermal zone object to be updated.
> >>>
> >>> This is problematic if the zone temperature read via sysfs differs from
> >>> the temperature value stored in the thermal zone object as it may cause
> >>> the kernel and user space to act against each other in some cases.
> >>>
> >>> For this reason, make temp_show() trigger a zone temperature update if
> >>> the temperature returned by thermal_zone_get_temp() is different from
> >>> the temperature value stored in the thermal zone object.
>
>
> The hwmon system is doing something similar and I'm not sure we want to
> mimic the same behavior.
>
> Just to summarize:
>
> 1. There is a polling delay set
>
> This polling delay gives the sampling rate the thermal zone is
> monitored. The temperature is updated at each 'delay' tick
>
> 2. There is no polling delay set
>
> The system relies on the interrupts to tell when a temperature reaches a
> threshold.

So this is a bit of a problem if the interrupts are not coming.

At least from the debugfs perspective, there are "mitigation episodes"
that last forever if the zone temperature happens to be above a trip
at the system resume time, say, and is never updated afterward.

> On the other side, if the governor is in-kernel, then we should not read
> the temperature of the thermal zones because it is the job of the kernel
> to do that.
>
> Actually we can assume the temperature information exported to the
> userspace is a courtesy of the kernel when this one is managing the
> thermal zone.

It is not the case right now, though, as sysfs temperature reads
effectively bypass the whole in-kernel thermal management.

> If there is no governor associated to the thermal zone because there is
> no cooling device associated to the defined trip points, then we can
> assume it is up to the userspace to monitor the thermal zone.

Well, in that case trips should not be taken into account, but they are now.

> Furthermore, the hwmon gives the temperature information with the
> caching and because of that it is not possible for a thermal daemon to
> correctly handle a thermal zone.
>
> That said, I would say we don't want the userspace to influence the
> thermal zone monitoring in any manner.
>
> From my POV, we should keep the code as it is.

Well, it is problematic as is.

> The description of the change says "it may cause the kernel and user
> space to act against each other in some cases". Is it possible to give
> the cases when that can happen ?

This is mostly theoretical, but if user space knows that the
temperature has fallen below a trip, but the kernel doesn't know that,
they may decide to put a cooling device into different states.

In any case, the issue at hand is that thermal_zone_device_update()
processes passive and active trip points without attached cooling
devices which doesn't make much sense, so this needs to be addressed
in the first place and the $subject patch may not make any difference
if that happens, so regard it as withdrawn.

Thanks!

2024-05-16 12:57:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH v1 1/6] thermal: sysfs: Trigger zone temperature updates on sysfs reads

On Thu, May 16, 2024 at 12:02 PM Rafael J. Wysocki <[email protected]> wrote:
>
> Hi Daniel,
>
> On Thu, May 16, 2024 at 11:46 AM Daniel Lezcano
> <[email protected]> wrote:
> >
> >
> > Hi Rafael,
> >
> > On 16/05/2024 11:04, Rafael J. Wysocki wrote:
> > > Hi Lukasz,
> > >
> > > On Mon, May 13, 2024 at 9:11 AM Lukasz Luba <[email protected]> wrote:
> > >>
> > >> Hi Rafael,
> > >>
> > >> On 5/10/24 15:13, Rafael J. Wysocki wrote:
> > >>> From: Rafael J. Wysocki <[email protected]>
> > >>>
> > >>> Reading the zone temperature via sysfs causes the driver callback to
> > >>> be invoked, but it does not cause the thermal zone object to be updated.
> > >>>
> > >>> This is problematic if the zone temperature read via sysfs differs from
> > >>> the temperature value stored in the thermal zone object as it may cause
> > >>> the kernel and user space to act against each other in some cases.
> > >>>
> > >>> For this reason, make temp_show() trigger a zone temperature update if
> > >>> the temperature returned by thermal_zone_get_temp() is different from
> > >>> the temperature value stored in the thermal zone object.
> >
> >
> > The hwmon system is doing something similar and I'm not sure we want to
> > mimic the same behavior.
> >
> > Just to summarize:
> >
> > 1. There is a polling delay set
> >
> > This polling delay gives the sampling rate the thermal zone is
> > monitored. The temperature is updated at each 'delay' tick
> >
> > 2. There is no polling delay set
> >
> > The system relies on the interrupts to tell when a temperature reaches a
> > threshold.
>
> So this is a bit of a problem if the interrupts are not coming.
>
> At least from the debugfs perspective, there are "mitigation episodes"
> that last forever if the zone temperature happens to be above a trip
> at the system resume time, say, and is never updated afterward.
>
> > On the other side, if the governor is in-kernel, then we should not read
> > the temperature of the thermal zones because it is the job of the kernel
> > to do that.
> >
> > Actually we can assume the temperature information exported to the
> > userspace is a courtesy of the kernel when this one is managing the
> > thermal zone.
>
> It is not the case right now, though, as sysfs temperature reads
> effectively bypass the whole in-kernel thermal management.
>
> > If there is no governor associated to the thermal zone because there is
> > no cooling device associated to the defined trip points, then we can
> > assume it is up to the userspace to monitor the thermal zone.
>
> Well, in that case trips should not be taken into account, but they are now.

Actually, there is always a governor associated with a thermal zone
AFAICS. It is set before binding any cooling devices to the thermal
zone.

However, there are thermal zones without any cooling devices bound to
them and they have the governor set to user_space, so it appears that
they want to receive trip crossing notifications, but then they don't
have a way to trigger zone temperature updates which is inconsistent.

IMO at least for these thermal zones temp_store() should trigger a
zone temperature update.