2023-12-18 19:29:31

by Rafael J. Wysocki

Subject: [PATCH v1 0/3] thermal: core: Fix issues related to thermal zone resume

Hi Everyone,

This patch series fixes some issues related to the suspend and resume of
thermal zones during system-wide transitions.

Patch [1/3] addresses some existing synchronization issues.

Patch [2/3] is a preliminary change for the last patch.

Patch [3/3] rearranges the thermal zone resume code to resume thermal
zones asynchronously using the existing thermal zone polling support.

Please refer to the individual patch changelogs for details.

Thanks!





2023-12-18 19:29:41

by Rafael J. Wysocki

Subject: [PATCH v1 1/3] thermal: core: Fix thermal zone suspend-resume synchronization

From: Rafael J. Wysocki <[email protected]>

There are 3 synchronization issues with thermal zone suspend-resume
during system-wide transitions:

1. The resume code runs in a PM notifier which is invoked after user
space has been thawed, so it can run concurrently with user space
which can trigger a thermal zone device removal. If that happens,
the thermal zone resume code may use a stale pointer to the next
list element and crash, because it does not hold thermal_list_lock
while walking thermal_tz_list.

2. The thermal zone resume code calls thermal_zone_device_init()
outside the zone lock, so user space or an update triggered by
the platform firmware may see an inconsistent state of a
thermal zone leading to unexpected behavior.

3. Clearing the in_suspend global variable in thermal_pm_notify()
allows __thermal_zone_device_update() to continue for all thermal
zones and it may as well run before the thermal_tz_list walk (or
at any point during the list walk for that matter) and attempt to
operate on a thermal zone that has not been resumed yet. It may
also race destructively with thermal_zone_device_init().
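
Issue 3 can be illustrated with a small single-threaded model. This is
only a sketch: zone_model, in_suspend_model, zone_update_model() and
old_resume_model() are hypothetical names invented for the illustration,
not kernel code, and the "racer" argument merely stands in for an update
that happens to run concurrently right after the global flag is cleared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct thermal_zone_device (not the kernel type). */
struct zone_model {
	bool reinitialized;	/* set once the resume path re-initialized the zone */
	int stale_updates;	/* updates that ran before re-initialization */
};

/* Models the old global flag (an atomic_t in the kernel). */
static bool in_suspend_model;

/* Models the early-return check at the top of the zone update path. */
static void zone_update_model(struct zone_model *z)
{
	if (in_suspend_model)
		return;

	/* The update proceeds even if this zone has not been resumed yet. */
	if (!z->reinitialized)
		z->stale_updates++;
}

/*
 * Models the old resume path: the global flag is cleared first and the
 * zones are re-initialized one by one afterwards. "racer" is a zone that
 * some other context updates in between (assumption for illustration).
 */
static void old_resume_model(struct zone_model *zones, size_t n,
			     struct zone_model *racer)
{
	size_t i;

	assert(zones || n == 0);

	in_suspend_model = false;

	/* Window: every zone is now open to updates, resumed or not. */
	if (racer)
		zone_update_model(racer);

	for (i = 0; i < n; i++) {
		zones[i].reinitialized = true;
		zone_update_model(&zones[i]);
	}
}
```

In this model, an update that lands on the last zone in the window
is counted as stale, because nothing ties the flag to the state of that
particular zone.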

To address these issues, add thermal_list_lock locking to
thermal_pm_notify(), especially around the thermal_tz_list walk,
make it call thermal_zone_device_init() back-to-back with
__thermal_zone_device_update() under the zone lock, and replace
in_suspend with per-zone bool "suspended" indicators set and cleared
under the given zone's lock.
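
The locking scheme described above can be sketched in a single-threaded
userspace model. Everything here is an assumption made for illustration:
lock_model only tracks ownership so that the locking rules can be
asserted (it is not a real mutex), and tz_model, tz_update_locked(),
tz_update(), suspend_all_model() and resume_all_model() are hypothetical
names, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Ownership tracker standing in for a mutex (no real blocking). */
struct lock_model {
	bool held;
};

static void lock_model_acquire(struct lock_model *l)
{
	assert(!l->held);
	l->held = true;
}

static void lock_model_release(struct lock_model *l)
{
	assert(l->held);
	l->held = false;
}

/* Hypothetical stand-in for struct thermal_zone_device. */
struct tz_model {
	struct lock_model lock;
	bool suspended;		/* per-zone indicator, changed under lock */
	bool reinitialized;
	int updates;
	int stale_updates;
};

/* Stands in for thermal_list_lock. */
static struct lock_model list_lock_model;

/* Caller must hold tz->lock. */
static void tz_update_locked(struct tz_model *tz)
{
	assert(tz->lock.held);

	if (tz->suspended)
		return;

	if (!tz->reinitialized)
		tz->stale_updates++;
	tz->updates++;
}

/* Locked wrapper, as used by contexts other than the PM notifier. */
static void tz_update(struct tz_model *tz)
{
	lock_model_acquire(&tz->lock);
	tz_update_locked(tz);
	lock_model_release(&tz->lock);
}

static void suspend_all_model(struct tz_model *zones, size_t n)
{
	size_t i;

	lock_model_acquire(&list_lock_model);
	for (i = 0; i < n; i++) {
		lock_model_acquire(&zones[i].lock);
		zones[i].suspended = true;
		zones[i].reinitialized = false;
		lock_model_release(&zones[i].lock);
	}
	lock_model_release(&list_lock_model);
}

/* "racer" models an update attempted by another context during the walk. */
static void resume_all_model(struct tz_model *zones, size_t n,
			     struct tz_model *racer)
{
	size_t i;

	lock_model_acquire(&list_lock_model);
	for (i = 0; i < n; i++) {
		if (racer)
			tz_update(racer);

		lock_model_acquire(&zones[i].lock);
		zones[i].suspended = false;
		zones[i].reinitialized = true;	/* init and update back-to-back */
		tz_update_locked(&zones[i]);
		lock_model_release(&zones[i].lock);
	}
	lock_model_release(&list_lock_model);
}
```

In this model, an update attempted on a zone that has not been resumed
yet bails out on that zone's own flag, so it never observes the zone
before re-initialization.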

Link: https://lore.kernel.org/linux-pm/[email protected]/
Reported-by: Bo Ye <[email protected]>
Signed-off-by: Rafael J. Wysocki <[email protected]>
---
drivers/thermal/thermal_core.c | 30 +++++++++++++++++++++++-------
include/linux/thermal.h | 2 ++
2 files changed, 25 insertions(+), 7 deletions(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -37,8 +37,6 @@ static LIST_HEAD(thermal_governor_list);
 static DEFINE_MUTEX(thermal_list_lock);
 static DEFINE_MUTEX(thermal_governor_lock);

-static atomic_t in_suspend;
-
 static struct thermal_governor *def_governor;

 /*
@@ -427,7 +425,7 @@ void __thermal_zone_device_update(struct
 {
 	struct thermal_trip *trip;

-	if (atomic_read(&in_suspend))
+	if (tz->suspended)
 		return;

 	if (!thermal_zone_device_is_enabled(tz))
@@ -1538,17 +1536,35 @@ static int thermal_pm_notify(struct noti
 	case PM_HIBERNATION_PREPARE:
 	case PM_RESTORE_PREPARE:
 	case PM_SUSPEND_PREPARE:
-		atomic_set(&in_suspend, 1);
+		mutex_lock(&thermal_list_lock);
+
+		list_for_each_entry(tz, &thermal_tz_list, node) {
+			mutex_lock(&tz->lock);
+
+			tz->suspended = true;
+
+			mutex_unlock(&tz->lock);
+		}
+
+		mutex_unlock(&thermal_list_lock);
 		break;
 	case PM_POST_HIBERNATION:
 	case PM_POST_RESTORE:
 	case PM_POST_SUSPEND:
-		atomic_set(&in_suspend, 0);
+		mutex_lock(&thermal_list_lock);
+
 		list_for_each_entry(tz, &thermal_tz_list, node) {
+			mutex_lock(&tz->lock);
+
+			tz->suspended = false;
+
 			thermal_zone_device_init(tz);
-			thermal_zone_device_update(tz,
-						   THERMAL_EVENT_UNSPECIFIED);
+			__thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
+
+			mutex_unlock(&tz->lock);
 		}
+
+		mutex_unlock(&thermal_list_lock);
 		break;
 	default:
 		break;
Index: linux-pm/include/linux/thermal.h
===================================================================
--- linux-pm.orig/include/linux/thermal.h
+++ linux-pm/include/linux/thermal.h
@@ -152,6 +152,7 @@ struct thermal_cooling_device {
  * @node: node in thermal_tz_list (in thermal_core.c)
  * @poll_queue: delayed work for polling
  * @notify_event: Last notification event
+ * @suspended: thermal zone suspend indicator
  */
 struct thermal_zone_device {
 	int id;
@@ -185,6 +186,7 @@ struct thermal_zone_device {
 	struct list_head node;
 	struct delayed_work poll_queue;
 	enum thermal_notify_event notify_event;
+	bool suspended;
 };

 /**




2023-12-28 13:23:36

by Rafael J. Wysocki

Subject: Re: [PATCH v1 0/3] thermal: core: Fix issues related to thermal zone resume

On Mon, Dec 18, 2023 at 8:28 PM Rafael J. Wysocki <[email protected]> wrote:
>
> Hi Everyone,
>
> This patch series fixes some issues related to the suspend and resume of
> thermal zones during system-wide transitions.
>
> Patch [1/3] addresses some existing synchronization issues.
>
> Patch [2/3] is a preliminary change for the last patch.
>
> Patch [3/3] rearranges the thermal zone resume code to resume thermal
> zones asynchronously using the existing thermal zone polling support.
>
> Please refer to the individual patch changelogs for details.

These are fixes, so it would be good to get them into 6.8.

Since I don't see any objections to them, I'm adding them to the
bleeding-edge branch and will move them to linux-next next week.

Thanks!