2022-10-04 03:45:58

by Guenter Roeck

Subject: [RFC/RFT PATCH resend] thermal: Protect thermal device operations against thermal device removal

A call to thermal_zone_device_unregister() results in thermal device
removal. While the thermal device itself is reference counted and
protected against removal of its associated data structures, the
thermal device operations are owned by the calling code and unprotected.
This may result in crashes such as

BUG: unable to handle page fault for address: ffffffffc04ef420
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 5d60e067 P4D 5d60e067 PUD 5d610067 PMD 110197067 PTE 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 3209 Comm: cat Tainted: G W 5.10.136-19389-g615abc6eb807 #1 02df41ac0b12f3a64f4b34245188d8875bb3bce1
Hardware name: Google Coral/Coral, BIOS Google_Coral.10068.92.0 11/27/2018
RIP: 0010:thermal_zone_get_temp+0x26/0x73
Code: 89 c3 eb d3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 53 48 85 ff 74 50 48 89 fb 48 81 ff 00 f0 ff ff 77 44 48 8b 83 98 03 00 00 <48> 83 78 10 00 74 36 49 89 f6 4c 8d bb d8 03 00 00 4c 89 ff e8 9f
RSP: 0018:ffffb3758138fd38 EFLAGS: 00010287
RAX: ffffffffc04ef410 RBX: ffff98f14d7fb000 RCX: 0000000000000000
RDX: ffff98f17cf90000 RSI: ffffb3758138fd64 RDI: ffff98f14d7fb000
RBP: ffffb3758138fd50 R08: 0000000000001000 R09: ffff98f17cf90000
R10: 0000000000000000 R11: ffffffff8dacad28 R12: 0000000000001000
R13: ffff98f1793a7d80 R14: ffff98f143231708 R15: ffff98f14d7fb018
FS: 00007ec166097800(0000) GS:ffff98f1bbd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc04ef420 CR3: 000000010ee9a000 CR4: 00000000003506e0
Call Trace:
temp_show+0x31/0x68
dev_attr_show+0x1d/0x4f
sysfs_kf_seq_show+0x92/0x107
seq_read_iter+0xf5/0x3f2
vfs_read+0x205/0x379
__x64_sys_read+0x7c/0xe2
do_syscall_64+0x43/0x55
entry_SYSCALL_64_after_hwframe+0x61/0xc6

if a thermal device is removed while accesses to its device attributes
are ongoing.

Use the thermal device mutex to protect device operations. Clear the
device operations pointer in thermal_zone_device_unregister() under
protection of this mutex, and only access it while the mutex is held.
Flatten and simplify device mutex operations to only acquire the mutex
once and hold it instead of acquiring and releasing it several times
during thermal operations. Only validate parameters once at module entry
points after acquiring the mutex. Execute governor operations under mutex
instead of expecting governors to acquire and release it.

Signed-off-by: Guenter Roeck <[email protected]>
---
RFC/RFT:
This patch ended up being substantially more complex than I initially
thought it would be. It should only be applied after extensive review
and testing (I don't think it is 6.1 material).
I tested it as much as I could with Chromebooks using chromeos-5.10
and chromeos-5.15. Plan is to apply it to those branches as well as to
older ChromeOS kernel branches, but I would like to get some level of
confidence that this is the right approach before doing that.

drivers/thermal/gov_bang_bang.c | 8 --
drivers/thermal/gov_fair_share.c | 3 -
drivers/thermal/gov_power_allocator.c | 19 +---
drivers/thermal/gov_step_wise.c | 8 --
drivers/thermal/gov_user_space.c | 2 -
drivers/thermal/thermal_core.c | 97 +++++++++++--------
drivers/thermal/thermal_core.h | 3 +
drivers/thermal/thermal_helpers.c | 56 ++++++++---
drivers/thermal/thermal_hwmon.c | 15 ++-
drivers/thermal/thermal_netlink.c | 5 +
drivers/thermal/thermal_sysfs.c | 134 ++++++++++++++++++++------
11 files changed, 223 insertions(+), 127 deletions(-)

diff --git a/drivers/thermal/gov_bang_bang.c b/drivers/thermal/gov_bang_bang.c
index 991a1c54296d..a1b63cd54e1a 100644
--- a/drivers/thermal/gov_bang_bang.c
+++ b/drivers/thermal/gov_bang_bang.c
@@ -31,8 +31,6 @@ static void thermal_zone_trip_update(struct thermal_zone_device *tz, int trip)
trip, trip_temp, tz->temperature,
trip_hyst);

- mutex_lock(&tz->lock);
-
list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
if (instance->trip != trip)
continue;
@@ -65,8 +63,6 @@ static void thermal_zone_trip_update(struct thermal_zone_device *tz, int trip)
instance->cdev->updated = false; /* cdev needs update */
mutex_unlock(&instance->cdev->lock);
}
-
- mutex_unlock(&tz->lock);
}

/**
@@ -102,13 +98,9 @@ static int bang_bang_control(struct thermal_zone_device *tz, int trip)

thermal_zone_trip_update(tz, trip);

- mutex_lock(&tz->lock);
-
list_for_each_entry(instance, &tz->thermal_instances, tz_node)
thermal_cdev_update(instance->cdev);

- mutex_unlock(&tz->lock);
-
return 0;
}

diff --git a/drivers/thermal/gov_fair_share.c b/drivers/thermal/gov_fair_share.c
index 6a2abcfc648f..12ab7171cc4d 100644
--- a/drivers/thermal/gov_fair_share.c
+++ b/drivers/thermal/gov_fair_share.c
@@ -82,8 +82,6 @@ static int fair_share_throttle(struct thermal_zone_device *tz, int trip)
int total_instance = 0;
int cur_trip_level = get_trip_level(tz);

- mutex_lock(&tz->lock);
-
list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
if (instance->trip != trip)
continue;
@@ -112,7 +110,6 @@ static int fair_share_throttle(struct thermal_zone_device *tz, int trip)
mutex_unlock(&cdev->lock);
}

- mutex_unlock(&tz->lock);
return 0;
}

diff --git a/drivers/thermal/gov_power_allocator.c b/drivers/thermal/gov_power_allocator.c
index 1d5052470967..84e02664bc7b 100644
--- a/drivers/thermal/gov_power_allocator.c
+++ b/drivers/thermal/gov_power_allocator.c
@@ -392,8 +392,6 @@ static int allocate_power(struct thermal_zone_device *tz,
int i, num_actors, total_weight, ret = 0;
int trip_max_desired_temperature = params->trip_max_desired_temperature;

- mutex_lock(&tz->lock);
-
num_actors = 0;
total_weight = 0;
list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
@@ -404,10 +402,8 @@ static int allocate_power(struct thermal_zone_device *tz,
}
}

- if (!num_actors) {
- ret = -ENODEV;
- goto unlock;
- }
+ if (!num_actors)
+ return -ENODEV;

/*
* We need to allocate five arrays of the same size:
@@ -421,10 +417,8 @@ static int allocate_power(struct thermal_zone_device *tz,
BUILD_BUG_ON(sizeof(*req_power) != sizeof(*extra_actor_power));
BUILD_BUG_ON(sizeof(*req_power) != sizeof(*weighted_req_power));
req_power = kcalloc(num_actors * 5, sizeof(*req_power), GFP_KERNEL);
- if (!req_power) {
- ret = -ENOMEM;
- goto unlock;
- }
+ if (!req_power)
+ return -ENOMEM;

max_power = &req_power[num_actors];
granted_power = &req_power[2 * num_actors];
@@ -496,9 +490,6 @@ static int allocate_power(struct thermal_zone_device *tz,
control_temp - tz->temperature);

kfree(req_power);
-unlock:
- mutex_unlock(&tz->lock);
-
return ret;
}

@@ -576,7 +567,6 @@ static void allow_maximum_power(struct thermal_zone_device *tz, bool update)
struct power_allocator_params *params = tz->governor_data;
u32 req_power;

- mutex_lock(&tz->lock);
list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
struct thermal_cooling_device *cdev = instance->cdev;

@@ -598,7 +588,6 @@ static void allow_maximum_power(struct thermal_zone_device *tz, bool update)

mutex_unlock(&instance->cdev->lock);
}
- mutex_unlock(&tz->lock);
}

/**
diff --git a/drivers/thermal/gov_step_wise.c b/drivers/thermal/gov_step_wise.c
index 9729b46d0258..d18ab7ef6262 100644
--- a/drivers/thermal/gov_step_wise.c
+++ b/drivers/thermal/gov_step_wise.c
@@ -117,8 +117,6 @@ static void thermal_zone_trip_update(struct thermal_zone_device *tz, int trip)
dev_dbg(&tz->device, "Trip%d[type=%d,temp=%d]:trend=%d,throttle=%d\n",
trip, trip_type, trip_temp, trend, throttle);

- mutex_lock(&tz->lock);
-
list_for_each_entry(instance, &tz->thermal_instances, tz_node) {
if (instance->trip != trip)
continue;
@@ -145,8 +143,6 @@ static void thermal_zone_trip_update(struct thermal_zone_device *tz, int trip)
instance->cdev->updated = false; /* cdev needs update */
mutex_unlock(&instance->cdev->lock);
}
-
- mutex_unlock(&tz->lock);
}

/**
@@ -166,13 +162,9 @@ static int step_wise_throttle(struct thermal_zone_device *tz, int trip)

thermal_zone_trip_update(tz, trip);

- mutex_lock(&tz->lock);
-
list_for_each_entry(instance, &tz->thermal_instances, tz_node)
thermal_cdev_update(instance->cdev);

- mutex_unlock(&tz->lock);
-
return 0;
}

diff --git a/drivers/thermal/gov_user_space.c b/drivers/thermal/gov_user_space.c
index a62a4e90bd3f..0e7750273679 100644
--- a/drivers/thermal/gov_user_space.c
+++ b/drivers/thermal/gov_user_space.c
@@ -34,7 +34,6 @@ static int notify_user_space(struct thermal_zone_device *tz, int trip)
char *thermal_prop[5];
int i;

- mutex_lock(&tz->lock);
thermal_prop[0] = kasprintf(GFP_KERNEL, "NAME=%s", tz->type);
thermal_prop[1] = kasprintf(GFP_KERNEL, "TEMP=%d", tz->temperature);
thermal_prop[2] = kasprintf(GFP_KERNEL, "TRIP=%d", trip);
@@ -43,7 +42,6 @@ static int notify_user_space(struct thermal_zone_device *tz, int trip)
kobject_uevent_env(&tz->device.kobj, KOBJ_CHANGE, thermal_prop);
for (i = 0; i < 4; ++i)
kfree(thermal_prop[i]);
- mutex_unlock(&tz->lock);
return 0;
}

diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 50d50cec7774..ee41f02142a7 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -295,9 +295,14 @@ static void thermal_zone_device_set_polling(struct thermal_zone_device *tz,
cancel_delayed_work(&tz->poll_queue);
}

+static int __thermal_zone_device_is_enabled(struct thermal_zone_device *tz)
+{
+ return tz->ops && tz->mode == THERMAL_DEVICE_ENABLED;
+}
+
static inline bool should_stop_polling(struct thermal_zone_device *tz)
{
- return !thermal_zone_device_is_enabled(tz);
+ return !__thermal_zone_device_is_enabled(tz);
}

static void monitor_thermal_zone(struct thermal_zone_device *tz)
@@ -306,16 +311,12 @@ static void monitor_thermal_zone(struct thermal_zone_device *tz)

stop = should_stop_polling(tz);

- mutex_lock(&tz->lock);
-
if (!stop && tz->passive)
thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
else if (!stop && tz->polling_delay_jiffies)
thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
else
thermal_zone_device_set_polling(tz, 0);
-
- mutex_unlock(&tz->lock);
}

static void handle_non_critical_trips(struct thermal_zone_device *tz, int trip)
@@ -394,7 +395,7 @@ static void update_temperature(struct thermal_zone_device *tz)
{
int temp, ret;

- ret = thermal_zone_get_temp(tz, &temp);
+ ret = __thermal_zone_get_temp(tz, &temp);
if (ret) {
if (ret != -EAGAIN)
dev_warn(&tz->device,
@@ -403,10 +404,8 @@ static void update_temperature(struct thermal_zone_device *tz)
return;
}

- mutex_lock(&tz->lock);
tz->last_temperature = tz->temperature;
tz->temperature = temp;
- mutex_unlock(&tz->lock);

trace_thermal_temperature(tz);

@@ -423,6 +422,31 @@ static void thermal_zone_device_init(struct thermal_zone_device *tz)
pos->initialized = false;
}

+void __thermal_zone_device_update(struct thermal_zone_device *tz,
+ enum thermal_notify_event event)
+{
+ int count;
+
+ if (should_stop_polling(tz))
+ return;
+
+ if (atomic_read(&in_suspend))
+ return;
+
+ if (WARN_ONCE(!tz->ops->get_temp,
+ "'%s' must not be called without 'get_temp' ops set\n", __func__))
+ return;
+
+ update_temperature(tz);
+
+ thermal_zone_set_trips(tz);
+
+ tz->notify_event = event;
+
+ for (count = 0; count < tz->num_trips; count++)
+ handle_thermal_trip(tz, count);
+}
+
static int thermal_zone_device_set_mode(struct thermal_zone_device *tz,
enum thermal_device_mode mode)
{
@@ -431,10 +455,12 @@ static int thermal_zone_device_set_mode(struct thermal_zone_device *tz,
mutex_lock(&tz->lock);

/* do nothing if mode isn't changing */
- if (mode == tz->mode) {
- mutex_unlock(&tz->lock);
+ if (mode == tz->mode)
+ goto unlock;

- return ret;
+ if (!tz->ops) {
+ ret = -EINVAL;
+ goto unlock;
}

if (tz->ops->change_mode)
@@ -443,15 +469,15 @@ static int thermal_zone_device_set_mode(struct thermal_zone_device *tz,
if (!ret)
tz->mode = mode;

- mutex_unlock(&tz->lock);
-
- thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
+ __thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);

if (mode == THERMAL_DEVICE_ENABLED)
thermal_notify_tz_enable(tz->id);
else
thermal_notify_tz_disable(tz->id);

+unlock:
+ mutex_unlock(&tz->lock);
return ret;
}

@@ -469,40 +495,21 @@ EXPORT_SYMBOL_GPL(thermal_zone_device_disable);

int thermal_zone_device_is_enabled(struct thermal_zone_device *tz)
{
- enum thermal_device_mode mode;
+ bool enabled;

mutex_lock(&tz->lock);
-
- mode = tz->mode;
-
+ enabled = __thermal_zone_device_is_enabled(tz);
mutex_unlock(&tz->lock);

- return mode == THERMAL_DEVICE_ENABLED;
+ return enabled;
}

void thermal_zone_device_update(struct thermal_zone_device *tz,
enum thermal_notify_event event)
{
- int count;
-
- if (should_stop_polling(tz))
- return;
-
- if (atomic_read(&in_suspend))
- return;
-
- if (WARN_ONCE(!tz->ops->get_temp, "'%s' must not be called without "
- "'get_temp' ops set\n", __func__))
- return;
-
- update_temperature(tz);
-
- thermal_zone_set_trips(tz);
-
- tz->notify_event = event;
-
- for (count = 0; count < tz->num_trips; count++)
- handle_thermal_trip(tz, count);
+ mutex_lock(&tz->lock);
+ __thermal_zone_device_update(tz, event);
+ mutex_unlock(&tz->lock);
}
EXPORT_SYMBOL_GPL(thermal_zone_device_update);

@@ -779,6 +786,7 @@ static void thermal_release(struct device *dev)
sizeof("thermal_zone") - 1)) {
tz = to_thermal_zone(dev);
thermal_zone_destroy_device_groups(tz);
+ mutex_destroy(&tz->lock);
kfree(tz);
} else if (!strncmp(dev_name(dev), "cooling_device",
sizeof("cooling_device") - 1)) {
@@ -1397,10 +1405,19 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz)
thermal_remove_hwmon_sysfs(tz);
ida_free(&thermal_tz_ida, tz->id);
ida_destroy(&tz->ida);
- mutex_destroy(&tz->lock);
device_unregister(&tz->device);

thermal_notify_tz_delete(tz_id);
+
+ /*
+ * tz->ops is not reference counted, and the caller will likely
+ * release it. Make sure it is no longer used after this call returns.
+ * Note that we can not call mutex_destroy() on the mutex since the
+ * tz device may still be in use.
+ */
+ mutex_lock(&tz->lock);
+ tz->ops = NULL;
+ mutex_unlock(&tz->lock);
}
EXPORT_SYMBOL_GPL(thermal_zone_device_unregister);

diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
index c991bb290512..eb8a13ebd676 100644
--- a/drivers/thermal/thermal_core.h
+++ b/drivers/thermal/thermal_core.h
@@ -109,9 +109,12 @@ int thermal_register_governor(struct thermal_governor *);
void thermal_unregister_governor(struct thermal_governor *);
int thermal_zone_device_set_policy(struct thermal_zone_device *, char *);
int thermal_build_list_of_policies(char *buf);
+void __thermal_zone_device_update(struct thermal_zone_device *tz,
+ enum thermal_notify_event event);

/* Helpers */
void thermal_zone_set_trips(struct thermal_zone_device *tz);
+int __thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp);

/* sysfs I/F */
int thermal_zone_create_device_groups(struct thermal_zone_device *, int);
diff --git a/drivers/thermal/thermal_helpers.c b/drivers/thermal/thermal_helpers.c
index 690890f054a3..ad1720278b5b 100644
--- a/drivers/thermal/thermal_helpers.c
+++ b/drivers/thermal/thermal_helpers.c
@@ -65,27 +65,26 @@ get_thermal_instance(struct thermal_zone_device *tz,
EXPORT_SYMBOL(get_thermal_instance);

/**
- * thermal_zone_get_temp() - returns the temperature of a thermal zone
+ * __thermal_zone_get_temp() - returns the temperature of a thermal zone
* @tz: a valid pointer to a struct thermal_zone_device
* @temp: a valid pointer to where to store the resulting temperature.
*
* When a valid thermal zone reference is passed, it will fetch its
* temperature and fill @temp.
*
+ * Both tz and tz->ops must be valid pointers when calling this function,
+ * and tz->ops->get_temp must be set.
+ * The function must be called under tz->lock.
+ *
* Return: On success returns 0, an error code otherwise
*/
-int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
+int __thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
{
- int ret = -EINVAL;
+ int ret;
int count;
int crit_temp = INT_MAX;
enum thermal_trip_type type;

- if (!tz || IS_ERR(tz) || !tz->ops->get_temp)
- goto exit;
-
- mutex_lock(&tz->lock);
-
ret = tz->ops->get_temp(tz, temp);

if (IS_ENABLED(CONFIG_THERMAL_EMULATION) && tz->emul_temperature) {
@@ -107,8 +106,35 @@ int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
*temp = tz->emul_temperature;
}

+ return ret;
+}
+
+/**
+ * thermal_zone_get_temp() - returns the temperature of a thermal zone
+ * @tz: a valid pointer to a struct thermal_zone_device
+ * @temp: a valid pointer to where to store the resulting temperature.
+ *
+ * When a valid thermal zone reference is passed, it will fetch its
+ * temperature and fill @temp.
+ *
+ * Return: On success returns 0, an error code otherwise
+ */
+int thermal_zone_get_temp(struct thermal_zone_device *tz, int *temp)
+{
+ int ret = -EINVAL;
+
+ if (!tz || IS_ERR(tz))
+ return ret;
+
+ mutex_lock(&tz->lock);
+
+ if (!tz->ops || !tz->ops->get_temp)
+ goto unlock;
+
+ ret = __thermal_zone_get_temp(tz, temp);
+
+unlock:
mutex_unlock(&tz->lock);
-exit:
return ret;
}
EXPORT_SYMBOL_GPL(thermal_zone_get_temp);
@@ -123,6 +149,9 @@ EXPORT_SYMBOL_GPL(thermal_zone_get_temp);
* driver to let it set its own notification mechanism (usually an
* interrupt).
*
+ * This function must be called under tz->lock. Both tz and tz->ops
+ * must be valid pointers.
+ *
* It does not return a value
*/
void thermal_zone_set_trips(struct thermal_zone_device *tz)
@@ -132,10 +161,8 @@ void thermal_zone_set_trips(struct thermal_zone_device *tz)
int trip_temp, hysteresis;
int i, ret;

- mutex_lock(&tz->lock);
-
if (!tz->ops->set_trips || !tz->ops->get_trip_hyst)
- goto exit;
+ return;

for (i = 0; i < tz->num_trips; i++) {
int trip_low;
@@ -154,7 +181,7 @@ void thermal_zone_set_trips(struct thermal_zone_device *tz)

/* No need to change trip points */
if (tz->prev_low_trip == low && tz->prev_high_trip == high)
- goto exit;
+ return;

tz->prev_low_trip = low;
tz->prev_high_trip = high;
@@ -169,9 +196,6 @@ void thermal_zone_set_trips(struct thermal_zone_device *tz)
ret = tz->ops->set_trips(tz, low, high);
if (ret)
dev_err(&tz->device, "Failed to set trips: %d\n", ret);
-
-exit:
- mutex_unlock(&tz->lock);
}

static void thermal_cdev_set_cur_state(struct thermal_cooling_device *cdev,
diff --git a/drivers/thermal/thermal_hwmon.c b/drivers/thermal/thermal_hwmon.c
index 09e49ec8b6f4..98086c898bba 100644
--- a/drivers/thermal/thermal_hwmon.c
+++ b/drivers/thermal/thermal_hwmon.c
@@ -75,13 +75,22 @@ temp_crit_show(struct device *dev, struct device_attribute *attr, char *buf)
temp_crit);
struct thermal_zone_device *tz = temp->tz;
int temperature;
- int ret;
+ int ret = -EINVAL;
+
+ mutex_lock(&tz->lock);
+
+ if (!tz->ops)
+ goto unlock;

ret = tz->ops->get_crit_temp(tz, &temperature);
if (ret)
- return ret;
+ goto unlock;

- return sprintf(buf, "%d\n", temperature);
+ ret = sprintf(buf, "%d\n", temperature);
+
+unlock:
+ mutex_unlock(&tz->lock);
+ return ret;
}


diff --git a/drivers/thermal/thermal_netlink.c b/drivers/thermal/thermal_netlink.c
index 050d243a5fa1..90fb9d9fda80 100644
--- a/drivers/thermal/thermal_netlink.c
+++ b/drivers/thermal/thermal_netlink.c
@@ -469,6 +469,11 @@ static int thermal_genl_cmd_tz_get_trip(struct param *p)

mutex_lock(&tz->lock);

+ if (!tz->ops) {
+ mutex_unlock(&tz->lock);
+ return -EINVAL;
+ }
+
for (i = 0; i < tz->num_trips; i++) {

enum thermal_trip_type type;
diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
index 3a8d6e747c25..b211b983acbc 100644
--- a/drivers/thermal/thermal_sysfs.c
+++ b/drivers/thermal/thermal_sysfs.c
@@ -82,28 +82,46 @@ trip_point_type_show(struct device *dev, struct device_attribute *attr,
enum thermal_trip_type type;
int trip, result;

- if (!tz->ops->get_trip_type)
- return -EPERM;
-
if (sscanf(attr->attr.name, "trip_point_%d_type", &trip) != 1)
return -EINVAL;

+ mutex_lock(&tz->lock);
+
+ if (!tz->ops) {
+ result = -EINVAL;
+ goto unlock;
+ }
+
+ if (!tz->ops->get_trip_type) {
+ result = -EPERM;
+ goto unlock;
+ }
+
result = tz->ops->get_trip_type(tz, trip, &type);
if (result)
- return result;
+ goto unlock;

switch (type) {
case THERMAL_TRIP_CRITICAL:
- return sprintf(buf, "critical\n");
+ result = sprintf(buf, "critical\n");
+ break;
case THERMAL_TRIP_HOT:
- return sprintf(buf, "hot\n");
+ result = sprintf(buf, "hot\n");
+ break;
case THERMAL_TRIP_PASSIVE:
- return sprintf(buf, "passive\n");
+ result = sprintf(buf, "passive\n");
+ break;
case THERMAL_TRIP_ACTIVE:
- return sprintf(buf, "active\n");
+ result = sprintf(buf, "active\n");
+ break;
default:
- return sprintf(buf, "unknown\n");
+ result = sprintf(buf, "unknown\n");
+ break;
}
+
+unlock:
+ mutex_unlock(&tz->lock);
+ return result;
}

static ssize_t
@@ -115,34 +133,45 @@ trip_point_temp_store(struct device *dev, struct device_attribute *attr,
int temperature, hyst = 0;
enum thermal_trip_type type;

- if (!tz->ops->set_trip_temp)
- return -EPERM;
-
if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip) != 1)
return -EINVAL;

if (kstrtoint(buf, 10, &temperature))
return -EINVAL;

+ mutex_lock(&tz->lock);
+
+ if (!tz->ops) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (!tz->ops->set_trip_temp) {
+ ret = -EPERM;
+ goto unlock;
+ }
+
ret = tz->ops->set_trip_temp(tz, trip, temperature);
if (ret)
- return ret;
+ goto unlock;

if (tz->ops->get_trip_hyst) {
ret = tz->ops->get_trip_hyst(tz, trip, &hyst);
if (ret)
- return ret;
+ goto unlock;
}

ret = tz->ops->get_trip_type(tz, trip, &type);
if (ret)
- return ret;
+ goto unlock;

thermal_notify_tz_trip_change(tz->id, trip, type, temperature, hyst);

- thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
+ __thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);

- return count;
+unlock:
+ mutex_unlock(&tz->lock);
+ return ret ? ret : count;
}

static ssize_t
@@ -153,18 +182,30 @@ trip_point_temp_show(struct device *dev, struct device_attribute *attr,
int trip, ret;
int temperature;

- if (!tz->ops->get_trip_temp)
- return -EPERM;
-
if (sscanf(attr->attr.name, "trip_point_%d_temp", &trip) != 1)
return -EINVAL;

- ret = tz->ops->get_trip_temp(tz, trip, &temperature);
+ mutex_lock(&tz->lock);
+
+ if (!tz->ops) {
+ ret = -EINVAL;
+ goto unlock;
+ }

+ if (!tz->ops->get_trip_temp) {
+ ret = -EPERM;
+ goto unlock;
+ }
+
+ ret = tz->ops->get_trip_temp(tz, trip, &temperature);
if (ret)
- return ret;
+ goto unlock;

- return sprintf(buf, "%d\n", temperature);
+ ret = sprintf(buf, "%d\n", temperature);
+
+unlock:
+ mutex_unlock(&tz->lock);
+ return ret;
}

static ssize_t
@@ -175,15 +216,24 @@ trip_point_hyst_store(struct device *dev, struct device_attribute *attr,
int trip, ret;
int temperature;

- if (!tz->ops->set_trip_hyst)
- return -EPERM;
-
if (sscanf(attr->attr.name, "trip_point_%d_hyst", &trip) != 1)
return -EINVAL;

if (kstrtoint(buf, 10, &temperature))
return -EINVAL;

+ mutex_lock(&tz->lock);
+
+ if (!tz->ops) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (!tz->ops->set_trip_hyst) {
+ ret = -EPERM;
+ goto unlock;
+ }
+
/*
* We are not doing any check on the 'temperature' value
* here. The driver implementing 'set_trip_hyst' has to
@@ -194,6 +244,8 @@ trip_point_hyst_store(struct device *dev, struct device_attribute *attr,
if (!ret)
thermal_zone_set_trips(tz);

+unlock:
+ mutex_unlock(&tz->lock);
return ret ? ret : count;
}

@@ -205,14 +257,25 @@ trip_point_hyst_show(struct device *dev, struct device_attribute *attr,
int trip, ret;
int temperature;

- if (!tz->ops->get_trip_hyst)
- return -EPERM;
-
if (sscanf(attr->attr.name, "trip_point_%d_hyst", &trip) != 1)
return -EINVAL;

+ mutex_lock(&tz->lock);
+
+ if (!tz->ops) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
+ if (!tz->ops->get_trip_hyst) {
+ ret = -EPERM;
+ goto unlock;
+ }
+
ret = tz->ops->get_trip_hyst(tz, trip, &temperature);

+unlock:
+ mutex_unlock(&tz->lock);
return ret ? ret : sprintf(buf, "%d\n", temperature);
}

@@ -260,17 +323,24 @@ emul_temp_store(struct device *dev, struct device_attribute *attr,
if (kstrtoint(buf, 10, &temperature))
return -EINVAL;

+ mutex_lock(&tz->lock);
+
+ if (!tz->ops) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+
if (!tz->ops->set_emul_temp) {
- mutex_lock(&tz->lock);
tz->emul_temperature = temperature;
- mutex_unlock(&tz->lock);
} else {
ret = tz->ops->set_emul_temp(tz, temperature);
}

if (!ret)
- thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);
+ __thermal_zone_device_update(tz, THERMAL_EVENT_UNSPECIFIED);

+unlock:
+ mutex_unlock(&tz->lock);
return ret ? ret : count;
}
static DEVICE_ATTR_WO(emul_temp);
--
2.36.2


2022-10-04 11:52:36

by Daniel Lezcano

Subject: Re: [RFC/RFT PATCH resend] thermal: Protect thermal device operations against thermal device removal

On 04/10/2022 05:39, Guenter Roeck wrote:
> A call to thermal_zone_device_unregister() results in thermal device
> removal. While the thermal device itself is reference counted and
> protected against removal of its associated data structures, the
> thermal device operations are owned by the calling code and unprotected.
> This may result in crashes such as
>
> BUG: unable to handle page fault for address: ffffffffc04ef420
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 5d60e067 P4D 5d60e067 PUD 5d610067 PMD 110197067 PTE 0
> Oops: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 1 PID: 3209 Comm: cat Tainted: G W 5.10.136-19389-g615abc6eb807 #1 02df41ac0b12f3a64f4b34245188d8875bb3bce1
> Hardware name: Google Coral/Coral, BIOS Google_Coral.10068.92.0 11/27/2018
> RIP: 0010:thermal_zone_get_temp+0x26/0x73
> Code: 89 c3 eb d3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 53 48 85 ff 74 50 48 89 fb 48 81 ff 00 f0 ff ff 77 44 48 8b 83 98 03 00 00 <48> 83 78 10 00 74 36 49 89 f6 4c 8d bb d8 03 00 00 4c 89 ff e8 9f
> RSP: 0018:ffffb3758138fd38 EFLAGS: 00010287
> RAX: ffffffffc04ef410 RBX: ffff98f14d7fb000 RCX: 0000000000000000
> RDX: ffff98f17cf90000 RSI: ffffb3758138fd64 RDI: ffff98f14d7fb000
> RBP: ffffb3758138fd50 R08: 0000000000001000 R09: ffff98f17cf90000
> R10: 0000000000000000 R11: ffffffff8dacad28 R12: 0000000000001000
> R13: ffff98f1793a7d80 R14: ffff98f143231708 R15: ffff98f14d7fb018
> FS: 00007ec166097800(0000) GS:ffff98f1bbd00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffc04ef420 CR3: 000000010ee9a000 CR4: 00000000003506e0
> Call Trace:
> temp_show+0x31/0x68
> dev_attr_show+0x1d/0x4f
> sysfs_kf_seq_show+0x92/0x107
> seq_read_iter+0xf5/0x3f2
> vfs_read+0x205/0x379
> __x64_sys_read+0x7c/0xe2
> do_syscall_64+0x43/0x55
> entry_SYSCALL_64_after_hwframe+0x61/0xc6
>
> if a thermal device is removed while accesses to its device attributes
> are ongoing.
>
> Use the thermal device mutex to protect device operations. Clear the
> device operations pointer in thermal_zone_device_unregister() under
> protection of this mutex, and only access it while the mutex is held.
> Flatten and simplify device mutex operations to only acquire the mutex
> once and hold it instead of acquiring and releasing it several times
> during thermal operations. Only validate parameters once at module entry
> points after acquiring the mutex. Execute governor operations under mutex
> instead of expecting governors to acquire and release it.

Does the following series:

https://lore.kernel.org/lkml/[email protected]/

go in the same direction as your proposal?



2022-10-04 14:50:37

by Guenter Roeck

Subject: Re: [RFC/RFT PATCH resend] thermal: Protect thermal device operations against thermal device removal

On 10/4/22 04:49, Daniel Lezcano wrote:
> On 04/10/2022 05:39, Guenter Roeck wrote:
>> A call to thermal_zone_device_unregister() results in thermal device
>> removal. While the thermal device itself is reference counted and
>> protected against removal of its associated data structures, the
>> thermal device operations are owned by the calling code and unprotected.
>> This may result in crashes such as
>>
>> BUG: unable to handle page fault for address: ffffffffc04ef420
>>   #PF: supervisor read access in kernel mode
>>   #PF: error_code(0x0000) - not-present page
>> PGD 5d60e067 P4D 5d60e067 PUD 5d610067 PMD 110197067 PTE 0
>> Oops: 0000 [#1] PREEMPT SMP NOPTI
>> CPU: 1 PID: 3209 Comm: cat Tainted: G        W         5.10.136-19389-g615abc6eb807 #1 02df41ac0b12f3a64f4b34245188d8875bb3bce1
>> Hardware name: Google Coral/Coral, BIOS Google_Coral.10068.92.0 11/27/2018
>> RIP: 0010:thermal_zone_get_temp+0x26/0x73
>> Code: 89 c3 eb d3 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 53 48 85 ff 74 50 48 89 fb 48 81 ff 00 f0 ff ff 77 44 48 8b 83 98 03 00 00 <48> 83 78 10 00 74 36 49 89 f6 4c 8d bb d8 03 00 00 4c 89 ff e8 9f
>> RSP: 0018:ffffb3758138fd38 EFLAGS: 00010287
>> RAX: ffffffffc04ef410 RBX: ffff98f14d7fb000 RCX: 0000000000000000
>> RDX: ffff98f17cf90000 RSI: ffffb3758138fd64 RDI: ffff98f14d7fb000
>> RBP: ffffb3758138fd50 R08: 0000000000001000 R09: ffff98f17cf90000
>> R10: 0000000000000000 R11: ffffffff8dacad28 R12: 0000000000001000
>> R13: ffff98f1793a7d80 R14: ffff98f143231708 R15: ffff98f14d7fb018
>> FS:  00007ec166097800(0000) GS:ffff98f1bbd00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: ffffffffc04ef420 CR3: 000000010ee9a000 CR4: 00000000003506e0
>> Call Trace:
>>   temp_show+0x31/0x68
>>   dev_attr_show+0x1d/0x4f
>>   sysfs_kf_seq_show+0x92/0x107
>>   seq_read_iter+0xf5/0x3f2
>>   vfs_read+0x205/0x379
>>   __x64_sys_read+0x7c/0xe2
>>   do_syscall_64+0x43/0x55
>>   entry_SYSCALL_64_after_hwframe+0x61/0xc6
>>
>> if a thermal device is removed while accesses to its device attributes
>> are ongoing.
>>
>> Use the thermal device mutex to protect device operations. Clear the
>> device operations pointer in thermal_zone_device_unregister() under
>> protection of this mutex, and only access it while the mutex is held.
>> Flatten and simplify device mutex operations to only acquire the mutex
>> once and hold it instead of acquiring and releasing it several times
>> during thermal operations. Only validate parameters once at module entry
>> points after acquiring the mutex. Execute governor operations under mutex
>> instead of expecting governors to acquire and release it.
>
> Does the following series:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> go in the same direction as your proposal?
>

Thanks for the pointer.

The series simplifies the mutex problem, but it doesn't solve the problem
I was trying to solve (the problem causing the crash above). There
is still no guarantee that thermal device ops are not accessed after
the call to thermal_zone_device_unregister().

Thanks,
Guenter
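
For readers following the thread, the idea behind the patch (clear the ops pointer under the zone mutex on unregister, and only dereference it while holding that mutex) can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code: `struct zone`, `struct zone_ops`, `zone_init()`, `zone_get_temp()`, `zone_get_temp_locked()`, `zone_unregister()` and `fixed_ops` are hypothetical names, and a pthread mutex stands in for `tz->lock`. The sketch also assumes the zone structure and its mutex outlive every caller, which the ops table does not.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the thermal zone and its driver-owned ops. */
struct zone;

struct zone_ops {
	int (*get_temp)(struct zone *z, int *temp);
};

struct zone {
	pthread_mutex_t lock;		/* plays the role of tz->lock */
	const struct zone_ops *ops;	/* owned by the driver, not reference counted */
};

void zone_init(struct zone *z, const struct zone_ops *ops)
{
	assert(z != NULL);
	pthread_mutex_init(&z->lock, NULL);
	z->ops = ops;
}

/* Caller must hold z->lock and must have checked z->ops and get_temp. */
static int zone_get_temp_locked(struct zone *z, int *temp)
{
	return z->ops->get_temp(z, temp);
}

/* Entry point: take the lock once, validate once, then call the helper. */
int zone_get_temp(struct zone *z, int *temp)
{
	int ret = -EINVAL;

	if (!z)
		return ret;

	pthread_mutex_lock(&z->lock);
	if (z->ops && z->ops->get_temp)
		ret = zone_get_temp_locked(z, temp);
	pthread_mutex_unlock(&z->lock);

	return ret;
}

/*
 * After this returns, no caller can reach the driver's ops any more.
 * The zone and its mutex are assumed to stay valid; freeing them is a
 * separate lifetime question that this sketch does not address.
 */
void zone_unregister(struct zone *z)
{
	assert(z != NULL);
	pthread_mutex_lock(&z->lock);
	z->ops = NULL;
	pthread_mutex_unlock(&z->lock);
}

/* Example driver callback that always reports a fixed temperature. */
static int fixed_get_temp(struct zone *z, int *temp)
{
	(void)z;
	*temp = 42000;
	return 0;
}

const struct zone_ops fixed_ops = { .get_temp = fixed_get_temp };
```

A reader that races with `zone_unregister()` either completes its callback before the pointer is cleared or sees a NULL `ops` and gets `-EINVAL`; it never dereferences an ops table that the driver may already have released.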

2022-10-07 15:43:46

by kernel test robot

Subject: [thermal] 4971d1200e: BUG:KASAN:use-after-free_in_mutex_lock


Greetings,

FYI, we noticed the following commit (built with gcc-11):

commit: 4971d1200e1f46625fde6db421961ba1cb3a511a ("[RFC/RFT PATCH resend] thermal: Protect thermal device operations against thermal device removal")
url: https://github.com/intel-lab-lkp/linux/commits/Guenter-Roeck/thermal-Protect-thermal-device-operations-against-thermal-device-removal/20221004-114107
patch link: https://lore.kernel.org/linux-pm/[email protected]

in testcase: pm-qa
version: pm-qa-x86_64-5ead848-1_20220523
with the following parameters:

test: thermal



on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz (Haswell) with 8G memory

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):



If you fix the issue, kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/r/[email protected]


[ 38.916500][ T50] BUG: KASAN: use-after-free in mutex_lock (kbuild/src/x86_64-3/include/linux/instrumented.h:101 kbuild/src/x86_64-3/include/linux/atomic/atomic-instrumented.h:1780 kbuild/src/x86_64-3/kernel/locking/mutex.c:171 kbuild/src/x86_64-3/kernel/locking/mutex.c:285)
[ 38.923152][ T50] Write of size 8 at addr ffff8881404a03d8 by task cpuhp/7/50
[ 38.930487][ T50]
[ 38.932702][ T50] CPU: 7 PID: 50 Comm: cpuhp/7 Tainted: G I 6.0.0-00001-g4971d1200e1f #35
[ 38.942471][ T50] Hardware name: Gigabyte Technology Co., Ltd. Z87X-UD5H/Z87X-UD5H-CF, BIOS F9 03/18/2014
[ 38.952230][ T50] Call Trace:
[ 38.955383][ T50] <TASK>
[ 38.958192][ T50] dump_stack_lvl (kbuild/src/x86_64-3/lib/dump_stack.c:107 (discriminator 1))
[ 38.962570][ T50] print_address_description+0x1f/0x200
[ 38.969032][ T50] print_report.cold (kbuild/src/x86_64-3/mm/kasan/report.c:434)
[ 38.973749][ T50] ? _raw_spin_lock_irqsave (kbuild/src/x86_64-3/arch/x86/include/asm/atomic.h:202 kbuild/src/x86_64-3/include/linux/atomic/atomic-instrumented.h:543 kbuild/src/x86_64-3/include/asm-generic/qspinlock.h:111 kbuild/src/x86_64-3/include/linux/spinlock.h:185 kbuild/src/x86_64-3/include/linux/spinlock_api_smp.h:111 kbuild/src/x86_64-3/kernel/locking/spinlock.c:162)
[ 38.979082][ T50] ? mutex_lock (kbuild/src/x86_64-3/include/linux/instrumented.h:101 kbuild/src/x86_64-3/include/linux/atomic/atomic-instrumented.h:1780 kbuild/src/x86_64-3/kernel/locking/mutex.c:171 kbuild/src/x86_64-3/kernel/locking/mutex.c:285)
[ 38.983372][ T50] kasan_report (kbuild/src/x86_64-3/mm/kasan/report.c:162 kbuild/src/x86_64-3/mm/kasan/report.c:497)
[ 38.987663][ T50] ? mutex_lock (kbuild/src/x86_64-3/include/linux/instrumented.h:101 kbuild/src/x86_64-3/include/linux/atomic/atomic-instrumented.h:1780 kbuild/src/x86_64-3/kernel/locking/mutex.c:171 kbuild/src/x86_64-3/kernel/locking/mutex.c:285)
[ 38.991952][ T50] kasan_check_range (kbuild/src/x86_64-3/mm/kasan/generic.c:190)
[ 38.996675][ T50] mutex_lock (kbuild/src/x86_64-3/include/linux/instrumented.h:101 kbuild/src/x86_64-3/include/linux/atomic/atomic-instrumented.h:1780 kbuild/src/x86_64-3/kernel/locking/mutex.c:171 kbuild/src/x86_64-3/kernel/locking/mutex.c:285)
[ 39.000791][ T50] ? __mutex_lock_slowpath (kbuild/src/x86_64-3/kernel/locking/mutex.c:282)
[ 39.005949][ T50] ? kobject_cleanup (kbuild/src/x86_64-3/lib/kobject.c:683)
[ 39.010759][ T50] thermal_zone_device_unregister (kbuild/src/x86_64-3/drivers/thermal/thermal_core.c:436 kbuild/src/x86_64-3/drivers/thermal/thermal_core.c:425)
[ 39.017303][ T50] ? mutex_unlock (kbuild/src/x86_64-3/arch/x86/include/asm/atomic64_64.h:190 kbuild/src/x86_64-3/include/linux/atomic/atomic-long.h:449 kbuild/src/x86_64-3/include/linux/atomic/atomic-instrumented.h:1790 kbuild/src/x86_64-3/kernel/locking/mutex.c:181 kbuild/src/x86_64-3/kernel/locking/mutex.c:540)
[ 39.021764][ T50] ? __mutex_unlock_slowpath+0x2c0/0x2c0
[ 39.028311][ T50] pkg_thermal_cpu_offline (kbuild/src/x86_64-3/drivers/thermal/intel/x86_pkg_temp_thermal.c:418) x86_pkg_temp_thermal
[ 39.035635][ T50] ? pkg_thermal_notify (kbuild/src/x86_64-3/drivers/thermal/intel/x86_pkg_temp_thermal.c:386) x86_pkg_temp_thermal
[ 39.042696][ T50] cpuhp_invoke_callback (kbuild/src/x86_64-3/kernel/cpu.c:192)
[ 39.047853][ T50] ? __schedule (kbuild/src/x86_64-3/kernel/sched/core.c:6376)
[ 39.052316][ T50] cpuhp_thread_fun (kbuild/src/x86_64-3/kernel/cpu.c:785)
[ 39.057039][ T50] ? smpboot_thread_fn (kbuild/src/x86_64-3/kernel/smpboot.c:112)
[ 39.061937][ T50] ? cpuhp_invoke_callback (kbuild/src/x86_64-3/kernel/cpu.c:742)
[ 39.067264][ T50] ? cpuhp_invoke_callback (kbuild/src/x86_64-3/kernel/cpu.c:742)
[ 39.072595][ T50] ? cpuhp_invoke_callback (kbuild/src/x86_64-3/kernel/cpu.c:742)
[ 39.077927][ T50] ? smpboot_thread_fn (kbuild/src/x86_64-3/kernel/smpboot.c:112)
[ 39.082823][ T50] smpboot_thread_fn (kbuild/src/x86_64-3/kernel/smpboot.c:164 (discriminator 4))
[ 39.087631][ T50] ? find_next_bit (kbuild/src/x86_64-3/arch/x86/events/intel/core.c:4961)
[ 39.092095][ T50] ? find_next_bit (kbuild/src/x86_64-3/arch/x86/events/intel/core.c:4961)
[ 39.096559][ T50] kthread (kbuild/src/x86_64-3/kernel/kthread.c:376)
[ 39.100502][ T50] ? kthread_complete_and_exit (kbuild/src/x86_64-3/kernel/kthread.c:331)
[ 39.106006][ T50] ret_from_fork (kbuild/src/x86_64-3/arch/x86/entry/entry_64.S:312)
[ 39.110295][ T50] </TASK>
[ 39.113197][ T50]
[ 39.115399][ T50] Allocated by task 19:
[ 39.119428][ T50] kasan_save_stack (kbuild/src/x86_64-3/mm/kasan/common.c:39)
[ 39.123978][ T50] __kasan_kmalloc (kbuild/src/x86_64-3/mm/kasan/common.c:45 kbuild/src/x86_64-3/mm/kasan/common.c:437 kbuild/src/x86_64-3/mm/kasan/common.c:516 kbuild/src/x86_64-3/mm/kasan/common.c:525)
[ 39.128443][ T50] thermal_zone_device_register_with_trips (kbuild/src/x86_64-3/include/linux/slab.h:600 kbuild/src/x86_64-3/include/linux/slab.h:733 kbuild/src/x86_64-3/drivers/thermal/thermal_core.c:1236)
[ 39.135161][ T50] thermal_zone_device_register (kbuild/src/x86_64-3/drivers/thermal/thermal_core.c:1347)
[ 39.140751][ T50] pkg_temp_thermal_device_add (kbuild/src/x86_64-3/drivers/thermal/intel/x86_pkg_temp_thermal.c:359) x86_pkg_temp_thermal
[ 39.148421][ T50] cpuhp_invoke_callback (kbuild/src/x86_64-3/kernel/cpu.c:192)
[ 39.153577][ T50] cpuhp_thread_fun (kbuild/src/x86_64-3/kernel/cpu.c:785)
[ 39.158300][ T50] smpboot_thread_fn (kbuild/src/x86_64-3/kernel/smpboot.c:164 (discriminator 4))
[ 39.163108][ T50] kthread (kbuild/src/x86_64-3/kernel/kthread.c:376)
[ 39.167051][ T50] ret_from_fork (kbuild/src/x86_64-3/arch/x86/entry/entry_64.S:312)
[ 39.171342][ T50]
[ 39.173541][ T50] Freed by task 50:
[ 39.177218][ T50] kasan_save_stack (kbuild/src/x86_64-3/mm/kasan/common.c:39)
[ 39.181768][ T50] kasan_set_track (kbuild/src/x86_64-3/mm/kasan/common.c:45)
[ 39.186231][ T50] kasan_set_free_info (kbuild/src/x86_64-3/mm/kasan/generic.c:372)
[ 39.191042][ T50] __kasan_slab_free (kbuild/src/x86_64-3/mm/kasan/common.c:369 kbuild/src/x86_64-3/mm/kasan/common.c:329 kbuild/src/x86_64-3/mm/kasan/common.c:375)
[ 39.195852][ T50] kfree (kbuild/src/x86_64-3/mm/slub.c:1785 kbuild/src/x86_64-3/mm/slub.c:3539 kbuild/src/x86_64-3/mm/slub.c:4567)
[ 39.197982][ T401] X.Org X Server 1.20.11
[ 39.199605][ T50] device_release (kbuild/src/x86_64-3/drivers/base/core.c:2335)
[ 39.199610][ T50] kobject_cleanup (kbuild/src/x86_64-3/lib/kobject.c:677)
[ 39.199633][ T401]
[ 39.203721][ T50] thermal_zone_device_unregister (kbuild/src/x86_64-3/drivers/thermal/thermal_core.c:436 kbuild/src/x86_64-3/drivers/thermal/thermal_core.c:425)
[ 39.203726][ T50] pkg_thermal_cpu_offline (kbuild/src/x86_64-3/drivers/thermal/intel/x86_pkg_temp_thermal.c:418) x86_pkg_temp_thermal
[ 39.203730][ T50] cpuhp_invoke_callback (kbuild/src/x86_64-3/kernel/cpu.c:192)
[ 39.203733][ T50] cpuhp_thread_fun (kbuild/src/x86_64-3/kernel/cpu.c:785)
[ 39.203735][ T50] smpboot_thread_fn (kbuild/src/x86_64-3/kernel/smpboot.c:164 (discriminator 4))
[ 39.243555][ T50] kthread (kbuild/src/x86_64-3/kernel/kthread.c:376)
[ 39.247503][ T50] ret_from_fork (kbuild/src/x86_64-3/arch/x86/entry/entry_64.S:312)
[ 39.251795][ T50]
[ 39.253995][ T50] The buggy address belongs to the object at ffff8881404a0000
[ 39.253995][ T50] which belongs to the cache kmalloc-2k of size 2048
[ 39.267912][ T50] The buggy address is located 984 bytes inside of
[ 39.267912][ T50] 2048-byte region [ffff8881404a0000, ffff8881404a0800)
[ 39.281137][ T50]
[ 39.283341][ T50] The buggy address belongs to the physical page:
[ 39.289615][ T50] page:000000009883a4a4 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff8881404a1000 pfn:0x1404a0
[ 39.301020][ T50] head:000000009883a4a4 order:3 compound_mapcount:0 compound_pincount:0
[ 39.309213][ T50] flags: 0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[ 39.317321][ T50] raw: 0017ffffc0010200 ffffea0005049808 ffffea0005042a08 ffff888100042f00
[ 39.325772][ T50] raw: ffff8881404a1000 0000000000080004 00000001ffffffff 0000000000000000
[ 39.334220][ T50] page dumped because: kasan: bad access detected
[ 39.340500][ T50]
[ 39.342707][ T50] Memory state around the buggy address:
[ 39.348206][ T50] ffff8881404a0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.356137][ T50] ffff8881404a0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.364067][ T50] >ffff8881404a0380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.371997][ T50] ^
[ 39.378805][ T50] ffff8881404a0400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.386737][ T50] ffff8881404a0480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 39.394667][ T50] ==================================================================
[ 39.402641][ T50] Disabling lock debugging due to kernel taint
[ 39.624286][ T399] /usr/bin/wget -q --timeout=1800 --tries=1 --local-encoding=UTF-8 http://internal-lkp-server:80/~lkp/cgi-bin/lkp-jobfile-append-var?job_file=/lkp/jobs/scheduled/lkp-hsw-d04/pm-qa-thermal-debian-11.1-x86_64-20220510.cgz-4971d1200e1f46625fde6db421961ba1cb3a511a-20221005-50436-146a5zr-4.yaml&job_state=running -O /dev/null
[ 39.624303][ T399]
[ 39.656811][ T399] target ucode: 0x28
[ 39.656819][ T399]
[ 39.661957][ T1113] Consider using thermal netlink events interface
[ 39.663530][ T399] current_version: 28, target_version: 28
[ 39.669136][ T399]
[ 39.677836][ T399] 2022-10-05 05:00:38 make -C thermal run_tests
[ 39.677843][ T399]
[ 39.687286][ T399] make: Entering directory '/lkp/benchmarks/pm-qa/thermal'
[ 39.687294][ T399]
[ 39.696670][ T399] ###
[ 39.696676][ T399]
[ 39.701716][ T399] ### thermal_00:
[ 39.701722][ T399]
[ 39.710232][ T399] ### list existing thermal-zones and cooling-devices in the system
[ 39.710247][ T399]
[ 39.722272][ T399] ### https://wiki.linaro.org/WorkingGroups/PowerManagement/Doc/QA/Scripts#thermal_00
[ 39.722281][ T399]
[ 39.724588][ T401] X Protocol Version 11, Revision 0
[ 39.731763][ T399] ###
[ 39.733877][ T401]
[ 39.743767][ T399]
[ 39.746579][ T399] Thermal Zone list
[ 39.746585][ T399]
[ 39.752858][ T399] -----------------
[ 39.752864][ T399]
[ 39.759088][ T399] thermal_zone0
[ 39.759094][ T399]
[ 39.764876][ T399] - acpitz
[ 39.764882][ T399]
[ 39.770261][ T399] thermal_zone1
[ 39.770267][ T399]
[ 39.776014][ T399] - acpitz
[ 39.776020][ T399]
[ 39.781173][ T399]
[ 39.781179][ T399]
[ 39.785630][ T399]
[ 39.785642][ T399]
[ 39.791059][ T399] Cooling Device list
[ 39.791074][ T399]
[ 39.797515][ T399] -------------------
[ 39.797521][ T399]
[ 39.803923][ T399] cooling_device0
[ 39.803929][ T399]
[ 39.809780][ T399] - Fan
[ 39.809807][ T399]
[ 39.814977][ T399] cooling_device1
[ 39.814984][ T399]
[ 39.820857][ T399] - Fan
[ 39.820862][ T399]
[ 39.826066][ T399] cooling_device10
[ 39.826074][ T399]
[ 39.832125][ T399] - Processor
[ 39.832134][ T399]
[ 39.842028][ T399] cooling_device11
[ 39.842038][ T399]
[ 39.848197][ T399] - Processor
[ 39.848205][ T399]
[ 39.854084][ T399] cooling_device12
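
One reading of the KASAN report above is that the object was freed through
device_release() inside thermal_zone_device_unregister(), and that the same
function then called mutex_lock() on a lock embedded in that object (the
access at offset 984, i.e. 0x3d8). The following user-space sketch shows the
ordering that avoids this class of bug, under the assumption that all work
needing the lock can be completed before the final reference is dropped.
The names (struct obj, obj_put, obj_teardown) are hypothetical, and the
plain int reference count is a simplification of the kernel's kref/kobject
machinery.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical object with an embedded lock and a reference count. */
struct obj {
	pthread_mutex_t lock;
	int refcount;	/* simplified: not atomic in this toy model */
	int *released;	/* set to 1 when the object is freed */
};

struct obj *obj_alloc(int *released)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	pthread_mutex_init(&o->lock, NULL);
	o->refcount = 1;
	o->released = released;
	return o;
}

/* Drop one reference; free on the last one. Returns 1 if freed. */
int obj_put(struct obj *o)
{
	if (--o->refcount > 0)
		return 0;
	*o->released = 1;
	pthread_mutex_destroy(&o->lock);
	free(o);
	return 1;
}

/*
 * Safe ordering: everything that needs the embedded lock happens before
 * the reference is dropped. Locking o->lock after obj_put() would touch
 * freed memory whenever that put was the last one.
 */
int obj_teardown(struct obj *o)
{
	pthread_mutex_lock(&o->lock);
	/* ... detach state that must be changed under the lock ... */
	pthread_mutex_unlock(&o->lock);

	return obj_put(o);	/* o must not be used after this line */
}
```

An alternative is to take an extra reference before the teardown and drop
it only after the last locked section, which keeps the lock valid even if
another path drops its reference in between.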


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (12.67 kB)
config-6.0.0-00001-g4971d1200e1f (170.83 kB)
job-script (5.55 kB)
dmesg.xz (15.77 kB)
pm-qa (25.31 kB)
job.yaml (4.32 kB)
reproduce (27.00 B)