2022-01-20 00:23:30

by Nicolas Cavallari

[permalink] [raw]
Subject: Issues with mt7915 thermal throttling

I noticed some strange issues with mt7915's thermal throttling,
particularly the cooling_device.

First, it seems that the thermal subsystem expect that higher
cooling_device states provide more cooling, but mt7915e apparently does
the opposite and use it as a duty cycle, where state=1 does severe
throttling/max cooling (iperf throughput basically go down to near zero)
and state=100 is full power.

Also, state=0, from the comments, apparently disable thermal management,
except that in practice, it does not change the throttle state, since
throughput stays low when switching from state=1 to state=0, and stays
high when switching from 100 to 0.

As a result, as soon as the default thermal zone runs a little hot, the
performance of mt7915e is destroyed and does not recover much when the
temperature drops down.

I can come up with a patch to fix the first issue, but not the state=0
one, and i would like some pointers/confirmation.


2022-02-09 09:52:42

by Nicolas Cavallari

[permalink] [raw]
Subject: [PATCH RFC 1/3] mt76: mt7915e: Fix degraded performance after temporary overheat

mt7915e registers a cooling_device with wrong semantics:

1. cooling_device expect that higher states values should cool more, but
mt7915e did the opposite... with the exception of state == 0, which
should "disable thermal management", but does not seem to have any
effect since the previous state is kept.

The result is that when the thermal zone heats up a bit and bumps the
cooling_device state from 0 to 1 to cool a bit, the performance is
destroyed, and when going back from 1 to 0, the performance stays bad.

2. Reading the cooling_device state does not always return the last
written state, but can return the actual hardware throttle state,
which is different.

This is a problem because the mt7915 firmware actually implement the
equivalent of a thermal zone with trip points. Setting the cooling
device state actually changes the throttles at each trip point, so the
following could occur if the first issue is fixed:

- thermal subsystem set state to 100% power (state=0)
- mt7915e driver set trip throttles to [100%, 50%, 25%, 12%]
- hardware heats up and decides to switch to 50% power
- thermal subsystem see that power is 50% (state=50), decide to increase
it to 60% (state=40) because the rest of the system is cool.
- mt7915e driver set trip throttle to [60%, 30%, 15%, 7%]
- hardware thus switches to 30% power
[race to the bottom continues...]

This patch corrects the semantics of the cooling_device to the one that
the thermal subsystem expect it.

Signed-off-by: Nicolas Cavallari <[email protected]>
---
drivers/net/wireless/mediatek/mt76/mt7915/init.c | 16 ++++++++++------
.../net/wireless/mediatek/mt76/mt7915/mt7915.h | 2 ++
2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index 2bc9097c5214..d6efbf1a2724 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -97,7 +97,7 @@ static int
mt7915_thermal_get_max_throttle_state(struct thermal_cooling_device *cdev,
unsigned long *state)
{
- *state = MT7915_THERMAL_THROTTLE_MAX;
+ *state = MT7915_CDEV_THROTTLE_MAX;

return 0;
}
@@ -108,7 +108,7 @@ mt7915_thermal_get_cur_throttle_state(struct thermal_cooling_device *cdev,
{
struct mt7915_phy *phy = cdev->devdata;

- *state = phy->throttle_state;
+ *state = phy->cdev_state;

return 0;
}
@@ -120,20 +120,24 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
struct mt7915_phy *phy = cdev->devdata;
int ret;

- if (state > MT7915_THERMAL_THROTTLE_MAX)
+ if (state > MT7915_CDEV_THROTTLE_MAX)
return -EINVAL;

if (phy->throttle_temp[0] > phy->throttle_temp[1])
return 0;

- if (state == phy->throttle_state)
+ if (state == phy->cdev_state)
return 0;

- ret = mt7915_mcu_set_thermal_throttling(phy, state);
+ // cooling_device convention: 0 = no cooling, more = more cooling
+ // mcu convention: 1 = max cooling, more = less cooling
+ ret = mt7915_mcu_set_thermal_throttling(phy,
+ MT7915_THERMAL_THROTTLE_MAX
+ - state);
if (ret)
return ret;

- phy->throttle_state = state;
+ phy->cdev_state = state;

return 0;
}
diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h b/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h
index 0403912a521d..cf4c8d2dcc60 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h
@@ -49,6 +49,7 @@
#define MT7915_CFEND_RATE_11B 0x03 /* 11B LP, 11M */

#define MT7915_THERMAL_THROTTLE_MAX 100
+#define MT7915_CDEV_THROTTLE_MAX 99

#define MT7915_SKU_RATE_NUM 161

@@ -218,6 +219,7 @@ struct mt7915_phy {
struct ieee80211_vif *monitor_vif;

struct thermal_cooling_device *cdev;
+ u8 cdev_state;
u8 throttle_state;
u32 throttle_temp[2]; /* 0: critical high, 1: maximum */

--
2.34.1


2022-02-09 09:55:09

by Nicolas Cavallari

[permalink] [raw]
Subject: [PATCH RFC 2/3] mt76: mt7915e: Add a hwmon attribute to get the actual throttle state.

The firmware-controlled actual throttle state was previously available
by reading the cooling_device, but this confused the thermal subsystem.
Add a hwmon attribute to get it instead.

Signed-off-by: Nicolas Cavallari <[email protected]>
---
.../net/wireless/mediatek/mt76/mt7915/init.c | 27 ++++++++++++-------
1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index d6efbf1a2724..71ca4b0641b6 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -50,15 +50,22 @@ static ssize_t mt7915_thermal_temp_show(struct device *dev,
int i = to_sensor_dev_attr(attr)->index;
int temperature;

- if (i)
- return sprintf(buf, "%u\n", phy->throttle_temp[i - 1] * 1000);
-
- temperature = mt7915_mcu_get_temperature(phy);
- if (temperature < 0)
- return temperature;
-
- /* display in millidegree celcius */
- return sprintf(buf, "%u\n", temperature * 1000);
+ switch (i) {
+ case 0:
+ temperature = mt7915_mcu_get_temperature(phy);
+ if (temperature < 0)
+ return temperature;
+ /* display in millidegree celcius */
+ return sprintf(buf, "%u\n", temperature * 1000);
+ case 1:
+ case 2:
+ return sprintf(buf, "%u\n",
+ phy->throttle_temp[i - 1] * 1000);
+ case 3:
+ return sprintf(buf, "%hhu\n", phy->throttle_state);
+ default:
+ return -EINVAL;
+ }
}

static ssize_t mt7915_thermal_temp_store(struct device *dev,
@@ -84,11 +91,13 @@ static ssize_t mt7915_thermal_temp_store(struct device *dev,
static SENSOR_DEVICE_ATTR_RO(temp1_input, mt7915_thermal_temp, 0);
static SENSOR_DEVICE_ATTR_RW(temp1_crit, mt7915_thermal_temp, 1);
static SENSOR_DEVICE_ATTR_RW(temp1_max, mt7915_thermal_temp, 2);
+static SENSOR_DEVICE_ATTR_RO(throttle1, mt7915_thermal_temp, 3);

static struct attribute *mt7915_hwmon_attrs[] = {
&sensor_dev_attr_temp1_input.dev_attr.attr,
&sensor_dev_attr_temp1_crit.dev_attr.attr,
&sensor_dev_attr_temp1_max.dev_attr.attr,
+ &sensor_dev_attr_throttle1.dev_attr.attr,
NULL,
};
ATTRIBUTE_GROUPS(mt7915_hwmon);
--
2.34.1