2022-12-07 05:31:18

by Howard Hsu

[permalink] [raw]
Subject: [PATCH v2 0/3] wifi: mt76: mt7915: rework thermal protection

*** BLURB HERE ***

Howard Hsu (3):
wifi: mt76: mt7915: rework mt7915_thermal_set_cur_throttle_state()
wifi: mt76: mt7915: call mt7915_mcu_set_thermal_throttling() only
after init_work
wifi: mt76: mt7915: rework mt7915_mcu_set_thermal_throttling()

.../net/wireless/mediatek/mt76/mt7915/init.c | 21 ++++++++++---------
.../net/wireless/mediatek/mt76/mt7915/main.c | 5 +++++
.../net/wireless/mediatek/mt76/mt7915/mcu.c | 19 +++++++++--------
3 files changed, 26 insertions(+), 19 deletions(-)

--
2.18.0


2022-12-07 05:31:44

by Howard Hsu

[permalink] [raw]
Subject: [PATCH v2 1/3] wifi: mt76: mt7915: rework mt7915_thermal_set_cur_throttle_state()

This patch includes 3 changes:
1. The maximum throttle state can be set to 100 to fix the problem that
thermal_protect_disable can never be triggered.
2. Throttle state do not need to be different from the previous state.
This will make it is impossible for users to just change the
trigger/restore temp but not the throttle state.
3. Add dev_err so that it is easier to see invalid setting while looking at dmesg.

Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after temporary overheat")
Co-developed-by: Ryder Lee <[email protected]>
Signed-off-by: Ryder Lee <[email protected]>
Signed-off-by: Howard Hsu <[email protected]>
---
.../net/wireless/mediatek/mt76/mt7915/init.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index c810c31fbd6e..abeecf15f1c8 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
int ret;

- if (state > MT7915_CDEV_THROTTLE_MAX)
+ if (state > MT7915_THERMAL_THROTTLE_MAX) {
+ dev_err(phy->dev->mt76.dev,
+ "please specify a valid throttling state\n");
return -EINVAL;
+ }

- if (phy->throttle_temp[0] > phy->throttle_temp[1])
- return 0;
-
- if (state == phy->cdev_state)
- return 0;
+ if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
+ dev_err(phy->dev->mt76.dev,
+ "temp1_crit shall not be greater than temp1_max\n");
+ return -EINVAL;
+ }

/*
* cooling_device convention: 0 = no cooling, more = more cooling
@@ -164,7 +167,7 @@ static void mt7915_unregister_thermal(struct mt7915_phy *phy)
struct wiphy *wiphy = phy->mt76->hw->wiphy;

if (!phy->cdev)
- return;
+ return;

sysfs_remove_link(&wiphy->dev.kobj, "cooling_device");
thermal_cooling_device_unregister(phy->cdev);
@@ -1101,7 +1104,6 @@ static void mt7915_stop_hardware(struct mt7915_dev *dev)
mt7986_wmac_disable(dev);
}

-
int mt7915_register_device(struct mt7915_dev *dev)
{
struct ieee80211_hw *hw = mt76_hw(dev);
--
2.18.0

2022-12-07 05:32:26

by Howard Hsu

[permalink] [raw]
Subject: [PATCH v2 3/3] wifi: mt76: mt7915: rework mt7915_mcu_set_thermal_throttling()

Firmware expects to disable thermal protect first before reconfiguring.

Fixes: 34b877d972be ("mt76: mt7915: add thermal cooling device support")
Signed-off-by: Howard Hsu <[email protected]>
---
.../net/wireless/mediatek/mt76/mt7915/mcu.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c b/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c
index b2652de082ba..176d293eb32a 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/mcu.c
@@ -3070,19 +3070,22 @@ int mt7915_mcu_set_thermal_throttling(struct mt7915_phy *phy, u8 state)
} __packed req = {
.ctrl = {
.band_idx = phy->mt76->band_idx,
+ .type.protect_type = 1,
+ .type.trigger_type = 1,
},
};
- int level;
+ int level, ret;

- if (!state) {
- req.ctrl.ctrl_id = THERMAL_PROTECT_DISABLE;
- goto out;
- }
+ req.ctrl.ctrl_id = THERMAL_PROTECT_DISABLE;
+
+ ret = mt76_mcu_send_msg(&dev->mt76, MCU_EXT_CMD(THERMAL_PROT),
+ &req, sizeof(req.ctrl), false);
+
+ if (ret || !state)
+ return ret;

/* set duty cycle and level */
for (level = 0; level < 4; level++) {
- int ret;
-
req.ctrl.ctrl_id = THERMAL_PROTECT_DUTY_CONFIG;
req.ctrl.duty.duty_level = level;
req.ctrl.duty.duty_cycle = state;
@@ -3100,8 +3103,6 @@ int mt7915_mcu_set_thermal_throttling(struct mt7915_phy *phy, u8 state)
req.restore_temp = cpu_to_le32(phy->throttle_temp[0] - 10);
req.trigger_temp = cpu_to_le32(phy->throttle_temp[1]);
req.sustain_time = cpu_to_le16(10);
-
-out:
req.ctrl.type.protect_type = 1;
req.ctrl.type.trigger_type = 1;

--
2.18.0

2022-12-07 05:32:26

by Howard Hsu

[permalink] [raw]
Subject: [PATCH v2 2/3] wifi: mt76: mt7915: call mt7915_mcu_set_thermal_throttling() only after init_work

Enable thermal management by default shall not be executed during mcu
init. This causes thermal configuration being reset to the firmware
default settings.

Fixes: 0063b86c9120 ("mt76: mt7915e: Enable thermal management by default")
Signed-off-by: Howard Hsu <[email protected]>
---
drivers/net/wireless/mediatek/mt76/mt7915/init.c | 3 +--
drivers/net/wireless/mediatek/mt76/mt7915/main.c | 5 +++++
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index abeecf15f1c8..8f4561eed5db 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -204,8 +204,7 @@ static int mt7915_thermal_init(struct mt7915_phy *phy)
phy->throttle_temp[0] = 110;
phy->throttle_temp[1] = 120;

- return mt7915_mcu_set_thermal_throttling(phy,
- MT7915_THERMAL_THROTTLE_MAX);
+ return 0;
}

static void mt7915_led_set_config(struct led_classdev *led_cdev,
diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/main.c b/drivers/net/wireless/mediatek/mt76/mt7915/main.c
index 0511d6a505b0..fc87216e40b1 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/main.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/main.c
@@ -57,6 +57,11 @@ int mt7915_run(struct ieee80211_hw *hw)
mt7915_mac_enable_nf(dev, phy->mt76->band_idx);
}

+ ret = mt7915_mcu_set_thermal_throttling(phy, MT7915_THERMAL_THROTTLE_MAX);
+
+ if (ret)
+ goto out;
+
ret = mt76_connac_mcu_set_rts_thresh(&dev->mt76, 0x92b,
phy->mt76->band_idx);
if (ret)
--
2.18.0

2022-12-07 08:42:40

by Nicolas Cavallari

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] wifi: mt76: mt7915: rework mt7915_thermal_set_cur_throttle_state()

On 07/12/2022 06:24, Howard Hsu wrote:
> This patch includes 3 changes:
> 1. The maximum throttle state can be set to 100 to fix the problem that
> thermal_protect_disable can never be triggered.

You are modifying the cooling_device part. The cooling_device is
explicitly configured to have a max state of MT7915_CDEV_THROTTLE_MAX
(=99), so the thermal subsystem will probably prevent
mt7915_thermal_set_cur_throttle_state from being called with a higher
value. It will also probably complain if get_cur_state starts returning
values above MT7915_CDEV_THROTTLE_MAX.

And, as the comment below indicates, the thermal subsystem expect that a
higher state provide more cooling. So if 99 means "maximum cooling",
100 cannot mean "disable cooling".

Also, last time I tried, thermal_protect_disable didn't work; It didn't
disable anything, the previous thermal throttle kept being applied.
Maybe a new firmware fixed this, but the kernel cannot simply expect the
firmware to be up to date.

> 2. Throttle state do not need to be different from the previous state.
> This will make it is impossible for users to just change the
> trigger/restore temp but not the throttle state.

The throttle state is mostly set by the kernel's thermal governor and
the user has only very little control over it. The thermal governor
runs every X seconds and will change the state if it thinks it is too
low or too high.

The default step_wise governor will aggressively set it to zero if the
system isn't overheating, for example.

> 3. Add dev_err so that it is easier to see invalid setting while looking at dmesg.
>
> Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after temporary overheat")
> Co-developed-by: Ryder Lee <[email protected]>
> Signed-off-by: Ryder Lee <[email protected]>
> Signed-off-by: Howard Hsu <[email protected]>
> ---
> .../net/wireless/mediatek/mt76/mt7915/init.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> index c810c31fbd6e..abeecf15f1c8 100644
> --- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> +++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> @@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
> u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
> int ret;
>
> - if (state > MT7915_CDEV_THROTTLE_MAX)
> + if (state > MT7915_THERMAL_THROTTLE_MAX) {
> + dev_err(phy->dev->mt76.dev,
> + "please specify a valid throttling state\n");
> return -EINVAL;
> + }
>
> - if (phy->throttle_temp[0] > phy->throttle_temp[1])
> - return 0;
> -
> - if (state == phy->cdev_state)
> - return 0;
> + if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
> + dev_err(phy->dev->mt76.dev,
> + "temp1_crit shall not be greater than temp1_max\n");
> + return -EINVAL;
> + }
>
> /*
> * cooling_device convention: 0 = no cooling, more = more cooling
^^^^^^^^^^^^^^^^^^^^^^^^^

2022-12-08 12:50:21

by Howard Hsu

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] wifi: mt76: mt7915: rework mt7915_thermal_set_cur_throttle_state()

On Wed, 2022-12-07 at 09:15 +0100, Nicolas Cavallari wrote:
> On 07/12/2022 06:24, Howard Hsu wrote:
> > This patch includes 3 changes:
> > 1. The maximum throttle state can be set to 100 to fix the problem
> > that
> > thermal_protect_disable can never be triggered.
>
> You are modifying the cooling_device part. The cooling_device is
> explicitly configured to have a max state of
> MT7915_CDEV_THROTTLE_MAX
> (=99), so the thermal subsystem will probably prevent
> mt7915_thermal_set_cur_throttle_state from being called with a
> higher
> value. It will also probably complain if get_cur_state starts
> returning
> values above MT7915_CDEV_THROTTLE_MAX.
>
> And, as the comment below indicates, the thermal subsystem expect
> that a
> higher state provide more cooling. So if 99 means "maximum
> cooling",
> 100 cannot mean "disable cooling".
>
> Also, last time I tried, thermal_protect_disable didn't work; It
> didn't
> disable anything, the previous thermal throttle kept being applied.
> Maybe a new firmware fixed this, but the kernel cannot simply expect
> the
> firmware to be up to date.
>

Thanks for your comments. Let me give you an example to confirm with
you if I understand your comments correctly.

1. The current cooling state of the cooling device is 50 (cur_state =
50).
2. The cooling state is set to 100 for "disable cooling".
3. The thermal subsystem decides to decrease state because the rest of
system is cooler. And then it will adjust it downward based on
cur_state, which is 100. For example, thermal subsytem set cur_state to
90. But obviously this will make the performance worse than at step 1,
even though the system is cooler. The design for 100 mean "disable
cooling" will mess up the thermal governor.

Let me know if there is any misunderstanding. And I will remove the
first change of this patch.

> > 2. Throttle state do not need to be different from the previous
> > state.
> > This will make it is impossible for users to just change the
> > trigger/restore temp but not the throttle state.
>
> The throttle state is mostly set by the kernel's thermal governor
> and
> the user has only very little control over it. The thermal governor
> runs every X seconds and will change the state if it thinks it is
> too
> low or too high.
>
> The default step_wise governor will aggressively set it to zero if
> the
> system isn't overheating, for example.
>

I don't think there is any conflict between your comment and second
change. If we keep the check that previous cooling state shall be
different from the new cooling state, this will bother users who only
wants to change the temp1_crit but not the cur_state. It is
unreasonable for the user, if they wants the new temp1_crit to take
effect in the firmware, they must set a differnt cooling state.


> > 3. Add dev_err so that it is easier to see invalid setting while
> > looking at dmesg.
> >
> > Fixes: 771cd8d4c369 ("mt76: mt7915e: Fix degraded performance after
> > temporary overheat")
> > Co-developed-by: Ryder Lee <[email protected]>
> > Signed-off-by: Ryder Lee <[email protected]>
> > Signed-off-by: Howard Hsu <[email protected]>
> > ---
> > .../net/wireless/mediatek/mt76/mt7915/init.c | 18 ++++++++++---
> > -----
> > 1 file changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > index c810c31fbd6e..abeecf15f1c8 100644
> > --- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > +++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
> > @@ -131,14 +131,17 @@ mt7915_thermal_set_cur_throttle_state(struct
> > thermal_cooling_device *cdev,
> > u8 throttling = MT7915_THERMAL_THROTTLE_MAX - state;
> > int ret;
> >
> > - if (state > MT7915_CDEV_THROTTLE_MAX)
> > + if (state > MT7915_THERMAL_THROTTLE_MAX) {
> > + dev_err(phy->dev->mt76.dev,
> > + "please specify a valid throttling state\n");
> > return -EINVAL;
> > + }
> >
> > - if (phy->throttle_temp[0] > phy->throttle_temp[1])
> > - return 0;
> > -
> > - if (state == phy->cdev_state)
> > - return 0;
> > + if (phy->throttle_temp[0] > phy->throttle_temp[1]) {
> > + dev_err(phy->dev->mt76.dev,
> > + "temp1_crit shall not be greater than
> > temp1_max\n");
> > + return -EINVAL;
> > + }
> >
> > /*
> > * cooling_device convention: 0 = no cooling, more = more
> > cooling
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^
>

2022-12-08 16:49:05

by Nicolas Cavallari

[permalink] [raw]
Subject: Re: [PATCH v2 1/3] wifi: mt76: mt7915: rework mt7915_thermal_set_cur_throttle_state()

On 08/12/2022 13:44, Howard-YH Hsu (許育豪) wrote:
> On Wed, 2022-12-07 at 09:15 +0100, Nicolas Cavallari wrote:
>> On 07/12/2022 06:24, Howard Hsu wrote:
>> > This patch includes 3 changes:
>> > 1. The maximum throttle state can be set to 100 to fix the problem
>> > that
>> > thermal_protect_disable can never be triggered.
>>
>> You are modifying the cooling_device part. The cooling_device is
>> explicitly configured to have a max state of
>> MT7915_CDEV_THROTTLE_MAX
>> (=99), so the thermal subsystem will probably prevent
>> mt7915_thermal_set_cur_throttle_state from being called with a
>> higher
>> value. It will also probably complain if get_cur_state starts
>> returning
>> values above MT7915_CDEV_THROTTLE_MAX.
>>
>> And, as the comment below indicates, the thermal subsystem expect
>> that a
>> higher state provide more cooling. So if 99 means "maximum
>> cooling",
>> 100 cannot mean "disable cooling".
>>
>> Also, last time I tried, thermal_protect_disable didn't work; It
>> didn't
>> disable anything, the previous thermal throttle kept being applied.
>> Maybe a new firmware fixed this, but the kernel cannot simply expect
>> the
>> firmware to be up to date.
>>
>
> Thanks for your comments. Let me give you an example to confirm with
> you if I understand your comments correctly.
>
> 1. The current cooling state of the cooling device is 50 (cur_state =
> 50).
> 2. The cooling state is set to 100 for "disable cooling".
> 3. The thermal subsystem decides to decrease state because the rest of
> system is cooler. And then it will adjust it downward based on
> cur_state, which is 100. For example, thermal subsytem set cur_state to
> 90. But obviously this will make the performance worse than at step 1,
> even though the system is cooler. The design for 100 mean "disable
> cooling" will mess up the thermal governor.
>
> Let me know if there is any misunderstanding. And I will remove the
> first change of this patch.

This is pretty much my second point.

The other case is if the system is overheating a lot and the kernel bumps the
state from e.g. 80 to 100, then the system should not disable throttling.

>
>> > 2. Throttle state do not need to be different from the previous
>> > state.
>> > This will make it is impossible for users to just change the
>> > trigger/restore temp but not the throttle state.
>>
>> The throttle state is mostly set by the kernel's thermal governor
>> and
>> the user has only very little control over it. The thermal governor
>> runs every X seconds and will change the state if it thinks it is
>> too
>> low or too high.
>>
>> The default step_wise governor will aggressively set it to zero if
>> the
>> system isn't overheating, for example.
>>
>
> I don't think there is any conflict between your comment and second
> change. If we keep the check that previous cooling state shall be
> different from the new cooling state, this will bother users who only
> wants to change the temp1_crit but not the cur_state. It is
> unreasonable for the user, if they wants the new temp1_crit to take
> effect in the firmware, they must set a differnt cooling state.

The point I wanted to make is that the kernel sets the throttle state a lot, and
I assumed that it could also do it even if the state does not change, which
would send unnecessary mcu commands. But it's apparently not the case.

Also, if the user changes temp1_crit or temp1_max, the changes should probably
be applied immediately instead of expecting the user to change the state afterward.