2024-06-09 07:53:31

by Johan Hovold

Subject: cpufreq/thermal regression in 6.10

Hi,

Steev reported to me off-list that the CPU frequency of the big cores on
the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
frequency with 6.10-rc2.

I just confirmed that once the cores are fully throttled (using the
stepwise thermal governor) due to the skin temperature reaching the
first trip point, scaling_max_freq gets stuck at the next OPP:

cpu4/cpufreq/scaling_max_freq:940800
cpu5/cpufreq/scaling_max_freq:940800
cpu6/cpufreq/scaling_max_freq:940800
cpu7/cpufreq/scaling_max_freq:940800

when the temperature drops again.

This obviously leads to a massive performance drop and could possibly
also be related to reports like this one:

https://lore.kernel.org/all/CAHk-=wjwFGQZcDinK=BkEaA8FSyVg5NaUe0BobxowxeZ5PvetA@mail.gmail.com/

I assume the regression may have been introduced by all the thermal work
that went into 6.10-rc1, but I don't have time to try to track this down
myself right now (and will be away from keyboard most of next week).

I've confirmed that 6.9 works as expected.

Johan


#regzbot introduced: v6.9..v6.10-rc2
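
For anyone who wants to check for the same symptom, here is a minimal
sketch (not part of the original report) that compares each CPU's
scaling_max_freq with its cpuinfo_max_freq and lists the CPUs whose cap
is below the hardware maximum. The helper names are made up for
illustration, and a lowered cap can also come from user policy, so a hit
is only a hint that a thermal constraint may still be in place.

```python
from pathlib import Path


def read_khz(path):
    """Return the integer kHz value stored in a cpufreq sysfs file."""
    return int(Path(path).read_text().strip())


def capped_cpus(cpus, root="/sys/devices/system/cpu"):
    """Map CPU number -> (scaling_max_freq, cpuinfo_max_freq) for every
    CPU whose current cap is below its hardware maximum.

    `root` is a parameter only so the function can be pointed at a
    test directory; on a real system the default should be used.
    """
    capped = {}
    for cpu in cpus:
        base = Path(root) / f"cpu{cpu}" / "cpufreq"
        cap = read_khz(base / "scaling_max_freq")
        hw_max = read_khz(base / "cpuinfo_max_freq")
        if cap < hw_max:
            capped[cpu] = (cap, hw_max)
    return capped
```

Calling capped_cpus(range(4, 8)) after the machine has cooled down
again would be expected to return an empty dict on a working kernel and
entries for cpu4..cpu7 on an affected one.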


2024-06-10 11:17:30

by Rafael J. Wysocki

Subject: Re: cpufreq/thermal regression in 6.10

Hi,

Thanks for the report.

On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <[email protected]> wrote:
>
> Hi,
>
> Steev reported to me off-list that the CPU frequency of the big cores on
> the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
> frequency with 6.10-rc2.
>
> I just confirmed that once the cores are fully throttled (using the
> stepwise thermal governor) due to the skin temperature reaching the
> first trip point, scaling_max_freq gets stuck at the next OPP:
>
> cpu4/cpufreq/scaling_max_freq:940800
> cpu5/cpufreq/scaling_max_freq:940800
> cpu6/cpufreq/scaling_max_freq:940800
> cpu7/cpufreq/scaling_max_freq:940800
>
> when the temperature drops again.

So apparently something fails to update its frequency QoS request.

Would it be possible to provoke this with thermal debug enabled
(CONFIG_THERMAL_DEBUGFS set) and see what's there in
/sys/kernel/debug/thermal/?
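
One way to capture that state for a report is to dump every readable
file under the directory. This is a minimal sketch only, assuming
debugfs is mounted at /sys/kernel/debug and the script runs as root.
The helper name dump_tree is hypothetical, and nothing is assumed about
the layout below thermal/; the code just walks whatever is there.

```python
import os


def dump_tree(root="/sys/kernel/debug/thermal"):
    """Return {relative path: contents} for every file under root.

    Files that cannot be read are recorded with a placeholder string
    instead of aborting the whole dump.
    """
    dump = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            try:
                with open(full, "r", errors="replace") as f:
                    dump[rel] = f.read()
            except OSError as exc:
                dump[rel] = f"<unreadable: {exc}>"
    return dump
```

Printing the returned dict once while throttled and once after the
temperature has dropped gives two snapshots that can be compared.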

> This obviously leads to a massive performance drop and could possibly
> also be related to reports like this one:
>
> https://lore.kernel.org/all/CAHk-=wjwFGQZcDinK=BkEaA8FSyVg5NaUe0BobxowxeZ5PvetA@mail.gmail.com/
>
> I assume the regression may have been introduced by all the thermal work
> that went into 6.10-rc1, but I don't have time to try to track this down
> myself right now (and will be away from keyboard most of next week).
>
> I've confirmed that 6.9 works as expected.

Well, I'd need to ask someone else affected by this, then.

2024-06-11 11:03:32

by Rafael J. Wysocki

Subject: Re: cpufreq/thermal regression in 6.10

On Mon, Jun 10, 2024 at 1:17 PM Rafael J. Wysocki <[email protected]> wrote:
>
> Hi,
>
> Thanks for the report.
>
> On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <[email protected]> wrote:
> >
> > Hi,
> >
> > Steev reported to me off-list that the CPU frequency of the big cores on
> > the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
> > frequency with 6.10-rc2.
> >
> > I just confirmed that once the cores are fully throttled (using the
> > stepwise thermal governor) due to the skin temperature reaching the
> > first trip point, scaling_max_freq gets stuck at the next OPP:
> >
> > cpu4/cpufreq/scaling_max_freq:940800
> > cpu5/cpufreq/scaling_max_freq:940800
> > cpu6/cpufreq/scaling_max_freq:940800
> > cpu7/cpufreq/scaling_max_freq:940800
> >
> > when the temperature drops again.
>
> So apparently something fails to update its frequency QoS request.
>
> Would it be possible to provoke this with thermal debug enabled
> (CONFIG_THERMAL_DEBUGFS set) and see what's there in
> /sys/kernel/debug/thermal/?
>
> > This obviously leads to a massive performance drop and could possibly
> > also be related to reports like this one:
> >
> > https://lore.kernel.org/all/CAHk-=wjwFGQZcDinK=BkEaA8FSyVg5NaUe0BobxowxeZ5PvetA@mail.gmail.com/
> >
> > I assume the regression may have been introduced by all the thermal work
> > that went into 6.10-rc1, but I don't have time to try to track this down
> > myself right now (and will be away from keyboard most of next week).
> >
> > I've confirmed that 6.9 works as expected.
>
> Well, I'd need to ask someone else affected by this, then.

If this is the step-wise governor, the problem might have been
introduced by commit

042a3d80f118 thermal: core: Move passive polling management to the core

which removed passive polling count updates from that governor, so if
the thermal zone in question has passive polling only and no regular
polling, temperature updates may stop coming before the governor drops
the cooling device states to the "no target" level.

So please test the attached partial revert of the above commit when you can.


Attachments:
thermal-gov_step_wise--revert-passive.patch (848.00 B)
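
The failure mode described above can be illustrated with a toy model.
This is not kernel code and not the contents of the attached patch; it
is a sketch under simplifying assumptions: the zone has no regular
polling, a temperature update reaches the governor only on an upward
trip crossing or while passive polling is active, and the governor
moves the cooling state by one step per update. All names (simulate,
governor_holds_polling) are made up for the illustration.

```python
def simulate(temps, trip, max_state, governor_holds_polling):
    """Return the final cooling state (0 = unthrottled) after feeding
    the temperature samples through the toy model.

    governor_holds_polling=True:  polling stays active until the
        cooling state is back at 0 (modelled on the governor keeping
        the passive count itself).
    governor_holds_polling=False: polling stops as soon as the
        temperature is below the trip point (modelled on polling being
        tied to trip crossings only).
    """
    state = 0        # current cooling device state
    polling = False  # is passive polling active?
    above = False    # last known side of the trip point
    for t in temps:
        crossed_up = t >= trip and not above
        if not polling and not crossed_up:
            continue  # nobody delivers this sample to the governor
        above = t >= trip
        if above:
            state = min(state + 1, max_state)  # throttle one more step
        elif state > 0:
            state -= 1                         # relax one step
        polling = (state > 0) if governor_holds_polling else above
    return state


temps = [50, 60, 60, 60, 60, 40, 40, 40, 40, 40, 40]
stuck = simulate(temps, trip=55, max_state=3, governor_holds_polling=False)
recovered = simulate(temps, trip=55, max_state=3, governor_holds_polling=True)
```

With these samples the first call ends at state 2, one step below full
throttling, which corresponds to the "stuck at the next OPP" symptom in
the report. The second call ends at state 0.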

2024-06-11 12:02:34

by Johan Hovold

Subject: Re: cpufreq/thermal regression in 6.10

On Tue, Jun 11, 2024 at 12:54:25PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jun 10, 2024 at 1:17 PM Rafael J. Wysocki <[email protected]> wrote:
> > On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <[email protected]> wrote:

> > > Steev reported to me off-list that the CPU frequency of the big cores on
> > > the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
> > > frequency with 6.10-rc2.
> > >
> > > I just confirmed that once the cores are fully throttled (using the
> > > stepwise thermal governor) due to the skin temperature reaching the
> > > first trip point, scaling_max_freq gets stuck at the next OPP:
> > >
> > > cpu4/cpufreq/scaling_max_freq:940800
> > > cpu5/cpufreq/scaling_max_freq:940800
> > > cpu6/cpufreq/scaling_max_freq:940800
> > > cpu7/cpufreq/scaling_max_freq:940800
> > >
> > > when the temperature drops again.

> If this is the step-wise governor, the problem might have been
> introduced by commit
>
> 042a3d80f118 thermal: core: Move passive polling management to the core
>
> which removed passive polling count updates from that governor, so if
> the thermal zone in question has passive polling only and no regular
> polling, temperature updates may stop coming before the governor drops
> the cooling device states to the "no target" level.
>
> So please test the attached partial revert of the above commit when you can.

Thanks for the quick fix. The partial revert seems to do the trick:

Tested-by: Johan Hovold <[email protected]>

Johan

2024-06-11 21:19:45

by Steev Klimaszewski

Subject: Re: cpufreq/thermal regression in 6.10


On 6/11/24 7:02 AM, Johan Hovold wrote:
> On Tue, Jun 11, 2024 at 12:54:25PM +0200, Rafael J. Wysocki wrote:
>> On Mon, Jun 10, 2024 at 1:17 PM Rafael J. Wysocki <[email protected]> wrote:
>>> On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <[email protected]> wrote:
>>>> Steev reported to me off-list that the CPU frequency of the big cores on
>>>> the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
>>>> frequency with 6.10-rc2.
>>>>
>>>> I just confirmed that once the cores are fully throttled (using the
>>>> stepwise thermal governor) due to the skin temperature reaching the
>>>> first trip point, scaling_max_freq gets stuck at the next OPP:
>>>>
>>>> cpu4/cpufreq/scaling_max_freq:940800
>>>> cpu5/cpufreq/scaling_max_freq:940800
>>>> cpu6/cpufreq/scaling_max_freq:940800
>>>> cpu7/cpufreq/scaling_max_freq:940800
>>>>
>>>> when the temperature drops again.
>> If this is the step-wise governor, the problem might have been
>> introduced by commit
>>
>> 042a3d80f118 thermal: core: Move passive polling management to the core
>>
>> which removed passive polling count updates from that governor, so if
>> the thermal zone in question has passive polling only and no regular
>> polling, temperature updates may stop coming before the governor drops
>> the cooling device states to the "no target" level.
>>
>> So please test the attached partial revert of the above commit when you can.
> Thanks for the quick fix. The partial revert seems to do the trick:
>
> Tested-by: Johan Hovold <[email protected]>
>
> Johan

I can also confirm that it's working here!

Tested-by: Steev Klimaszewski <[email protected]>