2024-05-30 10:24:39

by Thorsten Leemhuis

[permalink] [raw]
Subject: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

[adding the culprits author, LKML, and the regression mailing list to
the list of recipients; changing subject, too]

On 29.05.24 21:52, [email protected] wrote:
> After bisection I have reported a bug according to thermal_core.c:
> After "Resume thermal zones asynchronously" commit: Wrong Battery
> Reading after Wake from S3 Sleep - Lenovo Thinkpad P1 Gen2
>
> https://bugzilla.kernel.org/show_bug.cgi?id=218881
> Could you please have a look at it
>
> I have performed a bisection and the culprit is this commit: Resume
> thermal zones asynchronously
> git bisect bad 5a5efdaffda5d23717d9117cf36cda9eafcf2fae
> # first bad commit: [5a5efdaffda5d23717d9117cf36cda9eafcf2fae] thermal:
> core: Resume thermal zones asynchronously
>
> I have also verified it by compiling a kernel

Side note: not critical at all, but would have been good if you had
specified which kernel version you build.

> without this commit.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 5a5efdaffda5d23717d9117cf36cda9eafcf2
#regzbot dup: https://bugzilla.kernel.org/show_bug.cgi?id=218881
#regzbot title
#regzbot ignore-activity

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.


2024-05-30 11:10:24

by [email protected]

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

Thanks Thorsten for the side note.

I have compiled kernel 6.8.11 with reverted commit
5a5efdaffda5d23717d9117cf36cda9eafcf2fae.

Battery Status works fine now with reverted commit after S3 Sleep and
Wake cycles.

Best, Reinhard


Am 30.05.24 um 12:21 schrieb Linux regression tracking (Thorsten Leemhuis):
> [adding the culprits author, LKML, and the regression mailing list to
> the list of recipients; changing subject, too]
>
> On 29.05.24 21:52, [email protected] wrote:
>> After bisection I have reported a bug according to thermal_core.c:
>> After "Resume thermal zones asynchronously" commit: Wrong Battery
>> Reading after Wake from S3 Sleep - Lenovo Thinkpad P1 Gen2
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=218881
>> Could you please have a look at it
>>
>> I have performed a bisection and the culprit is this commit: Resume
>> thermal zones asynchronously
>> git bisect bad 5a5efdaffda5d23717d9117cf36cda9eafcf2fae
>> # first bad commit: [5a5efdaffda5d23717d9117cf36cda9eafcf2fae] thermal:
>> core: Resume thermal zones asynchronously
>>
>> I have also verified it by compiling a kernel
> Side note: not critical at all, but would have been good if you had
> specified which kernel version you build.
>
>> without this commit.
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
>
> #regzbot ^introduced 5a5efdaffda5d23717d9117cf36cda9eafcf2
> #regzbot dup: https://bugzilla.kernel.org/show_bug.cgi?id=218881
> #regzbot title
> #regzbot ignore-activity
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> That page also explains what to do if mails like this annoy you.


2024-06-03 18:39:33

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

On Thu, May 30, 2024 at 1:10 PM [email protected] <[email protected]> wrote:
>
> Thanks Thorsten for the side note.
>
> I have compiled kernel 6.8.11 with reverted commit
> 5a5efdaffda5d23717d9117cf36cda9eafcf2fae.
>
> Battery Status works fine now with reverted commit after S3 Sleep and
> Wake cycles.

Well, the connection between the battery status and the resume of
thermal zones is somewhat unclear to me at the moment.

Most likely, the commit in question changes the timing of system
resume which affects the battery behavior and it seems to be related
to the EC somehow.

Let's first see what thermal zones there are on your system, so please
send the output of

$ cat /sys/class/thermal/thermal_zone*/type

2024-06-03 19:34:40

by [email protected]

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

I Totally agree, this was also my first thought, what has the battery
state to do with thermals.
But at least, so far, we have in total three repros confirmed in the bug
report. All of the same machine.
Thinkpad X1 Xtreme Gen2 = Thinkpad P1 Gen2. The only difference is the
graphics Nvidia Geforce vs Nvidia Quadro

these are the types of thermal zones:

cat /sys/class/thermal/thermal_zone*/type
acpitz
SEN6
SEN7
SEN8
SEN9
SENA
SENB
SENC
SEND
x86_pkg_temp
iwlwifi_1
INT3400 Thermal
SEN1
SEN2
pch_cannonlake
SEN3
SEN0
B0D4
SEN4
SEN5


Am 03.06.24 um 20:38 schrieb Rafael J. Wysocki:
> On Thu, May 30, 2024 at 1:10 PM [email protected] <[email protected]> wrote:
>> Thanks Thorsten for the side note.
>>
>> I have compiled kernel 6.8.11 with reverted commit
>> 5a5efdaffda5d23717d9117cf36cda9eafcf2fae.
>>
>> Battery Status works fine now with reverted commit after S3 Sleep and
>> Wake cycles.
> Well, the connection between the battery status and the resume of
> thermal zones is somewhat unclear to me at the moment.
>
> Most likely, the commit in question changes the timing of system
> resume which affects the battery behavior and it seems to be related
> to the EC somehow.
>
> Let's first see what thermal zones there are on your system, so please
> send the output of
>
> $ cat /sys/class/thermal/thermal_zone*/type


2024-06-12 17:23:47

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

Restored list CCs.

On Wed, Jun 12, 2024 at 3:41 PM Rafael J. Wysocki <[email protected]> wrote:
>
> On Wed, Jun 12, 2024 at 11:56 AM [email protected] <[email protected]> wrote:
> >
> > Am 11.06.24 um 16:42 schrieb Rafael J. Wysocki:
> >
> > This doesn't make them run in a different order, it just delays both
> > of them, because the notifiers are called sequentially.
> >
> > However, if you added the msleep() at the beginning of
> > thermal_zone_device_resume(), it would change the ordering of this
> > function with respect to the PM notifiers, so please try doing this.
> >
> > I did so and added msleep(1000) to thermal_core.c line 1634
> > I have also reverted the patch you sent me.
> >
> > The battery readings after resume from S3 sleep where fine.
> > I have tried 2 reboots with 4 sleep/wake cycles, respectively
>
> Thanks!
>
> This means that the two code paths in question somehow interfere
> destructively when they are running in parallel with each other.

One more thing to try is the attached patch (independent of the
previous one) to lower the priority of the thermal PM notifier to make
it run always after the ACPI battery one.

Please test this one too and let me know if it works for you.


Attachments:
thermal-core-resume-prio.patch (499.00 B)

2024-06-13 15:16:36

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

On Wed, Jun 12, 2024 at 7:23 PM Rafael J. Wysocki <[email protected]> wrote:
>
> Restored list CCs.
>
> On Wed, Jun 12, 2024 at 3:41 PM Rafael J. Wysocki <[email protected]> wrote:
> >
> > On Wed, Jun 12, 2024 at 11:56 AM [email protected] <[email protected]> wrote:
> > >
> > > Am 11.06.24 um 16:42 schrieb Rafael J. Wysocki:
> > >
> > > This doesn't make them run in a different order, it just delays both
> > > of them, because the notifiers are called sequentially.
> > >
> > > However, if you added the msleep() at the beginning of
> > > thermal_zone_device_resume(), it would change the ordering of this
> > > function with respect to the PM notifiers, so please try doing this.
> > >
> > > I did so and added msleep(1000) to thermal_core.c line 1634
> > > I have also reverted the patch you sent me.
> > >
> > > The battery readings after resume from S3 sleep where fine.
> > > I have tried 2 reboots with 4 sleep/wake cycles, respectively
> >
> > Thanks!
> >
> > This means that the two code paths in question somehow interfere
> > destructively when they are running in parallel with each other.
>
> One more thing to try is the attached patch (independent of the
> previous one) to lower the priority of the thermal PM notifier to make
> it run always after the ACPI battery one.
>
> Please test this one too and let me know if it works for you.

Attached is a slightly modified version of the last patch I sent.
Please test it and let me know if it addresses the problem you are
seeing.

If it helps, I think we are done with this at least for now.


Attachments:
thermal-core-resume-prio.patch (688.00 B)

2024-06-13 20:14:17

by [email protected]

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

Am 13.06.24 um 17:14 schrieb Rafael J. Wysocki:
> Let's see if the ACPI thermal zone is the real culprit.
>
> The attached patch only adds the delay for thermal zone 0 which is the
> ACPI thermal zone. It also prints the ID and type for all of the
> resuming thermal zones.
>
> Please test it (removing all of the test changes/patches tried so far)
> and let me know what happens to the battery readings.
Patch

thermal-delay-resume.patch

does not work. Output according to dmesg.txt
> Attached is a slightly modified version of the last patch I sent.
> Please test it and let me know if it addresses the problem you are
> seeing.
>
> If it helps, I think we are done with this at least for now.
patch thermal-core-resume-prio.patch with .priority = -1 does work
> One more thing to try is the attached patch (independent of the
> previous one) to lower the priority of the thermal PM notifier to make
> it run always after the ACPI battery one.
>
> Please test this one too and let me know if it works for you.
patch thermal-core-resume-prio.patch

with .priority = INT_MIN does also work.

If you need any further tests, please don't hesitate to tell me so.

Thank you for your help!


Attachments:
dmesg.txt (5.99 kB)

2024-06-14 10:33:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

On Thu, Jun 13, 2024 at 10:04 PM [email protected] <[email protected]> wrote:
>
> Am 13.06.24 um 17:14 schrieb Rafael J. Wysocki:
>
> Let's see if the ACPI thermal zone is the real culprit.
>
> The attached patch only adds the delay for thermal zone 0 which is the
> ACPI thermal zone. It also prints the ID and type for all of the
> resuming thermal zones.
>
> Please test it (removing all of the test changes/patches tried so far)
> and let me know what happens to the battery readings.
>
> Patch
>
> thermal-delay-resume.patch
>
> does not work. Output according to dmesg.txt
>
> Attached is a slightly modified version of the last patch I sent.
> Please test it and let me know if it addresses the problem you are
> seeing.
>
> If it helps, I think we are done with this at least for now.
>
> patch thermal-core-resume-prio.patch with .priority = -1 does work
>
> One more thing to try is the attached patch (independent of the
> previous one) to lower the priority of the thermal PM notifier to make
> it run always after the ACPI battery one.
>
> Please test this one too and let me know if it works for you.
>
> patch thermal-core-resume-prio.patch
>
> with .priority = INT_MIN does also work.
>
> If you need any further tests, please don't hesitate to tell me so.

No, thank you, I think you've done enough and it is appreciated.

I don't see any particular drawbacks of this approach and investing
more time in trying to get to the bottom of the issue is probably not
worth it.

I'm going to submit the last patch as the proposed solution.

> Thank you for your help!

No problem and thank you!

2024-06-16 07:50:19

by [email protected]

[permalink] [raw]
Subject: Re: Regression, thermal: core: battery reading wrong after wake from S3 [Was: Bug Report according to thermal_core.c]

Currently I am running kernel 6.8.12 with reverted commit.
But yesterday I had a strange behavior.
After resuming from S3 the laptop registered a low battery, immediately
after resume (notification in KDE Plasme). But as far as I could check,
the battery level and stats where fine. Nevertheless, after 1 minute it
shut off. I guess the low battery reading directly after after resume
triggered it.

It is not directly related with the commit, because I have reverted it,
but nevertheless, maybe it is helpful to understand the underlying
firmware issue.

I' going to test your last patch and let you know.


Am 14.06.24 um 12:18 schrieb Rafael J. Wysocki:
> On Thu, Jun 13, 2024 at 10:04 PM [email protected] <[email protected]> wrote:
>> Am 13.06.24 um 17:14 schrieb Rafael J. Wysocki:
>>
>> Let's see if the ACPI thermal zone is the real culprit.
>>
>> The attached patch only adds the delay for thermal zone 0 which is the
>> ACPI thermal zone. It also prints the ID and type for all of the
>> resuming thermal zones.
>>
>> Please test it (removing all of the test changes/patches tried so far)
>> and let me know what happens to the battery readings.
>>
>> Patch
>>
>> thermal-delay-resume.patch
>>
>> does not work. Output according to dmesg.txt
>>
>> Attached is a slightly modified version of the last patch I sent.
>> Please test it and let me know if it addresses the problem you are
>> seeing.
>>
>> If it helps, I think we are done with this at least for now.
>>
>> patch thermal-core-resume-prio.patch with .priority = -1 does work
>>
>> One more thing to try is the attached patch (independent of the
>> previous one) to lower the priority of the thermal PM notifier to make
>> it run always after the ACPI battery one.
>>
>> Please test this one too and let me know if it works for you.
>>
>> patch thermal-core-resume-prio.patch
>>
>> with .priority = INT_MIN does also work.
>>
>> If you need any further tests, please don't hesitate to tell me so.
> No, thank you, I think you've done enough and it is appreciated.
>
> I don't see any particular drawbacks of this approach and investing
> more time in trying to get to the bottom of the issue is probably not
> worth it.
>
> I'm going to submit the last patch as the proposed solution.
>
>> Thank you for your help!
> No problem and thank you!