2016-11-22 17:44:58

by Thomas Gleixner

Subject: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

After the first attempt to convert the coretemp driver to the hotplug state
machine failed, we had a deeper look and went a bit farther.

The driver has quite some interesting concepts vs. the package, core and
sysfs file management and a bug in the package temperature sysfs interface
vs. cpu hotplug.

The following series fixes that bug and simplifies the package/core
management and at the end converts it to the hotplug state machine.

Along with the source size the binary size shrinks as well:
   text    data     bss     dec     hex
   4068     360      20    4448    1160 Before
   3801     180      36    4017     fb1 After

Thanks,

tglx
-----
coretemp.c | 321 +++++++++++++++++++++----------------------------------------
1 file changed, 113 insertions(+), 208 deletions(-)




2016-11-23 15:29:41

by Guenter Roeck

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

On 11/22/2016 09:42 AM, Thomas Gleixner wrote:
> After the first attempt to convert the coretemp driver to the hotplug state
> machine failed, we had a deeper look and went a bit farther.
>
> The driver has quite some interesting concepts vs. the package, core and
> sysfs file management and a bug in the package temperature sysfs interface
> vs. cpu hotplug.
>
> The following series fixes that bug and simplifies the package/core
> management and at the end converts it to the hotplug state machine.
>
> Along with the source size the binary size shrinks as well:
>    text    data     bss     dec     hex
>    4068     360      20    4448    1160 Before
>    3801     180      36    4017     fb1 After
>
> Thanks,
>
> tglx
> -----
> coretemp.c | 321 +++++++++++++++++++++----------------------------------------
> 1 file changed, 113 insertions(+), 208 deletions(-)
>
>
>
>
Looks good. Series applied to -next.

Guenter

2017-04-12 08:32:02

by Tommi Rantala

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

2016-11-23 17:28 GMT+02:00 Guenter Roeck <[email protected]>:
>
> On 11/22/2016 09:42 AM, Thomas Gleixner wrote:
>>
>> After the first attempt to convert the coretemp driver to the hotplug state
>> machine failed, we had a deeper look and went a bit farther.
>>
>> The driver has quite some interesting concepts vs. the package, core and
>> sysfs file management and a bug in the package temperature sysfs interface
>> vs. cpu hotplug.
>>
>> The following series fixes that bug and simplifies the package/core
>> management and at the end converts it to the hotplug state machine.
>>
>> Along with the source size the binary size shrinks as well:
>>    text    data     bss     dec     hex
>>    4068     360      20    4448    1160 Before
>>    3801     180      36    4017     fb1 After
>>
>> Thanks,
>>
>> tglx
>> -----
>> coretemp.c | 321 +++++++++++++++++++++----------------------------------------
>> 1 file changed, 113 insertions(+), 208 deletions(-)

Hi,

Resume-from-suspend stopped working in HP xw6600 in fedora kernel
4.10.8-200.fc25.x86_64, while it worked just fine in
4.9.9-200.fc25.x86_64.

When powering on the suspended PC, there is no video output, and to
recover, I need to reset the machine.
Nothing is recorded in the journal logs for the resume, last lines are
from the suspend:

Apr 08 15:41:49 xw6600 systemd[1]: Reached target Sleep.
Apr 08 15:41:49 xw6600 systemd[1]: Starting Suspend...
Apr 08 15:41:49 xw6600 systemd-sleep[6675]: Suspending system...

Also tested 4.11-rc5, but it fails the same way.

Bisection leads to commit:

commit e00ca5df37adc68052ea699cbd010ee4e19e39e4
Author: Thomas Gleixner <[email protected]>
Date: Tue Nov 22 17:42:04 2016 +0000

hwmon: (coretemp) Convert to hotplug state machine

Install the callbacks via the state machine. Setup and teardown are handled
by the hotplug core.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Cc: [email protected]
Cc: Fenghua Yu <[email protected]>
Cc: Jean Delvare <[email protected]>
Cc: [email protected]
Cc: Guenter Roeck <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Guenter Roeck <[email protected]>

If I do "modprobe -r coretemp", then the resume works OK with
4.10.8-200.fc25.x86_64.

Any ideas?

4.9.9-200.fc25.x86_64 dmesg:
http://termbin.com/3kcl

4.10.8-200.fc25.x86_64 dmesg:
http://termbin.com/62d9

-Tommi

2017-04-12 09:29:07

by Thomas Gleixner

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

On Wed, 12 Apr 2017, Tommi Rantala wrote:
> Resume-from-suspend stopped working in HP xw6600 in fedora kernel
> 4.10.8-200.fc25.x86_64, while it worked just fine in
> 4.9.9-200.fc25.x86_64.
>
> When powering on the suspended PC, there is no video output, and to
> recover, I need to reset the machine.

Is there just no video output or is the machine completely frozen? If it's
not completely dead, then you might be able to ssh into it.

Thanks,

tglx

2017-04-12 10:43:06

by Tommi Rantala

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

2017-04-12 12:28 GMT+03:00 Thomas Gleixner <[email protected]>:
> On Wed, 12 Apr 2017, Tommi Rantala wrote:
>> Resume-from-suspend stopped working in HP xw6600 in fedora kernel
>> 4.10.8-200.fc25.x86_64, while it worked just fine in
>> 4.9.9-200.fc25.x86_64.
>>
>> When powering on the suspended PC, there is no video output, and to
>> recover, I need to reset the machine.
>
> Is there just no video output or is the machine completely frozen? If it's
> not completely dead, then you might be able to ssh into it.

It's completely hosed: not possible to ssh, does not respond to ping either.

I made a quick test with netconsole. After booting with
no_console_suspend=1, and setting the netconsole parameters, I can get
kernel messages (to my android phone) when suspending the machine. But
no messages after the failed resume.
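
For reference, a minimal sketch of how the netconsole parameter string for such a test could be assembled. The helper name netconsole_param, the interface name and the address are hypothetical; the field layout follows the kernel's netconsole documentation, and leaving the source port, source IP and target MAC empty lets the kernel use its defaults.

```shell
#!/bin/sh
# Sketch: build the value for the netconsole= module/boot parameter.
# Documented layout:
#   [+][src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
# Source port, source IP and target MAC are left empty on purpose.
netconsole_param() {
    dev="$1"                 # local network interface (hypothetical: eth0)
    tgt_ip="$2"              # machine that receives the log (hypothetical)
    tgt_port="${3:-6666}"    # 6666 is the documented default target port
    printf '@/%s,%s@%s/\n' "$dev" "$tgt_port" "$tgt_ip"
}
```

The result would be passed as root along the lines of: modprobe netconsole netconsole="$(netconsole_param eth0 192.0.2.10)", with a UDP listener on the chosen port running on the receiving machine.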

Hmm, might I be able to capture messages over USB serial port...?

-Tommi

2017-04-12 10:52:59

by Thomas Gleixner

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

On Wed, 12 Apr 2017, Tommi Rantala wrote:
> 2017-04-12 12:28 GMT+03:00 Thomas Gleixner <[email protected]>:
> > On Wed, 12 Apr 2017, Tommi Rantala wrote:
> >> Resume-from-suspend stopped working in HP xw6600 in fedora kernel
> >> 4.10.8-200.fc25.x86_64, while it worked just fine in
> >> 4.9.9-200.fc25.x86_64.
> >>
> >> When powering on the suspended PC, there is no video output, and to
> >> recover, I need to reset the machine.
> >
> > Is there just no video output or is the machine completely frozen? If it's
> > not completely dead, then you might be able to ssh into it.
>
> It's completely hosed: not possible to ssh, does not respond to ping either.
>
> I made a quick test with netconsole. After booting with
> no_console_suspend=1, and setting the netconsole parameters, I can get
> kernel messages (to my android phone) when suspending the machine. But
> no messages after the failed resume.

Let's do something else first.

Can you please try to offline/online CPUs from the console?

# echo 0 >/sys/devices/system/cpu/cpu1/online
# echo 1 >/sys/devices/system/cpu/cpu1/online

If that works, then try to offline all CPUs (except 0) in the same order as
suspend (1 ... 7) and then online them again in the same order?

Thanks,

tglx

2017-04-12 11:00:19

by Tommi Rantala

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

2017-04-12 13:52 GMT+03:00 Thomas Gleixner <[email protected]>:
> On Wed, 12 Apr 2017, Tommi Rantala wrote:
>> 2017-04-12 12:28 GMT+03:00 Thomas Gleixner <[email protected]>:
>> > On Wed, 12 Apr 2017, Tommi Rantala wrote:
>> >> Resume-from-suspend stopped working in HP xw6600 in fedora kernel
>> >> 4.10.8-200.fc25.x86_64, while it worked just fine in
>> >> 4.9.9-200.fc25.x86_64.
>> >>
>> >> When powering on the suspended PC, there is no video output, and to
>> >> recover, I need to reset the machine.
>> >
>> > Is there just no video output or is the machine completely frozen? If it's
>> > not completely dead, then you might be able to ssh into it.
>>
>> It's completely hosed: not possible to ssh, does not respond to ping either.
>>
>> I made a quick test with netconsole. After booting with
>> no_console_suspend=1, and setting the netconsole parameters, I can get
>> kernel messages (to my android phone) when suspending the machine. But
>> no messages after the failed resume.
>
> Let's do something else first.
>
> Can you please try to offline/online CPUs from the console?
>
> # echo 0 >/sys/devices/system/cpu/cpu1/online
> # echo 1 >/sys/devices/system/cpu/cpu1/online

ok, that works.

> If that works, then try to offline all CPUs (except 0) in the same order as
> suspend (1 ... 7) and then online them again in the same order?

Seems to work without problems:

# for i in $(seq 1 7) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online ; done

[ 1237.317537] intel_powerclamp: No package C-state available
[ 1308.997620] smpboot: CPU 1 is now offline
[ 1309.007167] intel_powerclamp: No package C-state available
[ 1309.032563] smpboot: CPU 2 is now offline
[ 1309.038118] intel_powerclamp: No package C-state available
[ 1309.072495] smpboot: CPU 3 is now offline
[ 1309.077807] intel_powerclamp: No package C-state available
[ 1309.099545] Broke affinity for irq 29
[ 1309.100587] smpboot: CPU 4 is now offline
[ 1309.105346] intel_powerclamp: No package C-state available
[ 1309.135530] Broke affinity for irq 22
[ 1309.135540] Broke affinity for irq 29
[ 1309.136579] smpboot: CPU 5 is now offline
[ 1309.141653] intel_powerclamp: No package C-state available
[ 1309.171517] Broke affinity for irq 22
[ 1309.171526] Broke affinity for irq 29
[ 1309.171535] Broke affinity for irq 31
[ 1309.172586] smpboot: CPU 6 is now offline
[ 1309.176967] intel_powerclamp: No package C-state available
[ 1309.209122] Broke affinity for irq 19
[ 1309.209126] Broke affinity for irq 22
[ 1309.209135] Broke affinity for irq 29
[ 1309.209145] Broke affinity for irq 31
[ 1309.212071] smpboot: CPU 7 is now offline


# for i in $(seq 1 7) ; do echo 1 > /sys/devices/system/cpu/cpu$i/online ; done

[ 1309.217476] intel_powerclamp: No package C-state available
[ 1380.624184] x86: Booting SMP configuration:
[ 1380.624186] smpboot: Booting Node 0 Processor 1 APIC 0x4
[ 1380.659810] intel_powerclamp: No package C-state available
[ 1380.659957] smpboot: Booting Node 0 Processor 2 APIC 0x2
[ 1380.671198] microcode: sig=0x10676, pf=0x40, revision=0x60f
[ 1380.672088] smpboot: Booting Node 0 Processor 3 APIC 0x6
[ 1380.677952] intel_powerclamp: No package C-state available
[ 1380.686260] microcode: sig=0x1067a, pf=0x40, revision=0xa0b
[ 1380.687098] smpboot: Booting Node 0 Processor 4 APIC 0x1
[ 1380.699214] microcode: sig=0x10676, pf=0x40, revision=0x60f
[ 1380.699742] intel_powerclamp: No package C-state available
[ 1380.700267] smpboot: Booting Node 0 Processor 5 APIC 0x5
[ 1380.715207] microcode: sig=0x1067a, pf=0x40, revision=0xa0b
[ 1380.716202] smpboot: Booting Node 0 Processor 6 APIC 0x3
[ 1380.730264] microcode: sig=0x10676, pf=0x40, revision=0x60f
[ 1380.730567] intel_powerclamp: No package C-state available
[ 1380.731267] smpboot: Booting Node 0 Processor 7 APIC 0x7
[ 1380.748276] microcode: sig=0x1067a, pf=0x40, revision=0xa0b

2017-04-12 14:54:02

by Thomas Gleixner

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

On Wed, 12 Apr 2017, Tommi Rantala wrote:
> 2017-04-12 13:52 GMT+03:00 Thomas Gleixner <[email protected]>:
> > Can you please try to offline/online CPUs from the console?
> >
> > # echo 0 >/sys/devices/system/cpu/cpu1/online
> > # echo 1 >/sys/devices/system/cpu/cpu1/online
>
> ok, that works.
>
> > If that works, then try to offline all CPUs (except 0) in the same order as
> > suspend (1 ... 7) and then online them again in the same order?
>
> Seems to work without problems:

Good.

Can you please try the following:

# for STATE in freezer devices platform processors core; do \
    echo $STATE; \
    echo $STATE >/sys/power/pm_test; \
    echo mem >/sys/power/state; \
  done

That should give us at least a hint in which area to dig.
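
A slightly more defensive variant of that loop, as a sketch only: it flushes pending writes before each step, so the name of the last level tried is more likely to survive a hang, and it switches pm_test back to "none" at the end so that the next suspend is a real one. The function name run_pm_tests and the POWER_DIR override are assumptions made for illustration; on a real system POWER_DIR is /sys/power and the function has to run as root.

```shell
#!/bin/sh
# Sketch: step through the pm_test levels, shallowest first.
# POWER_DIR is a hypothetical override that allows a dry run against
# a scratch directory; the real location is /sys/power.
POWER_DIR="${POWER_DIR:-/sys/power}"

run_pm_tests() {
    for level in freezer devices platform processors core; do
        echo "testing: $level"
        sync    # flush logs to disk in case this level hangs the box
        echo "$level" > "$POWER_DIR/pm_test" || return 1
        echo mem > "$POWER_DIR/state" || return 1
    done
    # Leave test mode again; otherwise later suspends stay test runs.
    echo none > "$POWER_DIR/pm_test"
}
```

The last "testing:" line printed before the machine stops responding names the level that fails.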

Thanks,

tglx




2017-04-14 17:35:59

by Thomas Gleixner

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

On Wed, 12 Apr 2017, Thomas Gleixner wrote:
> On Wed, 12 Apr 2017, Tommi Rantala wrote:
> > 2017-04-12 13:52 GMT+03:00 Thomas Gleixner <[email protected]>:
> > > Can you please try to offline/online CPUs from the console?
> > >
> > > # echo 0 >/sys/devices/system/cpu/cpu1/online
> > > # echo 1 >/sys/devices/system/cpu/cpu1/online
> >
> > ok, that works.
> >
> > > If that works, then try to offline all CPUs (except 0) in the same order as
> > > suspend (1 ... 7) and then online them again in the same order?
> >
> > Seems to work without problems:
>
> Good.
>
> Can you please try the following:
>
> # for STATE in freezer devices platform processors core; do \
>     echo $STATE; \
>     echo $STATE >/sys/power/pm_test; \
>     echo mem >/sys/power/state; \
>   done
>
> That should give us at least a hint in which area to dig.

Any news on that?

Thanks,

tglx

2017-04-15 17:22:37

by Tommi Rantala

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

2017-04-14 20:35 GMT+03:00 Thomas Gleixner <[email protected]>:
> On Wed, 12 Apr 2017, Thomas Gleixner wrote:
>>
>> Can you please try the following:
>>
>> # for STATE in freezer devices platform processors core; do \
>>     echo $STATE; \
>>     echo $STATE >/sys/power/pm_test; \
>>     echo mem >/sys/power/state; \
>>   done
>>
>> That should give us at least a hint in which area to dig.
>
> Any news on that?

Sorry, was traveling.

Testing with 4.10.8-200.fc25.x86_64: freezer, devices and platform are
OK, it breaks at "processors".
The screen stays off, and the machine no longer answers to ping.

(Without coretemp loaded, the machine survives all the states. There
are some graphics glitches and radeon error messages)

-Tommi

2017-04-23 15:02:20

by Thomas Gleixner

Subject: Re: [patch 0/6] hwmon/coretemp: Hotplug fixes, cleanups and state machine conversion

On Sat, 15 Apr 2017, Tommi Rantala wrote:

> 2017-04-14 20:35 GMT+03:00 Thomas Gleixner <[email protected]>:
> > On Wed, 12 Apr 2017, Thomas Gleixner wrote:
> >>
> >> Can you please try the following:
> >>
> >> # for STATE in freezer devices platform processors core; do \
> >>     echo $STATE; \
> >>     echo $STATE >/sys/power/pm_test; \
> >>     echo mem >/sys/power/state; \
> >>   done
> >>
> >> That should give us at least a hint in which area to dig.
> >
> > Any news on that?
>
> Sorry, was traveling.
>
> Testing with 4.10.8-200.fc25.x86_64: freezer, devices and platform are
> OK, it breaks at "processors".
> The screen stays off, and the machine no longer answers to ping.
>
> (Without coretemp loaded, the machine survives all the states. There
> are some graphics glitches and radeon error messages)

That's odd. I tried on a similar machine (w/o a radeon card) and it just
works with the coretemp module loaded.

Can you please do a CPU hotplug cycle (just one CPU) with the cpuhp events
enabled in the tracer? Send me the trace output so I might be able to spot
what's different and what interdependencies with other callbacks might be
there.
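
One way to collect such a trace, as a sketch under a few assumptions: tracefs is reachable under /sys/kernel/debug/tracing (newer systems may mount it at /sys/kernel/tracing instead), the kernel provides the cpuhp trace event group, and the commands run as root. The function name trace_hotplug_cycle and the TRACEFS and CPU_SYSFS overrides are hypothetical and exist only so the steps can be tried against scratch directories.

```shell
#!/bin/sh
# Sketch: record the cpuhp trace events for one offline/online cycle.
# TRACEFS and CPU_SYSFS are hypothetical overrides for dry runs.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

trace_hotplug_cycle() {
    cpu="${1:-1}"                       # CPU number to cycle
    out="${2:-cpuhp-trace.txt}"         # where to save the trace
    echo > "$TRACEFS/trace"             # clear any old trace data
    echo 1 > "$TRACEFS/events/cpuhp/enable"
    echo 0 > "$CPU_SYSFS/cpu$cpu/online"    # take the CPU offline
    echo 1 > "$CPU_SYSFS/cpu$cpu/online"    # bring it back online
    echo 0 > "$TRACEFS/events/cpuhp/enable"
    cat "$TRACEFS/trace" > "$out"       # save the result for mailing
}
```

Calling trace_hotplug_cycle 1 would cycle CPU 1 and leave the trace in cpuhp-trace.txt in the current directory.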

I'm traveling for a week now. I'll come back to this after my trip; if I
forget, please send me a reminder.

Thanks,

tglx