2007-08-27 12:08:19

by Pavel Machek

Subject: cpu hotplug support broken in 2.6.23-rc3

Hi!

Trying to do few onlines/offlines reliably hangs my machine (thinkpad
x60, i386 architecture).

Plus I guess it would be nice to add CPU HOTPLUG into MAINTAINERS
file:

pavel@amd:/data/l/linux$ grep CPU MAINTAINERS
CPU FREQUENCY DRIVERS
CPUID/MSR DRIVER
CPUSETS
i386 SETUP CODE / CPU ERRATA WORKAROUNDS
SCx200 CPU SUPPORT
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


2007-08-27 12:09:06

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> Hi!
>
> Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> x60, i386 architecture).
>
> Plus I guess it would be nice to add CPU HOTPLUG into MAINTAINERS
> file:
>
> pavel@amd:/data/l/linux$ grep CPU MAINTAINERS
> CPU FREQUENCY DRIVERS
> CPUID/MSR DRIVER
> CPUSETS
> i386 SETUP CODE / CPU ERRATA WORKAROUNDS
> SCx200 CPU SUPPORT

...plus it actually breaks suspend, and it is a regression from 2.6.22.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-27 14:37:13

by Jeff Chua

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On 8/27/07, Pavel Machek <[email protected]> wrote:
> On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> > Hi!
> >
> > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > x60, i386 architecture).

I just ran 3 cycles of on-line/off-line on 2.6.23-rc3 on a ThinkPad x60s,
and my system still survives.

Jeff.

2007-08-27 15:22:18

by Michal Piotrowski

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

Hi,

On 27/08/07, Jeff Chua <[email protected]> wrote:
> On 8/27/07, Pavel Machek <[email protected]> wrote:
> > On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> > > Hi!
> > >
> > > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > > x60, i386 architecture).
>
> I just 3 cycles of on-line/off-line on 2.6.23-rc3 on ThinkPad x60s,
> and my system still survives.

So maybe a diff between your config file and Pavel's will give an answer.

Any details about the software environment?

Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/

2007-08-27 21:48:19

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Mon 2007-08-27 22:36:57, Jeff Chua wrote:
> On 8/27/07, Pavel Machek <[email protected]> wrote:
> > On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> > > Hi!
> > >
> > > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > > x60, i386 architecture).
>
> I just 3 cycles of on-line/off-line on 2.6.23-rc3 on ThinkPad x60s,
> and my system still survives.

Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
so cycles at one point.

...or maybe the difference is in the .config, or maybe I broke something
in my kernel sources....
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-27 21:57:24

by Rafael J. Wysocki

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Monday, 27 August 2007 23:32, Pavel Machek wrote:
> On Mon 2007-08-27 22:36:57, Jeff Chua wrote:
> > On 8/27/07, Pavel Machek <[email protected]> wrote:
> > > On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> > > > Hi!
> > > >
> > > > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > > > x60, i386 architecture).
> >
> > I just 3 cycles of on-line/off-line on 2.6.23-rc3 on ThinkPad x60s,
> > and my system still survives.
>
> Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> so cycles at one point.
>
> ...or maybe difference is in the .config, or maybe I broken something
> in my kernel sources....

Well, something seems to be wrong with the CPU hotplug, but it's insanely
difficult to reproduce on my boxes.

I bet on one of the notifiers blocking while waiting on a frozen task.

Greetings,
Rafael

2007-08-27 21:59:10

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Mon 2007-08-27 23:59:31, Rafael J. Wysocki wrote:
> On Monday, 27 August 2007 23:32, Pavel Machek wrote:
> > On Mon 2007-08-27 22:36:57, Jeff Chua wrote:
> > > On 8/27/07, Pavel Machek <[email protected]> wrote:
> > > > On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> > > > > Hi!
> > > > >
> > > > > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > > > > x60, i386 architecture).
> > >
> > > I just 3 cycles of on-line/off-line on 2.6.23-rc3 on ThinkPad x60s,
> > > and my system still survives.
> >
> > Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> > so cycles at one point.
> >
> > ...or maybe difference is in the .config, or maybe I broken something
> > in my kernel sources....
>
> Well, something seems to be wrong with the CPU hotplug, but it's insanely
> difficult to reproduce on my boxes.
>
> I bet on one of the notifiers blocking while waiting on a frozen task.

It happens reliably for me, with this script... and randomly, when I
just echo 0/1 > online from commandline... so it should not be
anything with the frozen tasks.

echo test > /sys/power/disk
echo disk > /sys/power/state

reliably hangs on resume in the attached script. It works ok with
nosmp.

Pavel

#!/bin/bash
killall klogd

echo -n "testing refrigerator (testproc)..."
echo testproc > /sys/power/disk
echo disk > /sys/power/state
echo "okay"

sleep 2
echo -n "testing drivers (test)..."
echo test > /sys/power/disk
echo disk > /sys/power/state
echo "okay"

sleep 2
echo -n "testing swsusp (reboot)..."
echo reboot > /sys/power/disk
echo disk > /sys/power/state
echo "okay"

sleep 2
echo -n "testing s2ram..."
s2ram
echo "okay"

sleep 2
echo -n "testing swsusp (shutdown)..."
echo shutdown > /sys/power/disk
echo disk > /sys/power/state
echo "okay"

sleep 2
echo -n "testing swsusp (platform)..."
echo platform > /sys/power/disk
echo disk > /sys/power/state
echo "okay"

sleep 2
echo -n "testing s2ram..."
s2ram
echo "okay"


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-08-28 10:20:24

by Rafael J. Wysocki

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Monday, 27 August 2007 23:58, Pavel Machek wrote:
> On Mon 2007-08-27 23:59:31, Rafael J. Wysocki wrote:
> > On Monday, 27 August 2007 23:32, Pavel Machek wrote:
> > > On Mon 2007-08-27 22:36:57, Jeff Chua wrote:
> > > > On 8/27/07, Pavel Machek <[email protected]> wrote:
> > > > > On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> > > > > > Hi!
> > > > > >
> > > > > > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > > > > > x60, i386 architecture).
> > > >
> > > > I just 3 cycles of on-line/off-line on 2.6.23-rc3 on ThinkPad x60s,
> > > > and my system still survives.
> > >
> > > Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> > > so cycles at one point.
> > >
> > > ...or maybe difference is in the .config, or maybe I broken something
> > > in my kernel sources....
> >
> > Well, something seems to be wrong with the CPU hotplug, but it's insanely
> > difficult to reproduce on my boxes.
> >
> > I bet on one of the notifiers blocking while waiting on a frozen task.
>
> It happens reliably for me, with this script... and randomly, when I
> just echo 0/1 > online from commandline... so it should not be
> anything with the frozen tasks.

That suggests the CPU hotplug just deadlocks internally.

Can you put some printk's into _cpu_down() and see where exactly it hangs?

> echo test > /sys/power/disk
> echo disk > /sys/power/state
>
> reliably hangs on resume in the attached script. It works ok with
> nosmp.

Which step hangs it? Or is it at random?

Rafael


> #!/bin/bash
> killall klogd
>
> echo -n "testing refrigerator (testproc)..."
> echo testproc > /sys/power/disk
> echo disk > /sys/power/state
> echo "okay"
>
> sleep 2
> echo -n "testing drivers (test)..."
> echo test > /sys/power/disk
> echo disk > /sys/power/state
> echo "okay"
>
> sleep 2
> echo -n "testing swsusp (reboot)..."
> echo reboot > /sys/power/disk
> echo disk > /sys/power/state
> echo "okay"
>
> sleep 2
> echo -n "testing s2ram..."
> s2ram
> echo "okay"
>
> sleep 2
> echo -n "testing swsusp (shutdown)..."
> echo shutdown > /sys/power/disk
> echo disk > /sys/power/state
> echo "okay"
>
> sleep 2
> echo -n "testing swsusp (platform)..."
> echo platform > /sys/power/disk
> echo disk > /sys/power/state
> echo "okay"
>
> sleep 2
> echo -n "testing s2ram..."
> s2ram
> echo "okay"
>
>

--
"Premature optimization is the root of all evil." - Donald Knuth

2007-08-28 13:00:27

by Akinobu Mita

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

2007/8/28, Rafael J. Wysocki <[email protected]>:
> On Monday, 27 August 2007 23:58, Pavel Machek wrote:
> > On Mon 2007-08-27 23:59:31, Rafael J. Wysocki wrote:
> > > On Monday, 27 August 2007 23:32, Pavel Machek wrote:
> > > > On Mon 2007-08-27 22:36:57, Jeff Chua wrote:
> > > > > On 8/27/07, Pavel Machek <[email protected]> wrote:
> > > > > > On Mon 2007-08-27 12:43:50, Pavel Machek wrote:
> > > > > > > Hi!
> > > > > > >
> > > > > > > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > > > > > > x60, i386 architecture).
> > > > >
> > > > > I just 3 cycles of on-line/off-line on 2.6.23-rc3 on ThinkPad x60s,
> > > > > and my system still survives.
> > > >
> > > > Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> > > > so cycles at one point.
> > > >
> > > > ...or maybe difference is in the .config, or maybe I broken something
> > > > in my kernel sources....

I have been doing plenty of CPU offline/online tests these days and they work
fine. But there is no cpufreq driver which supports my machine, so my tests
didn't cover the cpu hotplug code in cpufreq.

If you have a cpufreq driver and it is built as a module, it is worth trying
the same test after unloading the cpufreq driver in order to narrow down the
problem area.
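
To see which modules would need unloading, something like the following could be used. This is only a sketch: the function name `list_cpufreq_modules` is made up for illustration, and the name patterns (cpufreq, speedstep, powernow) are an assumption about how the x86 cpufreq drivers and governors are usually named, so the list may be incomplete on other hardware.

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): print the names of loaded
# modules that look cpufreq related.
#   $1 = module list in /proc/modules format (default: /proc/modules)
list_cpufreq_modules() {
    awk '$1 ~ /cpufreq|speedstep|powernow/ { print $1 }' "${1:-/proc/modules}"
}

# Example (as root): unload each match, then rerun the hotplug test.
# Modules that are still in use will refuse to unload.
#   for m in $(list_cpufreq_modules); do modprobe -r "$m"; done
```

If the hang goes away with the modules unloaded, the cpufreq hotplug notifier becomes the prime suspect; if it stays, cpufreq can be ruled out.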

2007-08-28 14:21:48

by Jeff Chua

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On 8/28/07, Pavel Machek <[email protected]> wrote:

> Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> so cycles at one point.

Mine still survives with this ... with sleep 1 ...

# for((i=0; i<100; i++)); do echo $i; echo $((i % 2)) >/sys/devices/system/cpu/cpu1/online; sleep 1; done

and this as well ... without sleep ...

# for((i=0; i<100; i++)); do echo $i; echo $((i % 2)) >/sys/devices/system/cpu/cpu1/online; done
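
A more verbose variant of these loops can help pinpoint the transition at which a hang occurs. The following is a minimal sketch, not the script anyone in this thread actually ran: the function name `cycle_cpu` and its arguments are invented for illustration, and it assumes the same sysfs `online` file as above. Each step is printed before the write is attempted, so the last line on the console names the transition that never returned.

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): offline and online one CPU
# COUNT times, announcing each step before attempting it.
#   $1 = path to the CPU's sysfs "online" file
#   $2 = number of offline/online cycles (default: 20)
cycle_cpu() (
    online_file=$1
    count=${2:-20}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i: offline"
        echo 0 > "$online_file" || exit 1
        echo "cycle $i: online"
        echo 1 > "$online_file" || exit 1
        i=$((i + 1))
    done
)

# Example (as root, on a real machine):
#   cycle_cpu /sys/devices/system/cpu/cpu1/online 20
```

The function body runs in a subshell, so its variables do not leak into the calling shell and a failed write simply ends the run with a non-zero status.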

I'm on reiserfs. gcc 3.4.5. Config sent to you separately so as not to
cloud lkml. If anyone wants the config, please let me know. Is mime
"attachment" acceptable now on lkml?

Thanks,
Jeff.

by Gautham R Shenoy

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

Hi Pavel,
On Mon, Aug 27, 2007 at 12:43:50PM +0200, Pavel Machek wrote:
> Hi!
>
> Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> x60, i386 architecture).
>

That's strange.

I've been running cpu offline/online tests with kernbench,
cpufreq-ondemand and a few rt-tasks running in the background
and it has worked for me.
Something like 100 iterations without a problem. But these were on
machines with 4-8 cpus. So maybe this is something specific to
the dual cpu machine.

Can you post the .config? I'll try to recreate it.

It's really strange, since you mention that all it took was
an echo 1/0 into the sysfs file to break it.

> Plus I guess it would be nice to add CPU HOTPLUG into MAINTAINERS
> file:
>

There is a list of maintainers in the Documentation/cpu-hotplug.txt,
which includes maintainers for different platforms as well.

It's a good idea to add that info to the MAINTAINERS file as well.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-09-03 09:56:33

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

Hi!

> > Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> > so cycles at one point.
>
> Mine still survives with this ... with sleep 1 ...
>
> # for((i=0; i<100; i++)); do echo $i; echo $((i % 2))
> >/sys/devices/system/cpu/cpu1/online; sleep 1; done
>
> and this as well ... without sleep ...
>
> # for((i=0; i<100; i++)); do echo $i; echo $((i % 2))
> >/sys/devices/system/cpu/cpu1/online; done
>
> I'm on reiserfs. gcc 3.4.5. Config sent to you seperately so as not to
> cloud lkml. If anyone wants the config, please let me know. Is mime
> "attachment" acceptable now on lkml?

Ok, so it gets weirder. I now have the machine in a "hung" state; other
consoles still work, but there are no timers - sleep 1 hangs forever.

sysrq-t shows kstopmachine hung in hrtimer_try_to_cancel.
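
For anyone trying to capture the same information without a keyboard SysRq combination, the task dump can also be requested from a shell that is still responsive. A minimal sketch, assuming the kernel was built with CONFIG_MAGIC_SYSRQ; the function name `dump_task_states` is made up for illustration.

```shell
#!/bin/sh
# Ask the kernel to dump all task states (the same dump as SysRq-T) into
# the kernel log.
#   $1 = sysrq trigger file (default: /proc/sysrq-trigger)
dump_task_states() {
    echo t > "${1:-/proc/sysrq-trigger}"
}

# Example (as root): dump the tasks, then look at the end of the log.
#   dump_task_states && dmesg | tail -n 200
```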

So I indeed suspect difference-in-kconfig to trigger this, and will
try disabling noidlehz.
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-09-03 10:06:26

by Pavel Machek

Subject: highres timers break cpu hotplug in 2.6.23-rc5 [was Re: cpu hotplug support broken in 2.6.23-rc3]

Hi!

> > Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> > so cycles at one point.
>
> Mine still survives with this ... with sleep 1 ...
>
> # for((i=0; i<100; i++)); do echo $i; echo $((i % 2))
> >/sys/devices/system/cpu/cpu1/online; sleep 1; done
>
> and this as well ... without sleep ...
>
> # for((i=0; i<100; i++)); do echo $i; echo $((i % 2))
> >/sys/devices/system/cpu/cpu1/online; done
>
> I'm on reiserfs. gcc 3.4.5. Config sent to you seperately so as not to
> cloud lkml. If anyone wants the config, please let me know. Is mime
> "attachment" acceptable now on lkml?

It gets weirder. With "nohz=off" on commandline, I have to press any
key (generate interrupt?) for echo 1 > online to finish. 2.6.23-rc5
kernel... but hotplug/unplug works reliably now.

With nohz=off highres=off I can unplug/replug cpus as much as I
want... running in tight loop now.
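
When comparing results between machines, it helps to confirm that the running kernel really was booted with these parameters. A minimal sketch; the helper name `has_boot_param` is made up for illustration and it simply looks for an exact, space-separated match on the kernel command line.

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): succeed if PARAM appears as
# a whole word on the kernel command line.
#   $1 = parameter to look for, e.g. nohz=off
#   $2 = command line file (default: /proc/cmdline)
has_boot_param() (
    cmdline=
    # read fails on a missing trailing newline, so also accept a non-empty result
    read -r cmdline < "${2:-/proc/cmdline}" || [ -n "$cmdline" ] || exit 1
    case " $cmdline " in
        *" $1 "*) exit 0 ;;
        *)        exit 1 ;;
    esac
)

# Example:
#   has_boot_param nohz=off && has_boot_param highres=off && echo "both disabled"
```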
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-09-03 10:07:50

by Rafael J. Wysocki

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Monday, 3 September 2007 05:47, Pavel Machek wrote:
> Hi!
>
> > > Can you try 20-or-so tests? Mine hangs randomly, so it survived 4 or
> > > so cycles at one point.
> >
> > Mine still survives with this ... with sleep 1 ...
> >
> > # for((i=0; i<100; i++)); do echo $i; echo $((i % 2))
> > >/sys/devices/system/cpu/cpu1/online; sleep 1; done
> >
> > and this as well ... without sleep ...
> >
> > # for((i=0; i<100; i++)); do echo $i; echo $((i % 2))
> > >/sys/devices/system/cpu/cpu1/online; done
> >
> > I'm on reiserfs. gcc 3.4.5. Config sent to you seperately so as not to
> > cloud lkml. If anyone wants the config, please let me know. Is mime
> > "attachment" acceptable now on lkml?
>
> Ok, so it gets weirder. I have now machine in "hung" state; other
> consoles still work, but there are no timers - sleep 1 hangs forever.
>
> sysrq-t shows kstopmachine hung in hrtimer_try_to_cancel.
>
> So I indeed suspect difference-in-kconfig to trigger this, and will
> try disabling noidlehz.

I would unset CONFIG_HIGH_RES_TIMERS for starters.
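
To compare two configs on exactly this point, the state of a single option can be read straight from the .config file. A minimal sketch; the function name `config_state` is made up for illustration, and it relies only on the usual .config layout (`CONFIG_FOO=y` for set options, a comment line for unset ones).

```shell
#!/bin/sh
# Hypothetical helper (not from this thread): print how OPTION is set in a
# kernel .config file.
#   $1 = option name without the CONFIG_ prefix, e.g. HIGH_RES_TIMERS
#   $2 = path to the .config file (default: ./.config)
# Prints the value after "=" (y, m, ...) or "unset" if there is no such line.
config_state() (
    value=$(sed -n "s/^CONFIG_$1=//p" "${2:-.config}")
    echo "${value:-unset}"
)

# Example:
#   config_state HIGH_RES_TIMERS .config
#   config_state NO_HZ .config
```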

Well, I guess Thomas should know about that. ;-)

Greetings,
Rafael

2007-09-03 10:08:44

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Wed 2007-08-29 13:38:27, Gautham R Shenoy wrote:
> Hi Pavel,
> On Mon, Aug 27, 2007 at 12:43:50PM +0200, Pavel Machek wrote:
> > Hi!
> >
> > Trying to do few onlines/offlines reliably hangs my machine (thinkpad
> > x60, i386 architecture).
> >
>
> That's strange.
>
> I've been running cpu offline/online tests with kern bench,
> cpufreq-ondemand and a few rt-tasks running in the background
> and it has worked for me.
> Something like 100 iterations without a problem. But these were on
> machines with 4-8 cpus. So may be this could be something specific to
> the dual cpu machine.

Seems like it is specific to nohz/highrestimers.

> Can you post the .config? I'll try to recreate it?

Will send privately.

> > Plus I guess it would be nice to add CPU HOTPLUG into MAINTAINERS
> > file:
> >
>
> There is a list of maintainers in the Documentation/cpu-hotplug.txt,
> which includes maintainers for different platforms as well.
>
> It's a good idea to add that info to the MAINTAINERS file as well.

Yes, please.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-09-03 12:34:24

by Jeff Chua

Subject: Re: highres timers break cpu hotplug in 2.6.23-rc5 [was Re: cpu hotplug support broken in 2.6.23-rc3]

On 9/3/07, Pavel Machek <[email protected]> wrote:

> It gets weirder. With "nohz=off" on commandline, I have to press any
> key (generate interrupt?) for echo 1 > online to finish. 2.6.23-rc5
> kernel... but hotplug/unplug works reliably now.
>
> With nohz=off highres=off I can unplug/replug cpus as much as I
> want... running in tight loop now.

Yes. CONFIG_NO_HZ and CONFIG_HIGH_RES_TIMERS have to be unset, or
suspend-to-disk would just hang, unless you type something on the
keyboard, and then you can suspend to disk. It seems interrupts are
not missing.

Thanks,
Jeff.

2007-09-03 12:35:38

by Thomas Gleixner

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Mon, 2007-09-03 at 12:19 +0200, Rafael J. Wysocki wrote:
> > Ok, so it gets weirder. I have now machine in "hung" state; other
> > consoles still work, but there are no timers - sleep 1 hangs forever.
> >
> > sysrq-t shows kstopmachine hung in hrtimer_try_to_cancel.
> >
> > So I indeed suspect difference-in-kconfig to trigger this, and will
> > try disabling noidlehz.
>
> I would unset CONFIG_HIGH_RES_TIMERS for starters.
>
> Well, I guess Thomas should know about that. ;-)

What was the last version known to work?

tglx


2007-09-04 07:27:53

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

> On Mon, 2007-09-03 at 12:19 +0200, Rafael J. Wysocki wrote:
> > > Ok, so it gets weirder. I have now machine in "hung" state; other
> > > consoles still work, but there are no timers - sleep 1 hangs forever.
> > >
> > > sysrq-t shows kstopmachine hung in hrtimer_try_to_cancel.
> > >
> > > So I indeed suspect difference-in-kconfig to trigger this, and will
> > > try disabling noidlehz.
> >
> > I would unset CONFIG_HIGH_RES_TIMERS for starters.
> >
> > Well, I guess Thomas should know about that. ;-)
>
> What was the last known to work version ?

I'm afraid I only turned on HIGH_RES_TIMERS in 2.6.23-rc1
timeframe... so I'm not sure if it ever worked for me.

I can confirm it is working in 2.6.23-rc5 with highres disabled, and
broken with highres enabled. NOHZ turns "waits for keypress during
unplug/replug" into "just plain hangs".
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-09-13 20:01:49

by Thomas Gleixner

Subject: Re: cpu hotplug support broken in 2.6.23-rc3


On Tue, 2007-09-04 at 09:27 +0200, Pavel Machek wrote:
> > On Mon, 2007-09-03 at 12:19 +0200, Rafael J. Wysocki wrote:
> > > > Ok, so it gets weirder. I have now machine in "hung" state; other
> > > > consoles still work, but there are no timers - sleep 1 hangs forever.
> > > >
> > > > sysrq-t shows kstopmachine hung in hrtimer_try_to_cancel.
> > > >
> > > > So I indeed suspect difference-in-kconfig to trigger this, and will
> > > > try disabling noidlehz.
> > >
> > > I would unset CONFIG_HIGH_RES_TIMERS for starters.
> > >
> > > Well, I guess Thomas should know about that. ;-)
> >
> > What was the last known to work version ?
>
> I'm afraid I only turned on HIGH_RES_TIMERS in 2.6.23-rc1
> timeframe... so I'm not sure if it ever worked for me.
>
> I can confirm it is working in 2.6.23-rc5 with highres disabled, and
> broken with highres enabled. NOHZ turns "waits for keypress during
> unplug/replug" into "just plain hangs".

Ok, I can reproduce it and I tracked down what happens:

When the CPU goes offline, the clock event source for this CPU (lapic)
is removed from the clock events framework. This also clears the
information that the CPU is using C-States which stop the local APIC
timer.

Now you put the CPU online again and the local APIC timer is used, but
the C-State information is not evaluated again in ACPI. This means that
the clock events code does not know that the APIC might stop. In the
worst case this will happen and make the CPU wait for timer interrupts
forever.

The problem only appears when you are on battery (c3/c4 available) or on
those broken machines, where C2 is in reality C3 (e.g. akpm's VAIO)

I have an as yet untested fix, which preserves the broadcast state across
the offline state, but Len is looking into it as well, to see whether we can
just reevaluate the power states (and the broadcast flags) when a cpu
comes online again. If Len can do that easily for 2.6.23, I'd prefer
that.
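
The failure sequence described above can be pictured with a small toy model. This is not kernel code and is only a sketch of the explanation given in this mail: every variable and function name is invented for illustration, one CPU is modelled, and "reevaluate on online" stands for the second approach mentioned above.

```shell
#!/bin/sh
# Toy model only, not kernel code. All names are invented for illustration.
lapic_registered=0   # 1 while the CPU's local APIC timer is registered
broadcast_flag=0     # 1 if clock events knows the lapic stops in deep C-states
deep_cstates=1       # 1 if the CPU really enters C3/C4 (e.g. on battery)

# ACPI evaluates the C-states and tells clock events about them.
acpi_evaluate_cstates() {
    broadcast_flag=$deep_cstates
}

# Offlining removes the lapic clock event device and, with it, the flag.
cpu_offline() {
    lapic_registered=0
    broadcast_flag=0
}

# Behaviour described above: the lapic is used again, nothing is reevaluated.
cpu_online() {
    lapic_registered=1
}

# Alternative: reevaluate the power states when the CPU comes back.
cpu_online_reevaluate() {
    lapic_registered=1
    acpi_evaluate_cstates
}

# Succeeds if the CPU still receives timer interrupts while idle.
timer_survives_idle() {
    [ "$lapic_registered" -eq 1 ] || return 1
    [ "$deep_cstates" -eq 0 ] || [ "$broadcast_flag" -eq 1 ]
}

# Example:
#   cpu_online; acpi_evaluate_cstates   # boot
#   cpu_offline; cpu_online             # hotplug cycle
#   timer_survives_idle || echo "CPU waits for timer interrupts forever"
```

Setting `deep_cstates=0` in the model corresponds to running on AC power, where the same hotplug cycle does no harm, which matches the observation that the problem only shows up when C3/C4 are available.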

tglx


2007-09-14 12:37:29

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

Hi!

> > > What was the last known to work version ?
> >
> > I'm afraid I only turned on HIGH_RES_TIMERS in 2.6.23-rc1
> > timeframe... so I'm not sure if it ever worked for me.
> >
> > I can confirm it is working in 2.6.23-rc5 with highres disabled, and
> > broken with highres enabled. NOHZ turns "waits for keypress during
> > unplug/replug" into "just plain hangs".
>
> Ok, I can reproduce it and I tracked down what happens:
>
> When the CPU goes offline, the clock event source for this CPU (lapic)
> is removed from the clock events framework. This also clears the
> information that the CPU is using C-States which stop the local APIC
> timer.
>
> Now you put the CPU online again and the local APIC timer is used, but
> the C-State information is not evaluated again in ACPI. This means that
> the clock events code does not know that the APIC might stop. In the
> worst case this will happen and make the CPU wait for timer interrupts
> forever.
>
> The problem only appears when you are on battery (c3/c4 available) or on
> those broken machines, where C2 is in reality C3 (e.g. akpm's VAIO)
>
> I have an yet untested fix, which preserves the broadcast state across
> the offline state, but Len is looking into it as well, whether we can
> just reevaluate the power states (and the broadcast flags) when a cpu
> becomes online again. If Len can do that easily for 2.6.23, I'd prefer
> that.

Is there a patch you want me to test? Or does Len have anything to
play with?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-09-14 12:51:18

by Thomas Gleixner

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

Pavel,

On Fri, 2007-09-14 at 14:38 +0200, Pavel Machek wrote:
> > I have an yet untested fix, which preserves the broadcast state across
> > the offline state, but Len is looking into it as well, whether we can
> > just reevaluate the power states (and the broadcast flags) when a cpu
> > becomes online again. If Len can do that easily for 2.6.23, I'd prefer
> > that.
>
> Is there a patch you want me to test? Or does Len have anything to
> play with?

Venki sent me an initial patch, but it has issues with the notify
ordering. Find below my "cache the broadcast flags" version for testing.

Thanks,

tglx

---
 kernel/time/tick-broadcast.c |   21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

Index: linux-2.6/kernel/time/tick-broadcast.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-broadcast.c	2007-09-14 13:22:29.000000000 +0200
+++ linux-2.6/kernel/time/tick-broadcast.c	2007-09-14 13:22:29.000000000 +0200
@@ -261,10 +261,25 @@ void tick_broadcast_on_off(unsigned long
 	int cpu = get_cpu();

 	if (!cpu_isset(*oncpu, cpu_online_map)) {
-		printk(KERN_ERR "tick-braodcast: ignoring broadcast for "
-		       "offline CPU #%d\n", *oncpu);
-	} else {
+		unsigned long flags;
+
+		spin_lock_irqsave(&tick_broadcast_lock, flags);
+		/*
+		 * We need to cache the broadcast flag for offline
+		 * CPUs. ACPI currently does not reevaluate the
+		 * broadcast flag when a CPU goes online again. Adding
+		 * a cpu notifier to ACPI is probably the correct
+		 * solution, but it is hard to get this correct due to
+		 * notify ordering problems. So caching the flag is
+		 * the safe solution for now.
+		 */
+		if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ON)
+			cpu_set(*oncpu, tick_broadcast_mask);
+		else
+			cpu_clear(*oncpu, tick_broadcast_mask);

+		spin_unlock_irqrestore(&tick_broadcast_lock, flags);
+	} else {
 		if (cpu == *oncpu)
 			tick_do_broadcast_on_off(&reason);
 		else


2007-09-14 13:15:59

by Thomas Gleixner

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Fri, 2007-09-14 at 14:50 +0200, Thomas Gleixner wrote:
> Pavel,
>
> On Fri, 2007-09-14 at 14:38 +0200, Pavel Machek wrote:
> > > I have an yet untested fix, which preserves the broadcast state across
> > > the offline state, but Len is looking into it as well, whether we can
> > > just reevaluate the power states (and the broadcast flags) when a cpu
> > > becomes online again. If Len can do that easily for 2.6.23, I'd prefer
> > > that.
> >
> > Is there a patch you want me to test? Or does Len have anything to
> > play with?
>
> Venki sent me an initial patch, but it has issues with the notify
> ordering. Find below my "cache the broadcast flags" version for testing.

Hmmpf, the flag is still cleared when the cpu goes offline. Need to take
a closer look.

tglx


2007-09-14 18:50:55

by Pallipadi, Venkatesh

Subject: RE: cpu hotplug support broken in 2.6.23-rc3



>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of
>Thomas Gleixner
>Sent: Friday, September 14, 2007 5:51 AM
>To: Pavel Machek
>Cc: Rafael J. Wysocki; Jeff Chua; [email protected];
>[email protected]; [email protected]; kernel list; Len Brown
>Subject: Re: cpu hotplug support broken in 2.6.23-rc3
>
>Pavel,
>
>On Fri, 2007-09-14 at 14:38 +0200, Pavel Machek wrote:
>> > I have an yet untested fix, which preserves the broadcast state across
>> > the offline state, but Len is looking into it as well, whether we can
>> > just reevaluate the power states (and the broadcast flags) when a cpu
>> > becomes online again. If Len can do that easily for 2.6.23, I'd prefer
>> > that.
>>
>> Is there a patch you want me to test? Or does Len have anything to
>> play with?
>
>Venki sent me an initial patch, but it has issues with the notify
>ordering. Find below my "cache the broadcast flags" version for testing.
>

While writing that patch, I knew the solution could not be that simple :(.
Does the patch work for the online/offline case at least?
Will look at the Suspend/Resume ordering part in that case.

Thanks,
Venki

2007-09-14 19:19:59

by Thomas Gleixner

Subject: RE: cpu hotplug support broken in 2.6.23-rc3

On Fri, 2007-09-14 at 11:49 -0700, Pallipadi, Venkatesh wrote:
> >>
> >> Is there a patch you want me to test? Or does Len have anything to
> >> play with?
> >
> >Venki sent me an initial patch, but it has issues with the notify
> >ordering. Find below my "cache the broadcast flags" version
> >for testing.
> >
>
> While wirting that patch, I knew solution could not be that simple :(.
> Does the patch work for online offline case atleast?
> Will look at the Suspend/Resume ordering part in that case.

Yup, the online/offline part works and it helped me to decode the other
reason (/me needs a dark brown paperbag) why Pavel noticed that his box
turned into a brick. I'll send out a full series of fixups (including
your online/offline one) tomorrow morning. I want to give that some more
testing.

Vs. the resume reevaluation: I don't think it's an urgent problem. It's
only my VAIO which does not tell the kernel after resume that the power
supply source has changed. All my other boxen do that and we never had a
complaint about that from other folks.

tglx


2007-09-15 09:49:57

by Thomas Gleixner

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

Pavel,

On Fri, 2007-09-14 at 15:15 +0200, Thomas Gleixner wrote:
> > Venki sent me an initial patch, but it has issues with the notify
> > ordering. Find below my "cache the broadcast flags" version for testing.
>
> Hmmpf, the flag is still cleared when the cpu goes offline. Need to take
> a closer look.

I finally tracked it down. There were several ways to turn the box into
a brick. Sigh !

Can you please test the combo patch below ?

The details are available from the for-2.6.23 branch of my hrt git repo:

http://git.kernel.org/?p=linux/kernel/git/tglx/linux-2.6-hrt.git;a=shortlog;h=for-2.6.23

Thanks,

tglx

Index: linux-2.6/kernel/time/timekeeping.c
===================================================================
--- linux-2.6.orig/kernel/time/timekeeping.c	2007-09-15 11:42:09.000000000 +0200
+++ linux-2.6/kernel/time/timekeeping.c	2007-09-15 11:43:03.000000000 +0200
@@ -217,6 +217,7 @@ static void change_clocksource(void)
 }
 #else
 static inline void change_clocksource(void) { }
+static inline s64 __get_nsec_offset(void) { return 0; }
 #endif

 /**
@@ -280,6 +281,8 @@ void __init timekeeping_init(void)
 static int timekeeping_suspended;
 /* time in seconds when suspend began */
 static unsigned long timekeeping_suspend_time;
+/* xtime offset when we went into suspend */
+static s64 timekeeping_suspend_nsecs;

 /**
  * timekeeping_resume - Resumes the generic timekeeping subsystem.
@@ -305,6 +308,8 @@ static int timekeeping_resume(struct sys
 		wall_to_monotonic.tv_sec -= sleep_length;
 		total_sleep_time += sleep_length;
 	}
+	/* Make sure that we have the correct xtime reference */
+	timespec_add_ns(&xtime, timekeeping_suspend_nsecs);
 	/* re-base the last cycle value */
 	clock->cycle_last = clocksource_read(clock);
 	clock->error = 0;
@@ -325,9 +330,12 @@ static int timekeeping_suspend(struct sy
 {
 	unsigned long flags;

+	timekeeping_suspend_time = read_persistent_clock();
+
 	write_seqlock_irqsave(&xtime_lock, flags);
+	/* Get the current xtime offset */
+	timekeeping_suspend_nsecs = __get_nsec_offset();
 	timekeeping_suspended = 1;
-	timekeeping_suspend_time = read_persistent_clock();
 	write_sequnlock_irqrestore(&xtime_lock, flags);

 	clockevents_notify(CLOCK_EVT_NOTIFY_SUSPEND, NULL);
Index: linux-2.6/drivers/acpi/processor_core.c
===================================================================
--- linux-2.6.orig/drivers/acpi/processor_core.c	2007-09-15 11:42:09.000000000 +0200
+++ linux-2.6/drivers/acpi/processor_core.c	2007-09-15 11:43:03.000000000 +0200
@@ -724,6 +724,25 @@ static void acpi_processor_notify(acpi_h
 	return;
 }

+static int acpi_cpu_soft_notify(struct notifier_block *nfb,
+		unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct acpi_processor *pr = processors[cpu];
+
+	if (action == CPU_ONLINE && pr) {
+		acpi_processor_ppc_has_changed(pr);
+		acpi_processor_cst_has_changed(pr);
+		acpi_processor_tstate_has_changed(pr);
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block acpi_cpu_notifier =
+{
+	.notifier_call = acpi_cpu_soft_notify,
+};
+
 static int acpi_processor_add(struct acpi_device *device)
 {
 	struct acpi_processor *pr = NULL;
@@ -987,6 +1006,7 @@ void acpi_processor_install_hotplug_noti
 			    ACPI_UINT32_MAX,
 			    processor_walk_namespace_cb, &action, NULL);
 #endif
+	register_hotcpu_notifier(&acpi_cpu_notifier);
 }

 static
@@ -999,6 +1019,7 @@ void acpi_processor_uninstall_hotplug_no
 			    ACPI_UINT32_MAX,
 			    processor_walk_namespace_cb, &action, NULL);
 #endif
+	unregister_hotcpu_notifier(&acpi_cpu_notifier);
 }

 /*
Index: linux-2.6/kernel/time/tick-broadcast.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-broadcast.c 2007-09-15 11:42:09.000000000 +0200
+++ linux-2.6/kernel/time/tick-broadcast.c 2007-09-15 11:43:03.000000000 +0200
@@ -382,12 +382,23 @@ static int tick_broadcast_set_event(ktim

 int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {
+	int cpu = smp_processor_id();
+
+	/*
+	 * If the CPU is marked for broadcast, enforce oneshot
+	 * broadcast mode. The jinxed VAIO does not resume otherwise.
+	 * No idea why it ends up in a lower C State during resume
+	 * without notifying the clock events layer.
+	 */
+	if (cpu_isset(cpu, tick_broadcast_mask))
+		cpu_set(cpu, tick_broadcast_oneshot_mask);
+
 	clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);

 	if(!cpus_empty(tick_broadcast_oneshot_mask))
 		tick_broadcast_set_event(ktime_get(), 1);

-	return cpu_isset(smp_processor_id(), tick_broadcast_oneshot_mask);
+	return cpu_isset(cpu, tick_broadcast_oneshot_mask);
 }

 /*
@@ -549,20 +560,17 @@ void tick_broadcast_switch_to_oneshot(vo
  */
 void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
 {
-	struct clock_event_device *bc;
 	unsigned long flags;
 	unsigned int cpu = *cpup;

 	spin_lock_irqsave(&tick_broadcast_lock, flags);

-	bc = tick_broadcast_device.evtdev;
+	/*
+	 * Clear the broadcast mask flag for the dead cpu, but do not
+	 * stop the broadcast device!
+	 */
 	cpu_clear(cpu, tick_broadcast_oneshot_mask);

-	if (tick_broadcast_device.mode == TICKDEV_MODE_ONESHOT) {
-		if (bc && cpus_empty(tick_broadcast_oneshot_mask))
-			clockevents_set_mode(bc, CLOCK_EVT_MODE_SHUTDOWN);
-	}
-
 	spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }

Index: linux-2.6/kernel/time/tick-sched.c
===================================================================
--- linux-2.6.orig/kernel/time/tick-sched.c 2007-09-15 11:42:09.000000000 +0200
+++ linux-2.6/kernel/time/tick-sched.c 2007-09-15 11:43:41.000000000 +0200
@@ -160,6 +160,18 @@ void tick_nohz_stop_sched_tick(void)
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);

+	/*
+	 * If this cpu is offline and it is the one which updates
+	 * jiffies, then give up the assignment and let it be taken by
+	 * the cpu which runs the tick timer next. If we don't drop
+	 * this here the jiffies might be stale and do_timer() never
+	 * invoked.
+	 */
+	if (unlikely(!cpu_online(cpu))) {
+		if (cpu == tick_do_timer_cpu)
+			tick_do_timer_cpu = -1;
+	}
+
 	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 		goto end;
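
The tick-sched.c hunk above is easier to follow with a small userspace model. The sketch below is not kernel code: `struct tick_model`, `model_stop_tick()` and `model_tick()` are hypothetical names, and the "first online CPU that ticks claims the duty" rule is an assumption taken from the patch comment ("let it be taken by the cpu which runs the tick timer next"). It only illustrates why an offline CPU has to give up the jiffies-update assignment.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS   4
#define TIMER_CPU_NONE  (-1)

/* Hypothetical userspace model of the jiffies-update ownership. */
struct tick_model {
	bool online[MODEL_NR_CPUS];
	int do_timer_cpu;        /* CPU that advances jiffies, or TIMER_CPU_NONE */
	unsigned long jiffies;
};

/*
 * Models the added check: a CPU that stops its tick while offline
 * drops the assignment so that it cannot stay with a dead CPU.
 */
void model_stop_tick(struct tick_model *m, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (!m->online[cpu] && m->do_timer_cpu == cpu)
		m->do_timer_cpu = TIMER_CPU_NONE;
}

/*
 * Models a tick on one CPU. Assumption: an online CPU that sees the
 * assignment vacant claims it; only the owner advances jiffies.
 */
void model_tick(struct tick_model *m, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (!m->online[cpu])
		return;
	if (m->do_timer_cpu == TIMER_CPU_NONE)
		m->do_timer_cpu = cpu;
	if (m->do_timer_cpu == cpu)
		m->jiffies++;
}
```

Without the `model_stop_tick()` step, taking the owning CPU offline would leave `do_timer_cpu` pointing at a CPU that never ticks again, so `jiffies` would stop advancing. That is the stale-jiffies case the patch comment describes.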



2007-09-15 10:20:26

by Andrew Morton

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Sat, 15 Sep 2007 11:49:41 +0200 Thomas Gleixner <[email protected]> wrote:

> On Fri, 2007-09-14 at 15:15 +0200, Thomas Gleixner wrote:
> > > Venki sent me an initial patch, but it has issues with the notify
> > > ordering. Find below my "cache the broadcast flags" version for testing.
> >
> > Hmmpf, the flag is still cleared when the cpu goes offline. Need to take
> > a closer look.
>
> I finally tracked it down. There were several ways to turn the box into
> a brick. Sigh !
>
> Can you please test the combo patch below ?
>
> The details are available from the for-2.6.23 branch of my hrt git repo:
>
> http://git.kernel.org/?p=linux/kernel/git/tglx/linux-2.6-hrt.git;a=shortlog;h=for-2.6.23
>

That patch fixes the resume-from-ram and suspend-to-ram regressions on the
Vaio.

I dropped the timekeeping.c hunks because they are an older version of
timekeeping-prevent-time-going-backwards-on-resume.patch which I already
had.

Is this good to go? Needs a bit of changelogging.


drivers/acpi/processor_core.c | 21 +++++++++++++++++++++
kernel/time/tick-broadcast.c | 24 ++++++++++++++++--------
kernel/time/tick-sched.c | 12 ++++++++++++
3 files changed, 49 insertions(+), 8 deletions(-)

diff -puN drivers/acpi/processor_core.c~cpu-hotplug-support-broken-in-2623-rc3 drivers/acpi/processor_core.c
--- a/drivers/acpi/processor_core.c~cpu-hotplug-support-broken-in-2623-rc3
+++ a/drivers/acpi/processor_core.c
@@ -724,6 +724,25 @@ static void acpi_processor_notify(acpi_h
 	return;
 }

+static int acpi_cpu_soft_notify(struct notifier_block *nfb,
+		unsigned long action, void *hcpu)
+{
+	unsigned int cpu = (unsigned long)hcpu;
+	struct acpi_processor *pr = processors[cpu];
+
+	if (action == CPU_ONLINE && pr) {
+		acpi_processor_ppc_has_changed(pr);
+		acpi_processor_cst_has_changed(pr);
+		acpi_processor_tstate_has_changed(pr);
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block acpi_cpu_notifier =
+{
+	.notifier_call = acpi_cpu_soft_notify,
+};
+
 static int acpi_processor_add(struct acpi_device *device)
 {
 	struct acpi_processor *pr = NULL;
@@ -987,6 +1006,7 @@ void acpi_processor_install_hotplug_noti
 			    ACPI_UINT32_MAX,
 			    processor_walk_namespace_cb, &action, NULL);
 #endif
+	register_hotcpu_notifier(&acpi_cpu_notifier);
 }

 static
@@ -999,6 +1019,7 @@ void acpi_processor_uninstall_hotplug_no
 			    ACPI_UINT32_MAX,
 			    processor_walk_namespace_cb, &action, NULL);
 #endif
+	unregister_hotcpu_notifier(&acpi_cpu_notifier);
 }

 /*
diff -puN kernel/time/tick-broadcast.c~cpu-hotplug-support-broken-in-2623-rc3 kernel/time/tick-broadcast.c
--- a/kernel/time/tick-broadcast.c~cpu-hotplug-support-broken-in-2623-rc3
+++ a/kernel/time/tick-broadcast.c
@@ -382,12 +382,23 @@ static int tick_broadcast_set_event(ktim

 int tick_resume_broadcast_oneshot(struct clock_event_device *bc)
 {
+	int cpu = smp_processor_id();
+
+	/*
+	 * If the CPU is marked for broadcast, enforce oneshot
+	 * broadcast mode. The jinxed VAIO does not resume otherwise.
+	 * No idea why it ends up in a lower C State during resume
+	 * without notifying the clock events layer.
+	 */
+	if (cpu_isset(cpu, tick_broadcast_mask))
+		cpu_set(cpu, tick_broadcast_oneshot_mask);
+
 	clockevents_set_mode(bc, CLOCK_EVT_MODE_ONESHOT);

 	if(!cpus_empty(tick_broadcast_oneshot_mask))
 		tick_broadcast_set_event(ktime_get(), 1);

-	return cpu_isset(smp_processor_id(), tick_broadcast_oneshot_mask);
+	return cpu_isset(cpu, tick_broadcast_oneshot_mask);
 }

 /*
@@ -549,20 +560,17 @@ void tick_broadcast_switch_to_oneshot(vo
  */
 void tick_shutdown_broadcast_oneshot(unsigned int *cpup)
 {
-	struct clock_event_device *bc;
 	unsigned long flags;
 	unsigned int cpu = *cpup;

 	spin_lock_irqsave(&tick_broadcast_lock, flags);

-	bc = tick_broadcast_device.evtdev;
+	/*
+	 * Clear the broadcast mask flag for the dead cpu, but do not
+	 * stop the broadcast device!
+	 */
 	cpu_clear(cpu, tick_broadcast_oneshot_mask);

-	if (tick_broadcast_device.mode == TICKDEV_MODE_ONESHOT) {
-		if (bc && cpus_empty(tick_broadcast_oneshot_mask))
-			clockevents_set_mode(bc, CLOCK_EVT_MODE_SHUTDOWN);
-	}
-
 	spin_unlock_irqrestore(&tick_broadcast_lock, flags);
 }

diff -puN kernel/time/tick-sched.c~cpu-hotplug-support-broken-in-2623-rc3 kernel/time/tick-sched.c
--- a/kernel/time/tick-sched.c~cpu-hotplug-support-broken-in-2623-rc3
+++ a/kernel/time/tick-sched.c
@@ -160,6 +160,18 @@ void tick_nohz_stop_sched_tick(void)
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);

+	/*
+	 * If this cpu is offline and it is the one which updates
+	 * jiffies, then give up the assignment and let it be taken by
+	 * the cpu which runs the tick timer next. If we don't drop
+	 * this here the jiffies might be stale and do_timer() never
+	 * invoked.
+	 */
+	if (unlikely(!cpu_online(cpu))) {
+		if (cpu == tick_do_timer_cpu)
+			tick_do_timer_cpu = -1;
+	}
+
 	if (unlikely(ts->nohz_mode == NOHZ_MODE_INACTIVE))
 		goto end;

_
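
The two tick-broadcast.c hunks above come down to simple mask bookkeeping, which can be sketched in userspace. This is a rough model only: `struct bc_model` and both functions are hypothetical names, a 32-bit integer stands in for the kernel's cpumask type, and `device_running` is a stand-in for the state of the broadcast clock event device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model; bit N of each mask represents CPU N. */
struct bc_model {
	uint32_t broadcast_mask;  /* CPUs marked as needing broadcast */
	uint32_t oneshot_mask;    /* CPUs currently served by oneshot broadcast */
	bool device_running;      /* stand-in for the broadcast device state */
};

/*
 * Resume path: a CPU that is marked for broadcast is forced into the
 * oneshot mask. Returns true when the CPU ends up in the oneshot mask.
 */
bool model_resume_broadcast_oneshot(struct bc_model *m, unsigned int cpu)
{
	assert(cpu < 32);
	uint32_t bit = UINT32_C(1) << cpu;

	if (m->broadcast_mask & bit)
		m->oneshot_mask |= bit;
	m->device_running = true;  /* models switching the device to oneshot mode */
	return (m->oneshot_mask & bit) != 0;
}

/*
 * Shutdown path for a dead CPU: clear its bit, but leave the broadcast
 * device alone even when the oneshot mask becomes empty.
 */
void model_shutdown_broadcast_oneshot(struct bc_model *m, unsigned int cpu)
{
	assert(cpu < 32);
	m->oneshot_mask &= ~(UINT32_C(1) << cpu);
}
```

The point of the second function is what it leaves out: the removed code shut the device down once the oneshot mask was empty, and the patch drops that step.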

2007-09-15 13:44:22

by Thomas Gleixner

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Sat, 2007-09-15 at 03:18 -0700, Andrew Morton wrote:
> On Sat, 15 Sep 2007 11:49:41 +0200 Thomas Gleixner <[email protected]> wrote:
>
> I dropped the timekeeping.c hunks because they are an older version of
> timekeeping-prevent-time-going-backwards-on-resume.patch which I already
> had.

Err, no. The timekeeping hunk is redone due to the lockdep fix which I
made.

Thanks,

tglx


2007-09-15 14:01:53

by Thomas Gleixner

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Sat, 2007-09-15 at 03:18 -0700, Andrew Morton wrote:
> > http://git.kernel.org/?p=linux/kernel/git/tglx/linux-2.6-hrt.git;a=shortlog;h=for-2.6.23
> >
>
> That patch fixes the resume-from-ram and suspend-to-ram regressions on the
> Vaio.
>
> I dropped the timekeeping.c hunks because they are an older version of
> timekeeping-prevent-time-going-backwards-on-resume.patch which I already
> had.
>
> Is this good to go? Needs a bit of changelogging.

Changelog it in the git tree. Please pull from there:

The following changes since commit 53a3f3087be361dacfc02e7a85b6d6142a41ce8a:
Linus Torvalds (1):
Merge branch 'for-linus' of master.kernel.org:/.../cooloney/blackfin-2.6

are available in the git repository at:

ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt.git for-2.6.23

Thomas Gleixner (6):
timekeeping: access rtc outside of xtime lock
timekeeping: Prevent time going backwards on resume
ACPI: Reevaluate C/P/T states when a cpu becomes online
clockevents: Enforce oneshot broadcast when broadcast mask is set on resume
clockevents: do not shutdown the oneshot broadcast device
clockevents: prevent stale tick update on offline cpu

drivers/acpi/processor_core.c | 21 +++++++++++++++++++++
kernel/time/tick-broadcast.c | 24 ++++++++++++++++--------
kernel/time/tick-sched.c | 12 ++++++++++++
kernel/time/timekeeping.c | 10 +++++++++-
4 files changed, 58 insertions(+), 9 deletions(-)


2007-09-15 22:03:20

by Andrew Morton

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

On Sat, 15 Sep 2007 15:28:23 +0200 Thomas Gleixner <[email protected]> wrote:

> On Sat, 2007-09-15 at 03:18 -0700, Andrew Morton wrote:
> > > http://git.kernel.org/?p=linux/kernel/git/tglx/linux-2.6-hrt.git;a=shortlog;h=for-2.6.23
> > >
> >
> > That patch fixes the resume-from-ram and suspend-to-ram regressions on the
> > Vaio.
> >
> > I dropped the timekeeping.c hunks because they are an older version of
> > timekeeping-prevent-time-going-backwards-on-resume.patch which I already
> > had.
> >
> > Is this good to go? Needs a bit of changelogging.
>
> Changelog it in the git tree. Please pull from there:

who, me?

> The following changes since commit 53a3f3087be361dacfc02e7a85b6d6142a41ce8a:
> Linus Torvalds (1):
> Merge branch 'for-linus' of master.kernel.org:/.../cooloney/blackfin-2.6
>
> are available in the git repository at:
>
> ssh://master.kernel.org/pub/scm/linux/kernel/git/tglx/linux-2.6-hrt.git for-2.6.23
>
> Thomas Gleixner (6):
> timekeeping: access rtc outside of xtime lock
> timekeeping: Prevent time going backwards on resume
> ACPI: Reevaluate C/P/T states when a cpu becomes online
> clockevents: Enforce oneshot broadcast when broadcast mask is set on resume
> clockevents: do not shutdown the oneshot broadcast device
> clockevents: prevent stale tick update on offline cpu

please send it to Linus?

2007-10-02 09:46:16

by Pavel Machek

Subject: Re: cpu hotplug support broken in 2.6.23-rc3

Hi!

> > > Venki sent me an initial patch, but it has issues with the notify
> > > ordering. Find below my "cache the broadcast flags" version for testing.
> >
> > Hmmpf, the flag is still cleared when the cpu goes offline. Need to take
> > a closer look.
>
> I finally tracked it down. There were several ways to turn the box into
> a brick. Sigh !
>
> Can you please test the combo patch below ?

Sorry, I was on holidays. I assume this is in -rc9 or so, already?
Yes, seems so.

Unfortunately, cpu hotplug still seems to behave strangely in
-rc9. I can echo 0 > online (and the cpu goes down). I do echo 0 >
online again, and I get -EBUSY. Good. But then I try echo 1 >
online, and get -EBUSY, too... and that's bad :-(.

root@amd:/sys/devices/system/cpu/cpu1# echo 0 > online
root@amd:/sys/devices/system/cpu/cpu1# echo 0 > online
-bash: echo: write error: Device or resource busy
root@amd:/sys/devices/system/cpu/cpu1# echo 1 > online
-bash: echo: write error: Device or resource busy
root@amd:/sys/devices/system/cpu/cpu1# uname -a
Linux amd 2.6.23-rc9 #507 SMP Tue Oct 2 09:58:40 CEST 2007 i686
GNU/Linux

Kernel says:

Oct 2 11:42:12 amd log1n[1436]: ROOT LOGIN on `tty1'
Oct 2 11:42:56 amd kernel: CPU 1 is now offline
Oct 2 11:42:56 amd kernel: SMP alternatives: switching to UP code

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
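
For anyone who wants to script the sequence reported above instead of typing it into a shell, a minimal C sketch follows. It writes "0" or "1" to a sysfs `online` file and returns the negated errno, so the "Device or resource busy" failure shows up as `-EBUSY`. The function name `set_cpu_online` is hypothetical; the path is whatever the caller passes, for example `/sys/devices/system/cpu/cpu1/online` from the session above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "1" or "0" to the given sysfs file.
 * Returns 0 on success or a negative errno value (e.g. -EBUSY).
 */
int set_cpu_online(const char *path, int online)
{
	assert(path != NULL);

	int fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	ssize_t n = write(fd, online ? "1" : "0", 1);
	/* Capture the error before close() can overwrite errno. */
	int err = (n == 1) ? 0 : (n < 0 ? -errno : -EIO);

	close(fd);
	return err;
}
```

Calling it three times on the same cpu with 0, 0 and 1 reproduces the test; on the kernel described here the second and third calls would be expected to return `-EBUSY`. Writing to the real sysfs file needs root.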

2007-11-15 22:37:06

by Pavel Machek

Subject: cpu hotplug strangeness in 2.6.24-rc2 (was Re: cpu hotplug support broken in 2.6.23-rc3)

Hi!

> > > Plus I guess it would be nice to add CPU HOTPLUG into MAINTAINERS
> > > file:
> > >
> >
> > There is a list of maintainers in the Documentation/cpu-hotplug.txt,
> > which includes maintainers for different platforms as well.
> >
> > It's a good idea to add that info to the MAINTAINERS file as well.
>
> Yes, please.

Just an update... In 2.6.24-rc2, cpu hotplug basically works, _but_:

if I do echo 0 > online; echo 0 > online; on the same cpu, I get an
error, and can't bring anything up any more. It is not serious, but it
is not pretty, either.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html