Hi,
I found a bug in CPU hotplug. If `pfmon --system-wide' is running,
CPU hot remove causes a system hang. On the other hand, CPU hot add
while that command is running seems to work fine.
I detected this problem on my ia64 box, and I don't know whether it
also occurs on other arches or with the freezer-based CPU hotplug.
I'm investigating the problem now, and I would be glad if someone
could report test results on other arches or with the freezer.
At the moment I don't know anything about the cause of this problem.
How To Reproduce
================
Assuming you have two terminals, named term A and term B.
1) execute pfmon from term A
$ pfmon --system-wide
<press ENTER to stop session>
2) issue CPU hot remove from term B
# echo 0 >/sys/devices/system/cpu/cpu1/online
3) Press ENTER from term A to stop the pfmon session
Expected Result
===============
The CPU is successfully taken offline.
Actual Result
=============
The system hangs and the following messages appear on the serial console.
-------------------------------------------------------------------------------
CPU0 1925600 CPU_CYCLES
CPU1 could not stop monitoring, CPU may be offline, check results
pfm_write_old_pmds error [3] Device or resource busy
pfmon_read_pmds error Device or resource busy
CPU1 read_results error
-------------------------------------------------------------------------------
Thanks,
Satoru
On Mon, May 28, 2007 at 10:54:43AM +0900, Satoru Takeuchi wrote:
> Since I'm investigating this problem now, I'm glad if someone
> reports the test result on other arch or freezer.
I suspect the freezer-based approach will not have this problem. Gautham
could probably verify that.
Andrew/Linus,
So is it settled now on what approach we are going to follow (freezer
vs lock based) for cpu hotplug? I thought that Linus was not favouring the
freezer-based approach some time back ..
--
Regards,
vatsa
On Mon, 28 May 2007, Srivatsa Vaddagiri wrote:
>
> So is it settled now on what approach we are going to follow (freezer
> vs lock based) for cpu hotplug? I thought that Linus was not favouring the
> freezer-based approach some time back ..
As far as I'm concerned, we should
- use "preempt_disable()" to protect against CPU's coming and going
- use "stop_machine()" or similar that already honors preemption, and
which I trust a whole lot more than freezer.
- .. especially since this is already how we are supposed to be protected
against CPU's going away, and we've already started doing that (for an
example of this, see things like e18f3ffb9c from Andrew)
It really does seem fairly straightforward to make "__cpu_up()" be called
through stop_machine too. Looking at _cpu_down:
mutex_lock(&cpu_bitmask_lock);
p = __stop_machine_run(take_cpu_down, NULL, cpu);
mutex_unlock(&cpu_bitmask_lock);
and then looking at _cpu_up:
mutex_lock(&cpu_bitmask_lock);
ret = __cpu_up(cpu);
mutex_unlock(&cpu_bitmask_lock);
I just go "Aww, wouldn't it be nice to just make that "__cpu_up()" call be
done through __stop_machine_run() too?"
Hmm?
Then, you could get the "cpu_bitmask_lock" if you need to sleep, but if
you don't want to do that (and quite often you don't), just doing a
"preempt_disable()" or taking a spinlock will *also* guarantee that no new
CPU's suddenly show up, so it's safe to look at the CPU online bitmasks.
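(For illustration only -- a minimal sketch of that reader-side pattern;
the function itself is made up, for_each_online_cpu() is the real macro:)

	/* Count online CPUs without sleeping.  preempt_disable() keeps
	 * CPUs from going away underneath us while we walk the map. */
	static int count_online_cpus_sketch(void)
	{
		int cpu, n = 0;

		preempt_disable();
		for_each_online_cpu(cpu)
			n++;
		preempt_enable();

		return n;
	}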
Do we really need anything else?
As mentioned, it's actually fairly easy to add verification calls to make
sure that certain accesses are done with preemption disabled, so..
Linus
On Tue, 2007-05-29 at 13:56 -0700, Linus Torvalds wrote:
>
> On Mon, 28 May 2007, Srivatsa Vaddagiri wrote:
> >
> > So is it settled now on what approach we are going to follow (freezer
> > vs lock based) for cpu hotplug? I thought that Linus was not favouring the
> > freezer-based approach some time back ..
>
> As far as I'm concerned, we should
> - use "preempt_disable()" to protect against CPU's coming and going
> - use "stop_machine()" or similar that already honors preemption, and
> which I trust a whole lot more than freezer.
> - .. especially since this is already how we are supposed to be protected
> against CPU's going away, and we've already started doing that (for an
> example of this, see things like e18f3ffb9c from Andrew)
Indeed, this is how it was supposed to work.
Note that it is possible to make stop_machine() an even larger hammer,
via scheduler modifications that flush all the preempted tasks. This would
drop the requirement for preempt_disable().
But cute as that would be, I've been waiting until someone demonstrates
an actual need...
Rusty.
On Tue, May 29, 2007 at 01:56:24PM -0700, Linus Torvalds wrote:
> As far as I'm concerned, we should
> - use "preempt_disable()" to protect against CPU's coming and going
> - use "stop_machine()" or similar that already honors preemption, and
> which I trust a whole lot more than freezer.
> - .. especially since this is already how we are supposed to be protected
> against CPU's going away, and we've already started doing that (for an
> example of this, see things like e18f3ffb9c from Andrew)
>
> It really does seem fairly straightforward to make "__cpu_up()" be called
> through stop_machine too. Looking at _cpu_down:
>
> mutex_lock(&cpu_bitmask_lock);
> p = __stop_machine_run(take_cpu_down, NULL, cpu);
> mutex_unlock(&cpu_bitmask_lock);
>
> and then looking at _cpu_up:
>
> mutex_lock(&cpu_bitmask_lock);
> ret = __cpu_up(cpu);
> mutex_unlock(&cpu_bitmask_lock);
>
> I just go "Aww, wouldn't it be nice to just make that "__cpu_up()" call be
> done through __stop_machine_run() too?"
>
> Hmm?
>
> Then, you could get the "cpu_bitmask_lock" if you need to sleep,
and that's where all the problems started - sleepers needing to take that mutex
recursively (which we did/do not support).
foo() takes cpu_bitmask_lock and calls
foo_bar() which also needs cpu_bitmask_lock
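Spelled out as (simplified) code, the self-deadlock looks like this:

	void foo_bar(void)
	{
		mutex_lock(&cpu_bitmask_lock);	/* second acquisition ... */
		/* ... work that must exclude cpu hotplug ... */
		mutex_unlock(&cpu_bitmask_lock);
	}

	void foo(void)
	{
		mutex_lock(&cpu_bitmask_lock);
		foo_bar();			/* ... deadlocks right here */
		mutex_unlock(&cpu_bitmask_lock);
	}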
What is a solution to that?
- Forget (hide?) this whole locking mess by using the freezer, which
is what Andrew wanted us to shoot for :) I am somewhat biased
towards Andrew here in that I think it will lead to more stable cpu
hotplug code over time. Again I know some people will beg to differ
on this view.
- Extend mutexes to support recursion (which I gather Linux has
religiously avoided so far); a rough sketch of such a lock appears
below this list.
- Invent a special lock for cpu hotplug which supports recursion.
This is what Gautham tried doing with [1], with the bonus that it
made the lock extremely scalable for readers by using per-cpu
reference counters and RCU. He is preparing to resend those patches
against the latest kernel at the moment.
- Anything else you can think of?
[1] http://lkml.org/lkml/2006/10/26/73
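For the second option, a recursion-aware mutex wrapper could look roughly
like this (illustrative sketch only, names made up, not a proposed API):

	struct rec_mutex {
		struct mutex		lock;
		struct task_struct	*owner;
		int			depth;
	};

	void rec_mutex_lock(struct rec_mutex *m)
	{
		if (m->owner == current) {	/* nested acquisition by the owner */
			m->depth++;
			return;
		}
		mutex_lock(&m->lock);
		m->owner = current;
		m->depth = 1;
	}

	void rec_mutex_unlock(struct rec_mutex *m)
	{
		if (--m->depth)			/* still nested, keep holding it */
			return;
		m->owner = NULL;
		mutex_unlock(&m->lock);
	}

The unlocked owner check is safe because only the owning task could have
set m->owner to current.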
> but if you don't want to do that (and quite often you don't), just doing a
> "preempt_disable()" or taking a spinlock will *also* guarantee that no new
> CPU's suddenly show up, so it's safe to look at the CPU online bitmasks.
>
> Do we really need anything else?
see above
> As mentioned, it's actually fairly easy to add verification calls to make
> sure that certain accesses are done with preemption disabled, so..
--
Regards,
vatsa
On Wed, 30 May 2007, Srivatsa Vaddagiri wrote:
>
> and that's where all the problems started - sleepers needing to take that mutex
> recursively (which we did/do not support).
>
> foo() takes cpu_bitmask_lock and calls
> foo_bar() which also needs cpu_bitmask_lock
>
> What is a solution to that?
The solution is to not have crap locking in the first place.
Or, if you absolutely have to, support recursive locking. But the basic
rule should always be that the need for recursive locking is really
a sign of a locking bug in the first place.
So what's so wrong with just admitting that the locking was crap, and
doing it properly? The _proper_ locking doesn't need recursive locks, it
just takes the lock at the outer layers, and then does *not* take it
internally, because the called routines are called with the lock held and
know they are safe.
We have been able to do that in _every_ single kernel subsystem. What's so
special about cpufreq? NOTHING. Except that the code is sometimes horribly
bad.
Linus
On Wed, May 30, 2007 at 10:03:01AM -0700, Linus Torvalds wrote:
> > and that's where all the problems started - sleepers needing to take that mutex
> > recursively (which we did/do not support).
> >
> > foo() takes cpu_bitmask_lock and calls
> > foo_bar() which also needs cpu_bitmask_lock
> >
> > What is a solution to that?
>
> The solution is to not have crap locking in the first place.
I wish it was that simple wrt cpu hotplug lock :)
It was not just cpufreq which tripped over this locking mess. There were
also flush_workqueue and, more recently, the preempt version of RCU.
flush_workqueue()
	mutex_lock(&hotplug_lock);	/* we can sleep below */
	for_each_online_cpu(cpu)
		flush_cpu_workqueue(cpu);
	mutex_unlock(&hotplug_lock);
The first problem with this snippet was that a workqueue function we are
waiting to flush may itself want to take the same hotplug_lock, which can
deadlock (see http://lkml.org/lkml/2006/12/6/352 for a description of
this).
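For example (hypothetical work function, just to illustrate):

	static void my_work_fn(struct work_struct *work)
	{
		mutex_lock(&hotplug_lock);	/* blocks: flush_workqueue() holds it */
		/* ... */
		mutex_unlock(&hotplug_lock);
	}

	/* Meanwhile flush_workqueue() holds hotplug_lock and waits for
	 * my_work_fn() to complete -> deadlock. */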
The second problem was: keventd is running flush_workqueue() and one of the
work functions calls flush_workqueue() again, which is an attempt to take
hotplug_lock recursively.
Partly the situation is complex because of the intertwining of
so many subsystems around this one lock. In all cases, encoding two versions
of a function, one which takes the hotplug_lock and another which doesn't,
is probably an option. IMHO that would just make the source too ugly and
difficult to audit for cpu hotplug safety.
The best approach would be to go for a lock which supports recursion (and
which avoids the other problems inherent in the old mutex-based lock) or,
as Andrew insisted, the freezer.
> Or, if you absolutely have to, support recursive locking. But the basic
> rule should always be that the need for recursive locking is really
> a sign of a locking bug in the first place.
>
> So what's so wrong with just admitting that the locking was crap, and
> doing it properly? The _proper_ locking doesn't need recursive locks, it
> just takes the lock at the outer layers, and then does *not* take it
> internally, because the called routines are called with the lock held and
> know they are safe.
>
> We have been able to do that in _every_ single kernel subsystem. What's so
> special about cpufreq? NOTHING. Except that the code is sometimes horribly
> bad.
--
Regards,
vatsa
On Tue, May 29, 2007 at 01:56:24PM -0700, Linus Torvalds wrote:
>
>
> On Mon, 28 May 2007, Srivatsa Vaddagiri wrote:
> >
> > So is it settled now on what approach we are going to follow (freezer
> > vs lock based) for cpu hotplug? I thought that Linus was not favouring the
> > freezer-based approach some time back ..
>
> As far as I'm concerned, we should
> - use "preempt_disable()" to protect against CPU's coming and going
> - use "stop_machine()" or similar that already honors preemption, and
> which I trust a whole lot more than freezer.
> - .. especially since this is already how we are supposed to be protected
> against CPU's going away, and we've already started doing that (for an
> example of this, see things like e18f3ffb9c from Andrew)
>
Yes, provided that the code sections which need protection against CPU's
coming and going don't block, we surely can use preempt_disable/preempt_enable
for refcounting. But we do have scenarios where such code sections do
block. Vatsa has quoted a few of them in his mail.
> It really does seem fairly straightforward to make "__cpu_up()" be called
> through stop_machine too. Looking at _cpu_down:
>
> mutex_lock(&cpu_bitmask_lock);
> p = __stop_machine_run(take_cpu_down, NULL, cpu);
> mutex_unlock(&cpu_bitmask_lock);
>
> and then looking at _cpu_up:
>
> mutex_lock(&cpu_bitmask_lock);
> ret = __cpu_up(cpu);
> mutex_unlock(&cpu_bitmask_lock);
>
> I just go "Aww, wouldn't it be nice to just make that "__cpu_up()" call be
> done through __stop_machine_run() too?"
>
Sure, we can do it. But what is it going to solve?
The whole section from CPU_UP/DOWN_PREPARE to CPU_ONLINE/DEAD
is supposed to be atomic and not just __cpu_up/take_cpu_down.
These are the sections where subsystems create/destroy their per-cpu
resources.
Remember, the cpufreq problem was originally caused because we were
releasing the cpu_bitmask_lock before handling CPU_DEAD, where some of
the cpufreq specific per-cpu data was being freed. And thus, we were
operating on stale data in the window between the release of
cpu_bitmask_lock and handling of CPU_DEAD.
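(Schematically, relative to the _cpu_down snippet quoted above:)

	mutex_lock(&cpu_bitmask_lock);
	p = __stop_machine_run(take_cpu_down, NULL, cpu);	/* cpu is offline now */
	mutex_unlock(&cpu_bitmask_lock);
		/* <-- window: anyone taking the lock here can still find and
		 *     use cpufreq's per-cpu data for the dead cpu */
	/* ... only later do the CPU_DEAD notifiers run and cpufreq frees
	 *     that per-cpu data ... */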
> Hmm?
>
> Then, you could get the "cpu_bitmask_lock" if you need to sleep, but if
> you don't want to do that (and quite often you don't), just doing a
> "preempt_disable()" or taking a spinlock will *also* guarantee that no new
> CPU's suddenly show up, so it's safe to look at the CPU online bitmasks.
>
> Do we really need anything else?
>
* We don't need locks/mutexes because (bad) experience tells us that
in this case, locking is not the right model.
* Despite having implemented it, I am not very much convinced about the
freezer, because it hides the cpu-hotplug details from subsystems, which
IMHO is not a good thing.
* Like every other resource, if people don't want a cpu to go down/come up,
they should bump up an associated refcount. But since we need
this refcounting model to allow blocking code sections too, we cannot
use preempt_enable/disable.
Therefore, sir, we do need a nice, scalable refcounting model :)
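To make the shape of that concrete, a deliberately simplified sketch (names
made up; recursion and writer fairness, which the patches in [1] address
with per-cpu counters and RCU, are left out here):

	static DEFINE_MUTEX(cpuhp_mutex);	/* held across a hotplug operation */
	static atomic_t cpuhp_refcount = ATOMIC_INIT(0);
	static DECLARE_WAIT_QUEUE_HEAD(cpuhp_waitq);

	void cpu_hotplug_ref_get(void)		/* callers may block while holding this */
	{
		mutex_lock(&cpuhp_mutex);	/* excludes a hotplug in progress */
		atomic_inc(&cpuhp_refcount);
		mutex_unlock(&cpuhp_mutex);
	}

	void cpu_hotplug_ref_put(void)
	{
		if (atomic_dec_and_test(&cpuhp_refcount))
			wake_up(&cpuhp_waitq);
	}

	/* hotplug (writer) side */
	void cpu_hotplug_begin_write(void)
	{
		mutex_lock(&cpuhp_mutex);	/* no new references from here on */
		wait_event(cpuhp_waitq, atomic_read(&cpuhp_refcount) == 0);
	}

	void cpu_hotplug_end_write(void)
	{
		mutex_unlock(&cpuhp_mutex);
	}

Unlike preempt_disable(), a holder of the reference is free to sleep
between get and put.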
> As mentioned, it's actually fairly easy to add verification calls to make
> sure that certain accesses are done with preemption disabled, so..
>
> Linus
Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"