Subject: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

While running some tests involving simultaneous cpu hotplugging
and changing of cpufreq governors with the 2.6.18-rc4-git1 kernel,
I hit the BUG_ON() in cpufreq_p4_target() (p4-clockmod.c).

------------[ cut here ]------------
kernel BUG at arch/i386/kernel/cpu/cpufreq/p4-clockmod.c:142!
invalid opcode: 0000 [#1]
SMP
Modules linked in: i2c_piix4 aic7xxx sd_mod
CPU: 0
EIP: 0060:[<c100fd7c>] Not tainted VLI
EFLAGS: 00010297 (2.6.18-rc4-git1 #1)
EIP is at cpufreq_p4_target+0xc3/0x12c
eax: f63ba000 ebx: 00000001 ecx: c101e7c3 edx: f63ba000
esi: f757f400 edi: ffffffff ebp: f63bad58 esp: f63bad38
ds: 007b es: 007b ss: 0068
Process bash (pid: 4432, ti=f63ba000 task=f5daa030 task.ti=f63ba000)
Stack: 00000008 00000001 0004c4b4 002625a0 00000002 c100fcb9 f757f400 00000001
f63bad74 c120d0d8 ffffffea 002625a0 f757f400 00000001 00000000 f63bad94
c120de4c 00000004 c12d8448 c12dfd38 002625a0 00000001 f757f400 f63badbc
Call Trace:
[<c1004cf4>] show_stack_log_lvl+0x87/0x8f
[<c1004e64>] show_registers+0x125/0x18e
[<c1005057>] die+0x116/0x1e1
[<c1281812>] do_trap+0x7c/0x96
[<c1005312>] do_invalid_op+0x95/0x9c
[<c10049e1>] error_code+0x39/0x40
[<c120d0d8>] __cpufreq_driver_target+0x54/0x62
[<c120de4c>] cpufreq_governor_performance+0x34/0x40
[<c120d194>] __cpufreq_governor+0x57/0xea
[<c120d45d>] __cpufreq_set_policy+0x151/0x1bd
[<c120c463>] store_scaling_governor+0x83/0xb8
[<c120c630>] store+0x38/0x49
[<c109f384>] flush_write_buffer+0x23/0x2b
[<c109f3dc>] sysfs_write_file+0x50/0x71
[<c1067c00>] vfs_write+0xab/0x153
[<c1067d43>] sys_write+0x3b/0x60
[<c1003d29>] sysenter_past_esp+0x56/0x8d
Code: 00 83 f8 1f 89 c3 7f 47 89 c1 89 e0 ba 01 00 00 00 25 00 f0 ff ff d3 e2 8b 00 e8 c8 e9 00 00 89 e0 25 00 f0 ff ff 39 58 10 74 08 <0f> 0b 8e 00 43 ff 29 c1 8b 45 e0 8b 14 c5 c0 ee 2f c1 89 d8 e8
EIP: [<c100fd7c>] cpufreq_p4_target+0xc3/0x12c SS:ESP 0068:f63bad38
<7>kobject msr1: cleaning up
kobject_uevent
fill_kobj_path: path = '/class/cpuid/cpu1'
kobject cpu1: cleaning up
kobject_uevent
fill_kobj_path: path = '/devices/system/cpu/cpu1'
------------[ cut here ]------------

The problem occurred due to a race window opened up by the two-lock scheme
in cpu_down().

    mutex_lock(&cpu_bitmask_lock);
    p = __stop_machine_run(take_cpu_down, NULL, cpu);
    mutex_unlock(&cpu_bitmask_lock);
    <snip>
    /* Introduces a window here, where stale data (which _will_ be cleaned
     * up in CPU_DEAD processing) can be accessed, causing unpleasant
     * things to occur. This window is opened because of the two-lock
     * scheme in cpu_up and cpu_down.
     */

    /* CPU is completely dead: tell everyone. Too late to complain. */
    if (blocking_notifier_call_chain(&cpu_chain, CPU_DEAD,
                (void *)(long)cpu) == NOTIFY_BAD)

The hotplug operation is not complete *until* a CPU_DEAD notification has been
sent to all the subscribers (cpufreq being one of them).
As a result, after __stop_machine_run() on cpuX there is a window in which
cpufreq can go ahead and try to change the frequency on cpuX, since the bit
corresponding to cpuX is still set in the policy->cpus mask; it is cleared only
on receipt of the CPU_DEAD notification.

In the future we may face similar scenarios, where a subsystem maintains a
snapshot of the online cpus that is updated only on a CPU_UP_PREPARE or a
CPU_DEAD event, allowing other tasks to perform some *nasty* operation in the
window between __stop_machine_run() and the CPU_DEAD notification.

To solve this, I eliminated mutex_lock(&cpu_add_remove_lock) from both
cpu_down and cpu_up and moved mutex_lock(&cpu_bitmask_lock) into its place.
The locks surrounding __stop_machine_run() and __cpu_up() were removed.
This, however, gave rise to a new set of "Lukewarm IQ" warnings emanating from
cpufreq_cpu_callback and cpufreq_stat_cpu_callback.

The offending functions were cpufreq_set_policy and cpufreq_driver_target
(called from cpufreq_cpu_callback) and cpufreq_update_policy (called from
cpufreq_stat_cpu_callback), on CPU_ONLINE and CPU_DOWN_PREPARE events.
These offenders call lock_cpu_hotplug() (YUCK!).

The possible solutions to cpu_hotplug "locking" summarized from the
previous discussion threads are:

i) Clean up the whole code and ensure there are *NO* recursive lock takers.

ii) Have a per-subsystem cpu_hotplug_lock and let cpu_down(/up) acquire
all these locks before performing a hotplug.

iii) Implement cpu_hotplug_lock as Refcount + Waitqueue.

(i) is ugly. We've seen Arjan take a stab at cleaning up the workqueue +
ondemand mess.

(ii), though recently introduced in workqueue.c, will only lead to more
deadlock scenarios, since more locks would have to be acquired.

Eg: workqueue + cpufreq(ondemand) ABBA deadlock scenario. Consider
- task1: echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
- task2: echo 0 > /sys/devices/system/cpu/cpu1/online
entering the system at the same time.

task1: calls store_scaling_governor, which takes lock_cpu_hotplug.

task2: through a blocking_notifier_call_chain(CPU_DOWN_PREPARE) to the
workqueue subsystem, acquires workqueue_mutex.

task1: calls create_workqueue from cpufreq_governor_dbs [ondemand],
which tries to take workqueue_mutex but can't.

task2: through blocking_notifier_call_chain(CPU_DOWN_PREPARE), calls
cpufreq_driver_target (cpufreq subsystem), which tries to take lock_cpu_hotplug
but can't, since it is already held by task1.

A typical ABBA deadlock!!!

Replicating this per-subsystem cpu-hotplug lock model across all other
cpu_hotplug-sensitive subsystems would only create more such problems, as
notifier_callback does not follow any specific ordering while notifying
the subsystems.

(iii) makes sense, as it is closest to RWSEMs.
So here's an implementation along the lines of (iii).

There are two types of tasks interested in cpu_hotplug
- ones who want to *prevent* a hotplug event.
- ones who want to *perform* a cpu hotplug.

For the sake of simplicity let's call the former readers (though I would
have preferred inhibitors or something fancier!) and the latter writers.
Let write operation = cpu_hotplug operation.

- The protocol is analogous to an RWSEM, *only not so fair* (a minimal
  implementation sketch follows this list).
- Readers assume control iff:
  a) No other reader holds a reference and no writer is writing.
  OR
  b) At least one reader holds a reference.
- A reader is blocked iff a write operation is ongoing.
- A writer gets to perform a write iff:
  *No* reader holds a reference AND no writer is writing.
- In any other case, the writer is blocked.
- A writer, on completion, wakes up waiting writers in preference to
  waiting readers.
- The *last* reader wakes up the first waiting writer.
- The *last* writer wakes up all the waiting readers.
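To make the protocol concrete, here is a minimal sketch of one possible
refcount + waitqueue implementation (an illustration only - patch 3/4 is the
real implementation and its names and details differ; the sketch also glosses
over memory-ordering niceties):

    #include <linux/spinlock.h>
    #include <linux/wait.h>

    static DEFINE_SPINLOCK(hotplug_lock);      /* protects the counters below */
    static int readers;                        /* tasks currently preventing hotplug */
    static int writer_active;                  /* 1 while a hotplug operation runs */
    static int writers_waiting;
    static DECLARE_WAIT_QUEUE_HEAD(reader_queue);
    static DECLARE_WAIT_QUEUE_HEAD(writer_queue);

    void cpu_hotplug_disable(void)             /* "reader" */
    {
        spin_lock(&hotplug_lock);
        while (writer_active) {                /* blocked only while a write is ongoing */
            spin_unlock(&hotplug_lock);
            wait_event(reader_queue, !writer_active);
            spin_lock(&hotplug_lock);
        }
        readers++;
        spin_unlock(&hotplug_lock);
    }

    void cpu_hotplug_enable(void)
    {
        spin_lock(&hotplug_lock);
        if (!--readers && writers_waiting)     /* last reader wakes the first waiting writer */
            wake_up(&writer_queue);
        spin_unlock(&hotplug_lock);
    }

    void cpu_hotplug_begin(void)               /* "writer" */
    {
        spin_lock(&hotplug_lock);
        writers_waiting++;
        while (readers || writer_active) {     /* need: no readers, no other writer */
            spin_unlock(&hotplug_lock);
            wait_event(writer_queue, !readers && !writer_active);
            spin_lock(&hotplug_lock);
        }
        writers_waiting--;
        writer_active = 1;
        spin_unlock(&hotplug_lock);
    }

    void cpu_hotplug_done(void)
    {
        spin_lock(&hotplug_lock);
        writer_active = 0;
        if (writers_waiting)                   /* prefer waiting writers over readers */
            wake_up(&writer_queue);
        else
            wake_up_all(&reader_queue);        /* last writer wakes all waiting readers */
        spin_unlock(&hotplug_lock);
    }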

Rules:
1) Those intending to prevent a hotplug operation will call
cpu_hotplug_disable() and cpu_hotplug_enable(), in place of
lock_cpu_hotplug() and unlock_cpu_hotplug() respectively.

2) Those intending to perform a hotplug operation (cpu_up and cpu_down)
will call cpu_hotplug_begin() and cpu_hotplug_done() instead of
mutex_lock(&cpu_add_remove_lock) and mutex_unlock(&cpu_add_remove_lock)
respectively.

3) The callbacks *WILL NOT* try to cpu_hotplug_disable(), because
they are called from a point where the hotplug operation has already
begun. So cpus won't disappear until they return. (A brief usage sketch
follows.)
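For illustration, the two kinds of users would then look roughly like this
(function names other than the four new primitives are made up for the
example):

    /* A "reader" (Rule 1): keep CPUs from disappearing while walking a mask. */
    static void frob_all_policy_cpus(struct cpufreq_policy *policy)
    {
        int cpu;

        cpu_hotplug_disable();
        for_each_cpu_mask(cpu, policy->cpus) {
            /* cpu stays online until cpu_hotplug_enable() below */
        }
        cpu_hotplug_enable();
    }

    /* A "writer" (Rule 2): the hotplug path itself. */
    static int example_cpu_down(unsigned int cpu)
    {
        int err;

        cpu_hotplug_begin();
        err = take_the_cpu_down(cpu);   /* made-up helper; the notifier
                                         * callbacks run in here and, per
                                         * Rule 3, must not call
                                         * cpu_hotplug_disable() */
        cpu_hotplug_done();

        return err;
    }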

[patch 1/4]: Cleans up the cpufreq callback code so that Rule (3) is honoured.

[patch 2/4]: Reverts the changes made to kernel/workqueue.c so that Rule (3)
is honoured.

[patch 3/4]: Implements the REFCOUNT + WAITQUEUE implementation of cpu_hotplug
"locking". Honours Rule (2). This is the important patch!

[patch 4/4]: Renames lock_cpu_hotplug to cpu_hotplug_disable and
unlock_cpu_hotplug to cpu_hotplug_enable throughout the kernel, so that
Rule (1) is honoured.

The patches are against linux-2.6.18-rc4-git1.

Awaiting your feedback.

Regards
ego
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"


2006-08-24 10:41:43

by Ingo Molnar

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.


* Gautham R Shenoy <[email protected]> wrote:

> So here's an implementation on the lines of (c).
>
> There are two types of tasks interested in cpu_hotplug
> - ones who want to *prevent* a hotplug event.
> - ones who want to *perform* a cpu hotplug.
>
> For sake of simplicity let's call the former ones readers (though I
> would have prefered inhibitors or somthing fancier!) and latter ones
> writers. Let write operation = cpu_hotplug operation.
>
> -The protocol is analogous to RWSEM, *only not so fair* .

really nice! I'm quite sure that this is the right way to approach this
problem.

Please add the appropriate lock_acquire()/lock_release() calls into the
new sleeping semaphore type. Just use the parameters that rwlocks use:

#define rwlock_acquire(l, s, t, i) lock_acquire(l, s, t, 0, 2, i)
#define rwlock_acquire_read(l, s, t, i) lock_acquire(l, s, t, 2, 2, i)

and lockdep will allow recursive read-locking. You'll also need a
lockdep_init_map() call to register the lock with lockdep. Then your
locking scheme will be fully checked by lockdep too. (with your current
code the new lock type is not added to the lock dependency graph(s))
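For example, wired into the scheme from your cover letter this could look
roughly like so (a sketch only, against the 2.6.18 lockdep interfaces; where
exactly the map lives and the structure names are up to you):

    #include <linux/lockdep.h>

    static struct lock_class_key cpu_hotplug_key;
    static struct lockdep_map cpu_hotplug_map;

    static void cpu_hotplug_lockdep_init(void)
    {
        /* register the new lock type with lockdep */
        lockdep_init_map(&cpu_hotplug_map, "cpu_hotplug", &cpu_hotplug_key);
    }

    void cpu_hotplug_disable(void)
    {
        /* read-acquire, recursion allowed -- same parameters as rwlocks */
        rwlock_acquire_read(&cpu_hotplug_map, 0, 0, _RET_IP_);
        /* ... refcount + waitqueue logic ... */
    }

    void cpu_hotplug_enable(void)
    {
        /* ... refcount + waitqueue logic ... */
        rwlock_release(&cpu_hotplug_map, 1, _RET_IP_);
    }

    void cpu_hotplug_begin(void)
    {
        rwlock_acquire(&cpu_hotplug_map, 0, 0, _RET_IP_);
        /* ... wait for readers and other writers to drain ... */
    }

    void cpu_hotplug_done(void)
    {
        /* ... wake up waiters ... */
        rwlock_release(&cpu_hotplug_map, 1, _RET_IP_);
    }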

Ingo

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Thu, Aug 24, 2006 at 12:34:12PM +0200, Ingo Molnar wrote:
> Please add the appropriate lock_acquire()/lock_release() calls into the
> new sleeping semaphore type. Just use the parameters that rwlocks use:
>
> #define rwlock_acquire(l, s, t, i) lock_acquire(l, s, t, 0, 2, i)
> #define rwlock_acquire_read(l, s, t, i) lock_acquire(l, s, t, 2, 2, i)

> and lockdep will allow recursive read-locking. You'll also need a
> lockdep_init_map() call to register the lock with lockdep. Then your
> locking scheme will be fully checked by lockdep too. (with your current
> code the new lock type is not added to the lock dependency graph(s))

I'm on it. :)

Thanks
ego
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2006-08-24 11:32:13

by Ingo Molnar

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.


* Gautham R Shenoy <[email protected]> wrote:

> On Thu, Aug 24, 2006 at 12:34:12PM +0200, Ingo Molnar wrote:
> > Please add the appropriate lock_acquire()/lock_release() calls into the
> > new sleeping semaphore type. Just use the parameters that rwlocks use:
> >
> > #define rwlock_acquire(l, s, t, i) lock_acquire(l, s, t, 0, 2, i)
> > #define rwlock_acquire_read(l, s, t, i) lock_acquire(l, s, t, 2, 2, i)
>
> > and lockdep will allow recursive read-locking. You'll also need a
> > lockdep_init_map() call to register the lock with lockdep. Then your
> > locking scheme will be fully checked by lockdep too. (with your current
> > code the new lock type is not added to the lock dependency graph(s))
>
> I'm on it. :)

you'll also need to add a dep_map to the cpu_hotplug structure itself.

and i think this extension of lockdep to the new lock type will be
invariant with the per-cpu optimizations i suggested in the previous
mail: because it's only the scalability of cpu_hotplug_lock() that will
improve [dramatically], its locking semantics wont.

Ingo

2006-08-24 16:18:12

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Thu, 24 Aug 2006 15:56:18 +0530
Gautham R Shenoy <[email protected]> wrote:

> While running some tests involving simultaneous cpu_hotplugging
> and changing of cpu_freq governors with 2.6.18-rc4-git1 kernel,
> I hit the BUG_ON() in the cpufreq_p4_target(p4-clockmod.c).

cpufreq remains borked.

> The problem occured due to a race window opened up by the 2-lock scheme
> in cpu_down().

It was opened up by incorrect code in cpufreq.

> In future, we may face simillar scenarios where the subsystems are
> maintaining the snapshot of the online cpus and the snapshot is updated only on a
> CPU_UP_PREPARE or a CPU_DEAD event, allowing other tasks to perform some *nasty*
> operation between __stop_machine()_run and CPU_DEAD notification.

If those subsystems are buggy, sure.

> To solve this, I eliminated mutex_lock(&cpu_add_remove_lock) from both
> cpu_down and cpu_up and moved mutex_lock(&cpu_bitmask_lock) in its place.
> The locks surrounding __stop_machin_run() and __cpu_up() were removed.
> This, however gave rise to new set of LUKEWARM IQ's emanating from
> cpufreq_cpu_callback and cpufreq_stat_cpu_callback.

Better to eliminate lock_cpu_hotplug().

> The offending functions were (cpufreq_set_policy &cpufreq_driver_target) and
> cpufreq_update_policy which were being called from cpufreq_cpu_callback and
> cpufreq_stat_cpu_callback respectively on CPU_ONLINE and CPU_DOWN_PREPARE
> events. These offenders call lock_cpu_hotplug ( YUCK!)
>
> The possible solutions to cpu_hotplug "locking" summarized from the
> previous discussion threads are:
>
> i) Clean up the whole code and ensure there are *NO* recursive lock takers.
>
> ii) Have a per-subsystem cpu_hotplug_lock and let cpu_down(/up) acquire
> all these locks before performing a hotplug.
>
> iii) Implement cpu_hotplug_lock as Refcount + Waitqueue.
>
> (i) is ugly. We've seen Arjan give a try at cleaning up workqueue + ondemand
> mess.

Well yeah.

> (ii) Though has been introduced recently in workqueue.c , it will only lead to
> more deadlock scenarios since more locks would have to be acquired.
>
> Eg: workqueue + cpufreq(ondemand) ABBA deadlock scenario. Consider
> - task1: echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> - task2: echo 0 > /sys/devices/system/cpu/cpu1/online
> entering the system at the same time.
>
> task1: calls store_scaling_governor which takes lock_cpu_hotplug.
>
> task2: thru a blocking_notifier_call_chain(CPU_DOWN_PREPARE)
> to workqueue subsystem holds workqueue_mutex.
>
> task1: calls create_workqueue from cpufreq_governor_dbs[ondemand]
> which tries taking workqueue_mutex but can't.
>
> task2: thru blocking_notifier_call_chain(CPU_DOWN_PREPARE) calls
> cpufreq_driver_target(cpufreq subsystem),which tries to take lock_cpu_hotplug
> but cant since it is already taken by task1.
>
> A typical ABBA deadlock!!!

That's a bug in cpufreq. It should stop using lock_cpu_hotplug() and get
its locking sorted out.

> Replicating this persubsystem-cpu-hotplug lock model across all other
> cpu_hotplug-sensitive subsystems would only create more such problems as
> notifier_callback does not follow any specific ordering while notifying
> the subsystems.

The ordering is sufficiently well-defined: it's the order of the chain. I
don't expect there to be any problems here.

Yes, the chain can change while a hotadd or hotremove is happening. But
that's already a bug. We need to take cpu_add_remove_lock in
[un]register_cpu_notifier().

> (iii) makes sense as it is closest to RWSEMs.

argh.

box:/usr/src/linux-2.6.18-rc4> grep -r online . | grep cpu | wc -l
793

We're going to get into a mess if we try to fit all that stuff into using
some global unified locking scheme.

There might be scalability problems too.

There is a large amount of code in there which is presently racy against
cpu hotplug. Once we have a scheme in place, those code sites need to be
reviewed and, if necessary, fixed.

> So here's an implementation on the lines of (c).
>
> There are two types of tasks interested in cpu_hotplug
> - ones who want to *prevent* a hotplug event.
> - ones who want to *perform* a cpu hotplug.

Well. There are tasks which wish to protect a per-cpu resource. They
should do that by taking a per-resource lock. As an optional,
cpu-hotplug-specific super-fast optimisation they can also do this by
running atomically.

Look, cpufreq locking is a mess. I don't think we'll improve that by
adding new types of locks which hopefully work OK in the presence of
cpufreq's abuse. Better to do the hard work, pick apart cpufreq, work out
specifically which per-cpu resources are being used, design a locking
scheme for each one, document it and then implement it.

For instance: why does the drivers/cpufreq/cpufreq.c:store_one() macro take
lock_cpu_hotplug()? There are no per-cpu resources in sight! It's quite
inappropriate. The developers were (mis)led into doing this by the
seductive lock_kernel()iness of lock_cpu_hotplug(). Over the years we've
learnt to not do this sort of thing. lock_cpu_hotplug() was a bad idea.

We already have sufficient locking primitives to get this right. Let's fix
cpufreq locking rather than introduce complex new primitives which we hope
will work in the presence of the existing mess.

Step 1: remove all mention of lock_cpu_hotplug() from cpufreq.

Step 2: work out what data needs to be locked, and how.

Step 3: implement.

Let me re-re-re-re-emphasise: this is a cpufreq bug. All of it.

2006-08-25 09:50:54

by Dave Jones

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Thu, Aug 24, 2006 at 09:17:04AM -0700, Andrew Morton wrote:
> We already have sufficient locking primitives to get this right. Let's fix
> cpufreq locking rather than introduce complex new primitives which we hope
> will work in the presence of the existing mess.
>
> Step 1: remove all mention of lock_cpu_hotplug() from cpufreq.
> Step 2: work out what data needs to be locked, and how.
> Step 3: implement.

this is what I planned to do weeks ago when this mess first blew up.
I even went as far as sending Linus a patch for (1).
He seemed really gung-ho about trying to fix up the current mess though,
and with each incarnation since, I've been convinced we're making
the problem worse rather than really improving anything.

Dave

--
http://www.codemonkey.org.uk

2006-08-26 21:10:28

by Linus Torvalds

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.



On Fri, 25 Aug 2006, Dave Jones wrote:
>
> On Thu, Aug 24, 2006 at 09:17:04AM -0700, Andrew Morton wrote:
> > We already have sufficient locking primitives to get this right. Let's fix
> > cpufreq locking rather than introduce complex new primitives which we hope
> > will work in the presence of the existing mess.
> >
> > Step 1: remove all mention of lock_cpu_hotplug() from cpufreq.
> > Step 2: work out what data needs to be locked, and how.
> > Step 3: implement.
>
> this is what I planned to do weeks ago when this mess first blew up.
> I even went as far as sending Linus a patch for (1).
> He seemed really gung-ho about trying to fix up the current mess though,
> and with each incarnation since, I've been convinced we're making
> the problem worse rather than really improving anything.

I definitely want to have this fixed, and Gautham's patches look like a
good thing to me, but the "trying to fix up the current mess" part was
really about trying to get 2.6.18 in a mostly working state rather than
anything else. I think it's been too late to try to actually _fix_ it for
2.6.18 for a long time already.

So my reaction is that this redesign should go in asap after 2.6.18,
unless people feel strongly that the current locking has so many bugs that
people can actually _hit_ in practice that it's better to go for the
redesign early.

I personally doubt that it's the case that we'd want to accelerate
inclusion - very few things actually do CPU hotplug, and right now the
only way to even hit the sequences in normal use is literally just the
"suspend under SMP" case that hasn't historically worked very well anyway,
but was what made at least me personally aware of the problems ;^).

Linus

2006-08-26 22:05:06

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sat, 26 Aug 2006 14:09:55 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:

>
>
> On Fri, 25 Aug 2006, Dave Jones wrote:
> >
> > On Thu, Aug 24, 2006 at 09:17:04AM -0700, Andrew Morton wrote:
> > > We already have sufficient locking primitives to get this right. Let's fix
> > > cpufreq locking rather than introduce complex new primitives which we hope
> > > will work in the presence of the existing mess.
> > >
> > > Step 1: remove all mention of lock_cpu_hotplug() from cpufreq.
> > > Step 2: work out what data needs to be locked, and how.
> > > Step 3: implement.
> >
> > this is what I planned to do weeks ago when this mess first blew up.
> > I even went as far as sending Linus a patch for (1).
> > He seemed really gung-ho about trying to fix up the current mess though,
> > and with each incarnation since, I've been convinced we're making
> > the problem worse rather than really improving anything.
>
> I definitely want to have this fixed, and Gautham's patches look like a
> good thing to me, but the "trying to fix up the current mess" part was
> really about trying to get 2.6.18 in a mostly working state rather than
> anything else. I think it's been too late to try to actually _fix_ it for
> 2.6.18 for a long time already.
>
> So my reaction is that this redesign should go in asap after 2.6.18,
> unless people feel strongly that the current locking has so many bugs that
> people can actually _hit_ in practice that it's better to go for the
> redesign early.

It certainly needs a redesign. A new sort of lock which makes it appear to
work won't fix races like:

    int cpufreq_update_policy(unsigned int cpu)
    {
        struct cpufreq_policy *data = cpufreq_cpu_get(cpu);

        ...

        lock_cpu_hotplug();


I rather doubt that anyone will be hitting the races in practice. I'd
recommend simply removing all the lock_cpu_hotplug() calls for 2.6.18.

2006-08-26 22:13:40

by Ingo Molnar

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.


* Linus Torvalds <[email protected]> wrote:

> I personally doubt that it's the case that we'd want to accelerate
> inclusion - very few things actually do CPU hotplug, and right now the
> only way to even hit the sequences in normal use is literally just the
> "suspend under SMP" case that hasn't historically worked very well
> anyway, but was what made at least me personally aware of the problems
> ;^).

there's also bootup on SMP that is technically a series of hot-cpu-add
events. That already tests some aspects of it. Maybe we should turn
shutdown into the logical reverse: into a series of hot-cpu-remove
events?

Ingo

2006-08-26 22:25:28

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, 27 Aug 2006 00:05:25 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Linus Torvalds <[email protected]> wrote:
>
> > I personally doubt that it's the case that we'd want to accelerate
> > inclusion - very few things actually do CPU hotplug, and right now the
> > only way to even hit the sequences in normal use is literally just the
> > "suspend under SMP" case that hasn't historically worked very well
> > anyway, but was what made at least me personally aware of the problems
> > ;^).
>
> there's also bootup on SMP that is technically a series of hot-cpu-add
> events. That already tests some aspects of it. Maybe we should turn
> shutdown into the logical reverse: into a series of hot-cpu-remove
> events?
>

Would be logical, but we would want to avoid making CONFIG_HOTPLUG_CPU a
requirement for SMP.

2006-08-27 06:11:50

by Dipankar Sarma

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sat, Aug 26, 2006 at 03:04:22PM -0700, Andrew Morton wrote:
> On Sat, 26 Aug 2006 14:09:55 -0700 (PDT)
> Linus Torvalds <[email protected]> wrote:
>
> > I definitely want to have this fixed, and Gautham's patches look like a
> > good thing to me, but the "trying to fix up the current mess" part was
> > really about trying to get 2.6.18 in a mostly working state rather than
> > anything else. I think it's been too late to try to actually _fix_ it for
> > 2.6.18 for a long time already.
> >
> > So my reaction is that this redesign should go in asap after 2.6.18,
> > unless people feel strongly that the current locking has so many bugs that
> > people can actually _hit_ in practice that it's better to go for the
> > redesign early.
>
> It certainly needs a redesign. A new sort of lock which makes it appear to
> work won't fix races like:
>
> int cpufreq_update_policy(unsigned int cpu)
> {
> struct cpufreq_policy *data = cpufreq_cpu_get(cpu);
>
> ...
>
> lock_cpu_hotplug();
>

The problem with cpufreq was that it used lock_cpu_hotplug()
in some common routines because it
was needed in some paths and then also called the same routines
from the CPU hotplug callback path. That is easily fixable and
Gautham's patch 1/4 does exactly that.
One thing I have privately suggested to Gautham is to do an audit
of bad lock_cpu_hotplug() uses.

Now coming to the read-side of lock_cpu_hotplug() - cpu hotplug
is a very special asynchronous event. You cannot protect against
it using your own subsystem lock because you don't control
access to cpu_online_map. With multiple low-level subsystems
needing it, it also becomes difficult to work out the lock
hierarchies. The right way to do this is what Gautham and Ingo
are discussing - a scalable rw semaphore type lock that allows
recursive readers.

>
> I rather doubt that anyone will be hitting the races in practice. I'd
> recommend simply removing all the lock_cpu_hotplug() calls for 2.6.18.

I don't think that is a good idea. The right thing to do would be to
do an audit and clean up the bad lock_cpu_hotplug() calls. People
seem to have just got lazy with lock_cpu_hotplug().

Thanks
Dipankar

2006-08-27 06:46:49

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, 27 Aug 2006 11:41:55 +0530
Dipankar Sarma <[email protected]> wrote:

> Now coming to the read-side of lock_cpu_hotplug() - cpu hotplug
> is a very special asynchronous event. You cannot protect against
> it using your own subsystem lock because you don't control
> access to cpu_online_map.

Yes you do. Please, read _cpu_up(), _cpu_down() and the example in
workqueue_cpu_callback(). It's really very simple.

> With multiple low-level subsystems
> needing it, it also becomes difficult to work out the lock
> hierarchies.

That'll matter if we do crappy code. Let's not do that?

> >
> > I rather doubt that anyone will be hitting the races in practice. I'd
> > recommend simply removing all the lock_cpu_hotplug() calls for 2.6.18.
>
> I don't think that is a good idea.

The code's already racy and I don't think anyone has reported a
cpufreq-vs-hotplug race.

> The right thing to do would be to
> do an audit and clean up the bad lock_cpu_hotplug() calls.

No, that won't fix it. For example, take a look at all the *callers* of
cpufreq_update_policy(). AFAICT they're all buggy. Fiddling with the
existing lock_cpu_hotplug() sites won't fix that. (Possibly this
particular problem can be fixed by checking that the relevant CPU is still
online after the appropriate locking has been taken - dunno).

It needs to be ripped out and some understanding, thought and design should
be applied to the problem.

> People
> seem to have just got lazy with lock_cpu_hotplug().

That's because lock_cpu_hotplug() purports to be some magical thing which
makes all your troubles go away.

2006-08-27 07:11:12

by Dipankar Sarma

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sat, Aug 26, 2006 at 11:46:18PM -0700, Andrew Morton wrote:
> On Sun, 27 Aug 2006 11:41:55 +0530
> Dipankar Sarma <[email protected]> wrote:
>
> > Now coming to the read-side of lock_cpu_hotplug() - cpu hotplug
> > is a very special asynchronous event. You cannot protect against
> > it using your own subsystem lock because you don't control
> > access to cpu_online_map.
>
> Yes you do. Please, read _cpu_up(), _cpu_down() and the example in
> workqueue_cpu_callback(). It's really very simple.

What are you talking about here? That is the write side. You are
*not* supposed to do lock_cpu_hotplug() in cpu callback paths AFAICT.
If someone does it (like cpufreq did), it is just *wrong*.

> > With multiple low-level subsystems
> > needing it, it also becomes difficult to work out the lock
> > hierarchies.
>
> That'll matter if we do crappy code. Let's not do that?

I am talking about read sides here - you read cpu_online_map and
block, then reuse the map and make some calls to another subsystem
that may again take a similar read-side cpu_hotplug lock. I suspect
that it is hard to get rid of all possible dependencies.

> > > I rather doubt that anyone will be hitting the races in practice. I'd
> > > recommend simply removing all the lock_cpu_hotplug() calls for 2.6.18.
> >
> > I don't think that is a good idea.
>
> The code's already racy and I don't think anyone has reported a
> cpufreq-vs-hotplug race.

Do cpu hotplug on one cpu and change frequencies on another -
I think Gautham has a script to reproduce this. Besides,
lockdep apparently complains about it -

http://marc.theaimsgroup.com/?l=linux-kernel&m=115359728428432&w=2

> > The right thing to do would be to
> > do an audit and clean up the bad lock_cpu_hotplug() calls.
>
> No, that won't fix it. For example, take a look at all the *callers* of
> cpufreq_update_policy(). AFAICT they're all buggy. Fiddling with the
> existing lock_cpu_hotplug() sites won't fix that. (Possibly this
> particular problem can be fixed by checking that the relevant CPU is still
> online after the appropriate locking has been taken - dunno).
>
> It needs to be ripped out and some understanding, thought and design should
> be applied to the problem.

Really, the hotplug locking rules are fairly simple-

1. If you are in the cpu hotplug callback path, don't take any lock.

2. If you are in a non-hotplug path reading cpu_online_map and you don't
block, you just disable preemption and you are safe from hotplug (see the
sketch below).

3. If you are in a non-hotplug path and you use cpu_online_map and you
*really* need to block, you use lock_cpu_hotplug() or cpu_hotplug_disable(),
whatever it is called.

Is this too difficult for people to follow?
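For instance, rule 2 in practice amounts to no more than this (an
illustrative sketch, the function is made up):

    static int count_online_cpus_example(void)
    {
        int cpu, n = 0;

        preempt_disable();      /* a hotplug operation cannot complete while
                                 * we are in this non-preemptible section */
        for_each_online_cpu(cpu)
            n++;
        preempt_enable();

        return n;
    }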

>
> > People
> > seem to have just got lazy with lock_cpu_hotplug().
>
> That's because lock_cpu_hotplug() purports to be some magical thing which
> makes all your troubles go away.

No it doesn't. Perhaps we should just document the rules better
and put some static checks for people to get it right.

Thanks
Dipankar

2006-08-27 07:37:24

by Dipankar Sarma

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sat, Aug 26, 2006 at 11:46:18PM -0700, Andrew Morton wrote:
> On Sun, 27 Aug 2006 11:41:55 +0530
>
> > The right thing to do would be to
> > do an audit and clean up the bad lock_cpu_hotplug() calls.
>
> No, that won't fix it. For example, take a look at all the *callers* of
> cpufreq_update_policy(). AFAICT they're all buggy. Fiddling with the
> existing lock_cpu_hotplug() sites won't fix that. (Possibly this
> particular problem can be fixed by checking that the relevant CPU is still
> online after the appropriate locking has been taken - dunno).
>

This is a different issue from the ones that relates to lock_cpu_hotplug().
This one seems like a cpufreq internal locking problem.

On a quick look at this, it seems to me that cpufreq_cpu_get() should
do exactly what you said - use a spinlock in each cpufreq_cpu_data[] to
protect the per-cpu flag and in cpufreq_cpu_get() check if
!data and data->online == 0. They may have to do -

    static struct cpufreq_data {
        spinlock_t lock;
        int flag;
        struct cpufreq_policy *policy;
    } cpufreq_cpu_data[NR_CPUS];
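
and then a sketch of the lookup could be (illustrative only - the field names
above are guesses, and the real cpufreq_cpu_get() also takes module and
kobject references):

    static struct cpufreq_policy *cpufreq_cpu_get(unsigned int cpu)
    {
        struct cpufreq_data *data = &cpufreq_cpu_data[cpu];
        struct cpufreq_policy *policy = NULL;

        spin_lock(&data->lock);
        if (data->policy && data->flag)   /* entry present and still online */
            policy = data->policy;        /* real code would take a reference here */
        spin_unlock(&data->lock);

        return policy;
    }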


Thanks
Dipankar

2006-08-27 07:42:47

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, 27 Aug 2006 12:41:16 +0530
Dipankar Sarma <[email protected]> wrote:

> On Sat, Aug 26, 2006 at 11:46:18PM -0700, Andrew Morton wrote:
> > On Sun, 27 Aug 2006 11:41:55 +0530
> > Dipankar Sarma <[email protected]> wrote:
> >
> > > Now coming to the read-side of lock_cpu_hotplug() - cpu hotplug
> > > is a very special asynchronous event. You cannot protect against
> > > it using your own subsystem lock because you don't control
> > > access to cpu_online_map.
> >
> > Yes you do. Please, read _cpu_up(), _cpu_down() and the example in
> > workqueue_cpu_callback(). It's really very simple.
>
> What are you talking about here ?

Did you look? workqueue_mutex is used to protect per-cpu workqueue
resources. The lock is taken prior to modification of per-cpu resources
and is released after their modification. Very very simple.

> That is the write side. You are
> *not* supposed to do lock_cpu_hotplug() in cpu callbacks paths AFAICT.

The workqueue code doesn't use lock_cpu_hotplug().

> If someone does it (like cpufreq did), it is just *wrong*.
>
> > > With multiple low-level subsystems
> > > needing it, it also becomes difficult to work out the lock
> > > hierarchies.
> >
> > That'll matter if we do crappy code. Let's not do that?
>
> I am talking about readsides here - you read cpu_online_map and
> block then reuse the map and make some calls to another subsystem
> that may again do a similar read-side cpu_hotplug lock.

Two unrelated subsystems which have both independent and interdependent CPU
hotplug locking requirements and neither of which can protect per-cpu
resources via preempt_disable()? Sounds unlikely and undesirable.

> I suspect
> that it hard to get rid of all possible dependencies.
>
> > > > I rather doubt that anyone will be hitting the races in practice. I'd
> > > > recommend simply removing all the lock_cpu_hotplug() calls for 2.6.18.
> > >
> > > I don't think that is a good idea.
> >
> > The code's already racy and I don't think anyone has reported a
> > cpufreq-vs-hotplug race.
>
> Do cpu hotplugs in a one cpu and change frequencies in another -
> I think Gautham has a script to reproduce this. Besides
> lockdep apparently complains about it -
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=115359728428432&w=2
>
> > > The right thing to do would be to
> > > do an audit and clean up the bad lock_cpu_hotplug() calls.
> >
> > No, that won't fix it. For example, take a look at all the *callers* of
> > cpufreq_update_policy(). AFAICT they're all buggy. Fiddling with the
> > existing lock_cpu_hotplug() sites won't fix that. (Possibly this
> > particular problem can be fixed by checking that the relevant CPU is still
> > online after the appropriate locking has been taken - dunno).
> >
> > It needs to be ripped out and some understanding, thought and design should
> > be applied to the problem.
>
> Really, the hotplug locking rules are fairly simple-
>
> 1. If you are in cpu hotplug callback path, don't take any lock.

That rule is wrong. The CPU_UP_PREPARE and CPU_DOWN_PREPARE notification
entrypoints are the logical place for a subsystem to lock any per-cpu resources
which another thread/cpu might presently be using.

> 2. If you are in a non-hotplug path reading cpu_online_map and you don't
> block, you just disable preemption and you are safe from hotplug.

Sure.

> 3. If you are in a non-hotplug path and you use cpu_online_map and
> you *really* need to block, you use lock_cpu_hotplug() or
> cpu_hotplug_disable whatever it is called.
>
> Is this too difficult for people to follow ?

Apparently. What's happening is that lock_cpu_hotplug() is seen as some
amazing thing which will prevent an *event* from occurring.

There's an old saying "lock data, not code". What data is being locked
here? It's the subsystem's per-cpu resources which we want to lock. We
shouldn't consider the lock as being some way of preventing an event from
happening.

> >
> > > People
> > > seem to have just got lazy with lock_cpu_hotplug().
> >
> > That's because lock_cpu_hotplug() purports to be some magical thing which
> > makes all your troubles go away.
>
> No it doesn't. Perhaps we should just document the rules better
> and put some static checks for people to get it right.

Yes, we could probably fix cpufreq using the existing lock_cpu_hotplug().
But we have a quite large amount of racy-wrt-cpu-hotplug code in the kernel
and although a lot of it can be fixed with preempt_disable(), it's possible
that we'll get into scalability problems.

If we do have scalability problems, they can be fixed on a per-subsystem
basis: the affected subsystem can use per-cpu locking of its per-cpu data
within its CPU_UP_PREPARE and CPU_DOWN_PREPARE handlers. That's a local,
contained issue, and addressing it this way is better than inventing (and
debugging) some fancy new lock type.

2006-08-27 07:57:39

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, 27 Aug 2006 13:07:30 +0530
Dipankar Sarma <[email protected]> wrote:

> On Sat, Aug 26, 2006 at 11:46:18PM -0700, Andrew Morton wrote:
> > On Sun, 27 Aug 2006 11:41:55 +0530
> >
> > > The right thing to do would be to
> > > do an audit and clean up the bad lock_cpu_hotplug() calls.
> >
> > No, that won't fix it. For example, take a look at all the *callers* of
> > cpufreq_update_policy(). AFAICT they're all buggy. Fiddling with the
> > existing lock_cpu_hotplug() sites won't fix that. (Possibly this
> > particular problem can be fixed by checking that the relevant CPU is still
> > online after the appropriate locking has been taken - dunno).
> >
>
> This is a different issue from the ones that relates to lock_cpu_hotplug().

Well not really. The thinking goes "gee, we need to lock against CPU
hotplug. The CPU hotplug guys gave us lock_cpu_hotplug() so we're supposed
to use that somewhere". We see the result.

> This one seems like a cpufreq internal locking problem.
>
> On a quick look at this,

Me too. But fixing this will require a long look. What per-cpu resources
are being used? How should they be locked? Implement.

> it seems to me that cpufreq_cpu_get() should
> do exactly what you said - use a spinlock in each cpufreq_cpu_data[] to
> protect the per-cpu flag and in cpufreq_cpu_get() check if
> !data and data->online == 0. They may have to do -
>
> static struct cpufreq_data {
> spinlock_t lock;
> int flag;
> struct cpufreq_policy *policy;
> } cpufreq_cpu_data[NR_CPUS];
>

I think we're agreeing... cpufreq locking is busted, needs to be redone
and, I would suggest, fancifying lock_cpu_hotplug() isn't the way to do it.

2006-08-27 08:58:21

by Nick Piggin

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

Andrew Morton wrote:
> On Sun, 27 Aug 2006 12:41:16 +0530
> Dipankar Sarma <[email protected]> wrote:

>>Is this too difficult for people to follow ?
>
>
> Apparently. What's happening is that lock_cpu_hotplug() is seen as some
> amazing thing which will prevent an *event* from occurring.

It prevents the event from occurring as much as a lock taken in the
prepare notifier does, right? Or am I misunderstanding something?

>
> There's an old saying "lock data, not code". What data is being locked
> here? It's the subsystem's per-cpu resources which we want to lock. We
> shouldn't consider the lock as being some way of preventing an event from
> happening.

I agree. Where possible these things should be very simple either with
the per-cpu macros, or using local locking (versus one's notifier).

I think there can be some valid use of the hotplug lock when working
with cpumasks rather than individual CPUs (or if you simply want to
prevent a cpu from going down while programming hardware) and having a
new lock and new notifier is too heavyweight. Or do we want to move
away from the hotplug lock completely?

Hmm, so I don't know if I like the idea of a reentrant rwmutex for the
hotplug lock so it can be sprinkled everywhere... call it a refcount
or not, it smells slightly BKLish (looks easy, but it could be a
nightmare to audit).

--
SUSE Labs, Novell Inc.

2006-08-27 11:06:56

by Dipankar Sarma

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, Aug 27, 2006 at 12:42:13AM -0700, Andrew Morton wrote:
> On Sun, 27 Aug 2006 12:41:16 +0530
> Dipankar Sarma <[email protected]> wrote:
>
> > On Sat, Aug 26, 2006 at 11:46:18PM -0700, Andrew Morton wrote:
> > > Yes you do. Please, read _cpu_up(), _cpu_down() and the example in
> > > workqueue_cpu_callback(). It's really very simple.
> >
> > What are you talking about here ?
>
> Did you look? workqueue_mutex is used to protect per-cpu workqueue
> resources. The lock is taken prior to modification of per-cpu resources
> and is released after their modification. Very very simple.

I did, and there is no lock named workqueue_mutex. workqueue_cpu_callback()
is fairly simple and doesn't have the issues in cpufreq that
we are talking about (lock_cpu_hotplug() in the cpu callback path).

>
> > That is the write side. You are
> > *not* supposed to do lock_cpu_hotplug() in cpu callbacks paths AFAICT.
>
> The workqueue code doesn't use lock_cpu_hotplug().

And that is the right thing to do.


> > I am talking about readsides here - you read cpu_online_map and
> > block then reuse the map and make some calls to another subsystem
> > that may again do a similar read-side cpu_hotplug lock.
>
> Two unrelated subsystems which have both independent and interdependent CPU
> hotplug locking requirements and neither of which can protect per-cpu
> resources via preempt_disable()? Sounds unlikely and undesirable.

I would worry about situations where we have to use set_cpus_allowed()
with cpumasks. IIRC, those weren't trivial to handle and can happen
due to interaction between unrelated subsystems, one using the services
of the other - rtasd -> set_cpus_allowed(), for example.

> > 1. If you are in cpu hotplug callback path, don't take any lock.
>
> That rule is wrong. The CPU_UP_PREPARE and CPU_DOWN_PREPARE notification
> entrypoints are the logical place for a subsystem to lock any per-cpu resources
> which another thread/cpu might presently be using.

I meant lock_cpu_hotplug(), not any lock. Of course, subsystems
may need to use their own locks there to handle per-cpu data.


> > 2. If you are in a non-hotplug path reading cpu_online_map and you don't
> > block, you just disable preemption and you are safe from hotplug.
>
> Sure.
>
> > 3. If you are in a non-hotplug path and you use cpu_online_map and
> > you *really* need to block, you use lock_cpu_hotplug() or
> > cpu_hotplug_disable whatever it is called.
> >
> > Is this too difficult for people to follow ?
>
> Apparently. What's happening is that lock_cpu_hotplug() is seen as some
> amazing thing which will prevent an *event* from occurring.
>
> There's an old saying "lock data, not code". What data is being locked
> here? It's the subsystem's per-cpu resources which we want to lock. We
> shouldn't consider the lock as being some way of preventing an event from
> happening.

That is what I argued for earlier, but I was given some examples
where they really needed to disable the asynchronous event of
cpu hotplug - otherwise they would have needed very complex
multi-layer locking.

> > > > seem to have just got lazy with lock_cpu_hotplug().
> > >
> > > That's because lock_cpu_hotplug() purports to be some magical thing which
> > > makes all your troubles go away.
> >
> > No it doesn't. Perhaps we should just document the rules better
> > and put some static checks for people to get it right.
>
> Yes, we could probably fix cpufreq using the existing lock_cpu_hotplug().
> But we have a quite large amount of racy-wrt-cpu-hotplug code in the kernel
> and although a lot of it can be fixed with preempt_disable(), it's possible
> that we'll get into scalability problems.
>
> If we do have scalability problems, they can be fixed on a per-subsystem
> basis: the affected subsystem can use per-cpu locking of its per-cpu data
> within its CPU_UP_PREPARE and CPU_DOWN_PREPARE handlers. That's a local,
> contained issue, and addressing it this way is better than inventing (and
> debugging) some fancy new lock type.

I would suggest an audit of lock_cpu_hotplug() users to start with.

Thanks
Dipankar

2006-08-27 17:22:06

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, 27 Aug 2006 16:36:58 +0530
Dipankar Sarma <[email protected]> wrote:

> > Did you look? workqueue_mutex is used to protect per-cpu workqueue
> > resources. The lock is taken prior to modification of per-cpu resources
> > and is released after their modification. Very very simple.
>
> I did and there is no lock named workqueue_mutex. workqueue_cpu_callback()
> is farily simple and doesn't have the issues in cpufreq that
> we are talking about (lock_cpu_hotplug() in cpu callback path).

http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9b41ea7289a589993d3daabc61f999b4147872c4

2006-08-27 17:49:42

by Dipankar Sarma

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, Aug 27, 2006 at 10:21:16AM -0700, Andrew Morton wrote:
> On Sun, 27 Aug 2006 16:36:58 +0530
> Dipankar Sarma <[email protected]> wrote:
>
> > > Did you look? workqueue_mutex is used to protect per-cpu workqueue
> > > resources. The lock is taken prior to modification of per-cpu resources
> > > and is released after their modification. Very very simple.
> >
> > I did and there is no lock named workqueue_mutex. workqueue_cpu_callback()
> > is farily simple and doesn't have the issues in cpufreq that
> > we are talking about (lock_cpu_hotplug() in cpu callback path).
>
> http://www.kernel.org/git/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=9b41ea7289a589993d3daabc61f999b4147872c4

Ah, I didn't realize that it was already in git. It does take
care of create_workqueue callers; however, I don't see why this
is needed -

+ break;
+
+ case CPU_DOWN_PREPARE:
+ mutex_lock(&workqueue_mutex);
+ break;
+
+ case CPU_DOWN_FAILED:
+ mutex_unlock(&workqueue_mutex);
break;

This seems like some implicit code locking to me. Why is it not
sufficient to hold the lock in the CPU_DEAD code while walking
the workqueues?

Thanks
Dipankar

2006-08-27 18:02:56

by Andrew Morton

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Sun, 27 Aug 2006 23:19:46 +0530
Dipankar Sarma <[email protected]> wrote:

> I don't see why this
> is needed -
>
> + break;
> +
> + case CPU_DOWN_PREPARE:
> + mutex_lock(&workqueue_mutex);
> + break;
> +
> + case CPU_DOWN_FAILED:
> + mutex_unlock(&workqueue_mutex);
> break;
>
> This seems like some implicit code locking to me. Why is it not
> sufficient to hold the lock in the CPU_DEAD code while walking
> the workqueues ?

?

We need to hold workqueue_mutex to protect the per-cpu workqueue resources
while cpu_online_map is changing and while per-cpu memory is being
allocated or freed.

Look at cpu_down() and mentally replace the
blocking_notifier_call_chain(CPU_DOWN_PREPARE) with
mutex_lock(workqueue_mutex), etc. The __stop_machine_run() in there
modifies (ie: potentially frees) the workqueue code's per-cpu memory.
So we take that resource's lock while doing so.
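
Condensed, the pattern looks roughly like this (a sketch of the idea only;
the real callback also creates, starts and cleans up the per-cpu worker
threads in the corresponding branches):

    #include <linux/cpu.h>
    #include <linux/mutex.h>

    static DEFINE_MUTEX(workqueue_mutex);

    static int workqueue_cpu_callback(struct notifier_block *nfb,
                                      unsigned long action, void *hcpu)
    {
        switch (action) {
        case CPU_UP_PREPARE:
        case CPU_DOWN_PREPARE:
            /* per-cpu workqueue data is about to change under
             * __stop_machine_run(): lock it down first */
            mutex_lock(&workqueue_mutex);
            break;

        case CPU_ONLINE:
        case CPU_UP_CANCELED:
        case CPU_DOWN_FAILED:
        case CPU_DEAD:
            /* the map and the per-cpu data are consistent again */
            mutex_unlock(&workqueue_mutex);
            break;
        }
        return NOTIFY_OK;
    }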

2006-08-28 02:38:29

by Srivatsa Vaddagiri

Subject: Re: [RFC][PATCH 0/4] Redesign cpu_hotplug locking.

On Thu, Aug 24, 2006 at 09:17:04AM -0700, Andrew Morton wrote:
> > (ii) Though has been introduced recently in workqueue.c , it will only lead to
> > more deadlock scenarios since more locks would have to be acquired.
> >
> > Eg: workqueue + cpufreq(ondemand) ABBA deadlock scenario. Consider
> > - task1: echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
> > - task2: echo 0 > /sys/devices/system/cpu/cpu1/online
> > entering the system at the same time.
> >
> > task1: calls store_scaling_governor which takes lock_cpu_hotplug.
> >
> > task2: thru a blocking_notifier_call_chain(CPU_DOWN_PREPARE)
> > to workqueue subsystem holds workqueue_mutex.
> >
> > task1: calls create_workqueue from cpufreq_governor_dbs[ondemand]
> > which tries taking workqueue_mutex but can't.
> >
> > task2: thru blocking_notifier_call_chain(CPU_DOWN_PREPARE) calls
> > cpufreq_driver_target(cpufreq subsystem),which tries to take lock_cpu_hotplug
> > but cant since it is already taken by task1.
> >
> > A typical ABBA deadlock!!!
>
> That's a bug in cpufreq. It should stop using lock_cpu_hotplug() and get
> its locking sorted out.

I don't think the deadlock scenario changes if cpufreq stops using
lock_cpu_hotplug() and starts using its own internal lock like
workqueue_mutex. This is unless the lock hierarchy can be worked out in
advance (i.e. at compile time itself), which leads to the next question:

> > Replicating this persubsystem-cpu-hotplug lock model across all other
> > cpu_hotplug-sensitive subsystems would only create more such problems as
> > notifier_callback does not follow any specific ordering while notifying
> > the subsystems.
>
> The ordering is sufficiently well-defined: it's the order of the chain. I
> don't expect there to be any problems here.

How is the "order" of the chain exposed? Moreover this needs to be
exposed at compile time itself so that people can code it correctly in
the first place! This cannot be done so easily unless we use priority
for notifiers (different priorities for different callbacks) or some
linking magic, which makes it somewhat ugly to maintain.

IMHO, ensuring this hierarchy of locking order is maintained to avoid
deadlocks will be nightmarish.

I would also feel more comfortable about another aspect of this single lock
- it will make both callback processing and cpu_online_map changes
atomic, though I can't think of a specific example where having non-atomic
callback processing hurts (*shrug*, just a safety feeling derived from the
experience of fixing many cpu hotplug bugs in the past).


--
Regards,
vatsa