Subject: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

Hi all,

Looks like lockdep has resumed yelling about the cpufreq-cpu-hotplug
interactions! Again, it's the ondemand governor that's the culprit.

On linux-2.6.19-rc6-mm2, this is what I got yesterday evening.

[root@llm05 tests]# echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
[root@llm05 tests]# echo 0 > /sys/devices/system/cpu/cpu1/online

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.19-rc6-mm2 #14
-------------------------------------------------------
bash/4601 is trying to acquire lock:
(&policy->lock){--..}, at: [<c045793d>] mutex_lock+0x12/0x15

but task is already holding lock:
(cache_chain_mutex){--..}, at: [<c045793d>] mutex_lock+0x12/0x15

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (cache_chain_mutex){--..}:
[<c013bddc>] __lock_acquire+0x8ef/0x9f3
[<c013c1ec>] lock_acquire+0x68/0x82
[<c04577da>] __mutex_lock_slowpath+0xd3/0x224
[<c045793d>] mutex_lock+0x12/0x15
[<c01642f1>] cpuup_callback+0x1a3/0x2d6
[<c045a952>] notifier_call_chain+0x2b/0x5b
[<c012efb6>] __raw_notifier_call_chain+0x18/0x1d
[<c012efd5>] raw_notifier_call_chain+0x1a/0x1c
[<c013f798>] _cpu_down+0x56/0x1ef
[<c013fa5c>] cpu_down+0x26/0x3a
[<c029dde2>] store_online+0x27/0x5a
[<c029adec>] sysdev_store+0x20/0x25
[<c0197284>] sysfs_write_file+0xb6/0xde
[<c01674d4>] vfs_write+0x90/0x144
[<c0167c74>] sys_write+0x3d/0x61
[<c0103d36>] sysenter_past_esp+0x5f/0x99
[<ffffffff>] 0xffffffff

-> #2 (workqueue_mutex){--..}:
[<c013bddc>] __lock_acquire+0x8ef/0x9f3
[<c013c1ec>] lock_acquire+0x68/0x82
[<c04577da>] __mutex_lock_slowpath+0xd3/0x224
[<c045793d>] mutex_lock+0x12/0x15
[<c0131e91>] __create_workqueue+0x5b/0x11c
[<c03c4a9a>] cpufreq_governor_dbs+0xa0/0x2e8
[<c03c2cde>] __cpufreq_governor+0x64/0xac
[<c03c30d5>] __cpufreq_set_policy+0x187/0x20b
[<c03c33a1>] store_scaling_governor+0x132/0x16a
[<c03c2af9>] store+0x35/0x46
[<c0197284>] sysfs_write_file+0xb6/0xde
[<c01674d4>] vfs_write+0x90/0x144
[<c0167c74>] sys_write+0x3d/0x61
[<c0103d36>] sysenter_past_esp+0x5f/0x99
[<ffffffff>] 0xffffffff

-> #1 (dbs_mutex){--..}:
[<c013bddc>] __lock_acquire+0x8ef/0x9f3
[<c013c1ec>] lock_acquire+0x68/0x82
[<c04577da>] __mutex_lock_slowpath+0xd3/0x224
[<c045793d>] mutex_lock+0x12/0x15
[<c03c4a7e>] cpufreq_governor_dbs+0x84/0x2e8
[<c03c2cde>] __cpufreq_governor+0x64/0xac
[<c03c30d5>] __cpufreq_set_policy+0x187/0x20b
[<c03c33a1>] store_scaling_governor+0x132/0x16a
[<c03c2af9>] store+0x35/0x46
[<c0197284>] sysfs_write_file+0xb6/0xde
[<c01674d4>] vfs_write+0x90/0x144
[<c0167c74>] sys_write+0x3d/0x61
[<c0103d36>] sysenter_past_esp+0x5f/0x99
[<ffffffff>] 0xffffffff

-> #0 (&policy->lock){--..}:
[<c013bce0>] __lock_acquire+0x7f3/0x9f3
[<c013c1ec>] lock_acquire+0x68/0x82
[<c04577da>] __mutex_lock_slowpath+0xd3/0x224
[<c045793d>] mutex_lock+0x12/0x15
[<c03c2c1f>] cpufreq_driver_target+0x2b/0x51
[<c03c38db>] cpufreq_cpu_callback+0x42/0x52
[<c045a952>] notifier_call_chain+0x2b/0x5b
[<c012efb6>] __raw_notifier_call_chain+0x18/0x1d
[<c012efd5>] raw_notifier_call_chain+0x1a/0x1c
[<c013f798>] _cpu_down+0x56/0x1ef
[<c013fa5c>] cpu_down+0x26/0x3a
[<c029dde2>] store_online+0x27/0x5a
[<c029adec>] sysdev_store+0x20/0x25
[<c0197284>] sysfs_write_file+0xb6/0xde
[<c01674d4>] vfs_write+0x90/0x144
[<c0167c74>] sys_write+0x3d/0x61
[<c0103d36>] sysenter_past_esp+0x5f/0x99
[<ffffffff>] 0xffffffff

other info that might help us debug this:

4 locks held by bash/4601:
#0: (cpu_add_remove_lock){--..}, at: [<c045793d>] mutex_lock+0x12/0x15
#1: (sched_hotcpu_mutex){--..}, at: [<c045793d>] mutex_lock+0x12/0x15
#2: (workqueue_mutex){--..}, at: [<c045793d>] mutex_lock+0x12/0x15
#3: (cache_chain_mutex){--..}, at: [<c045793d>] mutex_lock+0x12/0x15

stack backtrace:
[<c0104c8f>] show_trace_log_lvl+0x19/0x2e
[<c0104d8f>] show_trace+0x12/0x14
[<c0104da5>] dump_stack+0x14/0x16
[<c013a157>] print_circular_bug_tail+0x7c/0x85
[<c013bce0>] __lock_acquire+0x7f3/0x9f3
[<c013c1ec>] lock_acquire+0x68/0x82
[<c04577da>] __mutex_lock_slowpath+0xd3/0x224
[<c045793d>] mutex_lock+0x12/0x15
[<c03c2c1f>] cpufreq_driver_target+0x2b/0x51
[<c03c38db>] cpufreq_cpu_callback+0x42/0x52
[<c045a952>] notifier_call_chain+0x2b/0x5b
[<c012efb6>] __raw_notifier_call_chain+0x18/0x1d
[<c012efd5>] raw_notifier_call_chain+0x1a/0x1c
[<c013f798>] _cpu_down+0x56/0x1ef
[<c013fa5c>] cpu_down+0x26/0x3a
[<c029dde2>] store_online+0x27/0x5a
[<c029adec>] sysdev_store+0x20/0x25
[<c0197284>] sysfs_write_file+0xb6/0xde
[<c01674d4>] vfs_write+0x90/0x144
[<c0167c74>] sys_write+0x3d/0x61
[<c0103d36>] sysenter_past_esp+0x5f/0x99
=======================
Breaking affinity for irq 24
CPU 1 is now offline

Ok, so to cut the long story short:

- While changing the governor from anything to ondemand, locks are
taken in the following order:

policy->lock ==> dbs_mutex ==> workqueue_mutex.

- While offlining a cpu, locks are taken in the following order:

cpu_add_remove_lock ==> sched_hotcpu_mutex ==> workqueue_mutex ==>
cache_chain_mutex ==> policy->lock.

The dependency graph built out of this info has a circular dependency
as pointed out by lockdep. However, I am not quite sure how seriously this
circular dependency warning should be taken.
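To see the inversion without wading through the lockdep output: it
boils down to the classic AB-BA pattern. A toy reduction (not actual
kernel code; the real chain runs through four locks, but two suffice
to show the cycle):

static DEFINE_MUTEX(workqueue_mutex);	/* kernel/workqueue.c */
static DEFINE_MUTEX(policy_lock);	/* stands in for policy->lock */

/* path 1: echo ondemand > .../scaling_governor */
void set_governor(void)
{
	mutex_lock(&policy_lock);	/* store_scaling_governor() */
	mutex_lock(&workqueue_mutex);	/* __create_workqueue() */
	mutex_unlock(&workqueue_mutex);
	mutex_unlock(&policy_lock);
}

/* path 2: echo 0 > .../cpu1/online */
void offline_cpu(void)
{
	mutex_lock(&workqueue_mutex);	/* CPU_LOCK_ACQUIRE notifier */
	mutex_lock(&policy_lock);	/* cpufreq_driver_target() on
					 * CPU_DOWN_PREPARE */
	mutex_unlock(&policy_lock);
	mutex_unlock(&workqueue_mutex);
}

If the two paths run concurrently, each can block holding the lock the
other one wants.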

One way to avoid these warnings is to take policy->lock before the
rest of the locks while offlining the cpu.
For a moment I even thought of taking/releasing policy->lock on the
CPU_LOCK_ACQUIRE/CPU_LOCK_RELEASE events in cpufreq_cpu_callback.
But that's a bad idea, since 'policy' is per-cpu data in the first place
and hence needs to be cpu-hotplug aware.

These circular-dependency warnings are emitted on
2.6.19-rc6-mm1 as well as on 2.6.19-rc6 + the hotplug_locking_rework patches.

So do we
- Rethink the strategy of per-subsystem hotcpu-locks?

OR

- Think of a way to straighten out the super-convoluted cpufreq code?

At the moment, I would suggest the latter :-)

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"


2006-11-29 21:06:19

by Andrew Morton

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Wed, 29 Nov 2006 20:54:04 +0530
Gautham R Shenoy <[email protected]> wrote:

> Ok, so to cut the long story short,
> - While changing governor from anything to
> ondemand, locks are taken in the following order
>
> policy->lock ===> dbs_mutex ===> workqueue_mutex.
>
> - While offlining a cpu, locks are taken in the following order
>
> cpu_add_remove_lock ==> sched_hotcpu_mutex ==> workqueue_mutex ==>
> cache_chain_mutex ==> policy->lock.

What functions are taking all these locks? (ie: the callpath?)

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Wed, Nov 29, 2006 at 01:05:56PM -0800, Andrew Morton wrote:
> On Wed, 29 Nov 2006 20:54:04 +0530
> Gautham R Shenoy <[email protected]> wrote:
>
> > Ok, so to cut the long story short,
> > - While changing governor from anything to
> > ondemand, locks are taken in the following order
> >
> > policy->lock ===> dbs_mutex ===> workqueue_mutex.

> >
> > - While offlining a cpu, locks are taken in the following order
> >
> > cpu_add_remove_lock ==> sched_hotcpu_mutex ==> workqueue_mutex ==>
> > cache_chain_mutex ==> policy->lock.
>
> What functions are taking all these locks? (ie: the callpath?)

While changing the cpufreq governor to ondemand, the locks taken are:
--------------------------------------------------------------------------
lock                function                 file
--------------------------------------------------------------------------
policy->lock        store_scaling_governor   drivers/cpufreq/cpufreq.c
dbs_mutex           cpufreq_governor_dbs     drivers/cpufreq/cpufreq_ondemand.c
workqueue_mutex     __create_workqueue       kernel/workqueue.c
--------------------------------------------------------------------------

The complete callpath would be

store_scaling_governor [*]
|
__cpufreq_set_policy
|
__cpufreq_governor(data, CPUFREQ_GOV_START)
|
policy->governor->governor => cpufreq_governor_dbs(data, CPUFREQ_GOV_START) [*]
|
create_workqueue #defined as __create_workqueue [*]

where [*] = locks taken.

While offlining a cpu, locks are taken in the following order:

--------------------------------------------------------------------------
lock                 function                 file
--------------------------------------------------------------------------
cpu_add_remove_lock  cpu_down                 kernel/cpu.c
sched_hotcpu_mutex   migration_call           kernel/sched.c
workqueue_mutex      workqueue_cpu_callback   kernel/workqueue.c
cache_chain_mutex    cpuup_callback           mm/slab.c
policy->lock         cpufreq_driver_target    drivers/cpufreq/cpufreq.c
--------------------------------------------------------------------------

Please note that in the above:
- sched_hotcpu_mutex, workqueue_mutex and cache_chain_mutex are taken
while handling the CPU_LOCK_ACQUIRE event in the respective subsystems'
cpu_callback functions.

- policy->lock is taken while handling CPU_DOWN_PREPARE in
cpufreq_cpu_callback, which calls cpufreq_driver_target.

It's perfectly clear that in the cpu-offline callpath, cpufreq
does not have to do anything with the workqueue.

So can we ignore this circular-dependency warning as a false positive?
Or is there a way to exploit this circular dependency?

At the moment, I cannot think of a way to exploit this circular dependency,
unless we do something like destroying the created workqueue when the
cpu is dead, i.e., making the cpufreq governors cpu-hotplug-aware.
(eeks! that doesn't look good)

I'm working on fixing this. Let me see if I can come up with something.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, Nov 30, 2006 at 09:58:07AM +0530, Gautham R Shenoy wrote:
>
> So can we ignore this circular-dep warning as a false positive?
> Or is there a way to exploit this circular dependency ?
>
> At the moment, I cannot think of way to exploit this circular dependency
> unless we do something like try destroying the created workqueue when the
> cpu is dead, i.e make the cpufreq governors cpu-hotplug-aware.
> (eeks! that doesn't look good)

Ok, I see that we are already doing it :(. So we can end up in a
deadlock.

Here's the culprit callpath:

_cpu_down()
!
!-> raw_notifier_call_chain(CPU_LOCK_ACQUIRE)
! !
! !-> workqueue_cpu_callback(CPU_LOCK_ACQUIRE) [*]
!
!-> raw_notifier_call_chain(CPU_DEAD)
!
!-> cpufreq_cpu_callback (CPU_DEAD)
!
!-> cpufreq_remove_dev
!
!-> __cpufreq_governor(data, GOVERNOR_STOP)
!
!-> policy->governor->governor()
!
!-> cpufreq_governor_dbs(GOVERNOR_STOP)
!
!-> destroy_workqueue() [*]

[*] indicates function takes workqueue_mutex.

So a deadlock!
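Reduced to its essentials (a toy sketch, not the actual kernel code),
the same task acquires workqueue_mutex twice:

static DEFINE_MUTEX(workqueue_mutex);

/* CPU_LOCK_ACQUIRE: taken once, held across the whole hotplug op */
void workqueue_lock_acquire(void)
{
	mutex_lock(&workqueue_mutex);
}

/* later, still inside _cpu_down(), via the CPU_DEAD notifier chain: */
void destroy_workqueue_sketch(void)
{
	mutex_lock(&workqueue_mutex);	/* same task, same mutex:
					 * blocks forever */
	/* ... */
	mutex_unlock(&workqueue_mutex);
}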

I wasn't able to observe this because I'm running a Xeon SMP box on which
you cannot offline cpu0. And the cpufreq data is created only for cpu0,
while all the other cpus' cpufreq_data just point to cpu0's cpufreq_data.

So the mentioned callpath within cpufreq_remove_dev is never reached
during a normal cpu-offline cycle.

However, if there are architectures which allow the first-booted cpu
(or, to be precise, the cpu for which the cpufreq_data is *actually* created)
to be offlined, and we are running the ondemand governor during the offline,
we will see this deadlock.

regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2006-11-30 08:31:52

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Gautham R Shenoy <[email protected]> wrote:

> Ok, I see that we are already doing it :(. So we can end up in a
> deadlock.
>
> Here's the culprit callpath:

in general lockdep is 100% correct when it comes to "individual locks".
The overwhelming majority of lockdep false-positives is not due to
lockdep not getting the dependencies right, but due to the "lock class"
not being correctly identified. That's not an issue here i think.

what lockdep does is it observes actual locking dependencies as they
happen individually in various contexts, and then 'completes' the
dependency graph by combining all the possible scenarios how contexts
might preempt each other. So if lockdep sees independent dependencies
and concludes that they are circular, there's nothing that saves us from
the deadlock.

The only way for those dependencies to /never/ trigger simultaneously on
different CPUs would be via the use of a further 'outer' exclusion
mechanism (i.e. a lock) - but all explicit kernel-API exclusion
mechanisms are tracked by lockdep => Q.E.D. (Open-coded exclusion might
escape the attention of lockdep, but those are extremely rare and are
also easily found.)

Ingo

2006-11-30 08:33:30

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Gautham R Shenoy <[email protected]> wrote:

> So do we
> - Rethink the strategy of per-subsystem hotcpu-locks ?
>
> OR
>
> - Think of a way to straighten out the super-convoluted cpufreq code ?

i'm still wondering what the conceptual source of this fundamental
locking complexity in cpufreq (and hotplug) is - it is not intuitive to
me at all. Could you try to explain that?

Ingo

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, Nov 30, 2006 at 09:29:34AM +0100, Ingo Molnar wrote:
>
> * Gautham R Shenoy <[email protected]> wrote:
>
> > Ok, I see that we are already doing it :(. So we can end up in a
> > deadlock.
> >
> > Here's the culprit callpath:
>
> in general lockdep is 100% correct when it comes to "individual locks".
> The overwhelming majority of lockdep false-positives is not due to
> lockdep not getting the dependencies right, but due to the "lock class"
> not being correctly identified. That's not an issue here i think.

You're right. That's not the issue.

>
> what lockdep does is it observes actual locking dependencies as they
> happen individually in various contexts, and then 'completes' the
> dependency graph by combining all the possible scenarios how contexts
> might preempt each other. So if lockdep sees independent dependencies
> and concludes that they are circular, there's nothing that saves us from
> the deadlock.
>

Ah! I get it now. I had taken neither preemption nor the SMP scenario
into account before concluding that the warning might be a false
positive.

All I need to do is run my test cases on a preemptible kernel,
or in parallel on an SMP box. It'll definitely deadlock there!

> The only way for those dependencies to /never/ trigger simultaneously on
> different CPUs would be via the use of a further 'outer' exclusion
> mechanism (i.e. a lock) - but all explicit kernel-API exclusion
> mechanisms are tracked by lockdep => Q.E.D. (Open-coded exclusion might
> escape the attention of lockdep, but those are extremely rare and are
> also easily found.)

Thanks for making it clear :-)

>
> Ingo

regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, Nov 30, 2006 at 09:31:44AM +0100, Ingo Molnar wrote:
>
> * Gautham R Shenoy <[email protected]> wrote:
>
> > So do we
> > - Rethink the strategy of per-subsystem hotcpu-locks ?
> >
> > OR
> >
> > - Think of a way to straighten out the super-convoluted cpufreq code ?
>
> i'm still wondering what the conceptual source of this fundamental
> locking complexity in cpufreq (and hotplug) is - it is not intuitive to
> me at all. Could you try to explain that?

I can try :-)

IIRC, cpufreq was written before cpu hotplug existed and thus was not
hotplug-aware to begin with.

Let me try and simplify this as far as I can.

a) cpufreq maintains its own cpumask in the variable
policy->affected_cpus, and says: if a frequency change is issued to
any one of the cpus in the affected_cpus mask, you change the frequency on
all cpus in the mask.
So this mask needs to be consistent with the cpu_online map, and hence
cpu-hotplug aware. Furthermore, we don't want cpus in this mask to go
down while we are trying to change frequencies on them. The function
which drives the frequency change in the cpufreq core is
cpufreq_driver_target, and it needs cpu-hotplug protection.

b) This mask is changed when the cpu_hotplug core sends a
CPU_DEAD/CPU_ONLINE notification through the function
cpufreq_cpu_callback.

So it's a read-write situation, where cpufreq_driver_target()
is the read side and cpufreq_cpu_callback() is the write side,
and we would want a) and b) to be mutually exclusive.

c) The problem is that just before a cpu is offlined,
we want to put it at the lowest possible frequency.
So we send out a CPU_DOWN_PREPARE event, handled by cpufreq_cpu_callback,
which calls cpufreq_driver_target.

Classic case of calling down_read from a down_write context!
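In rwsem terms it looks like this (a toy model only - cpu hotplug does
not literally use a single rwsem):

static DECLARE_RWSEM(hotplug_rwsem);

/* a) the frequency-change side, i.e. cpufreq_driver_target() */
void frequency_change(void)
{
	down_read(&hotplug_rwsem);	/* keep cpus from going away */
	/* ...change frequency on all affected_cpus... */
	up_read(&hotplug_rwsem);
}

/* b)+c) the hotplug side */
void cpu_offline(void)
{
	down_write(&hotplug_rwsem);
	/* CPU_DOWN_PREPARE: drop the dying cpu to its lowest speed */
	frequency_change();		/* down_read() inside down_write():
					 * deadlock */
	up_write(&hotplug_rwsem);
}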

To solve this, many changes have taken place since 2.6.18-rc-something.

i) Linus split the cpu_control mutex in cpu.c into cpu_add_remove_lock and
cpu_bitmask_lock. While the former serializes cpu-hotplug attempts, the
latter actually provides cpu-hotplug protection.

ii) Arjan cleaned up cpufreq to eliminate all intra-cpufreq recursive
lock_cpu_hotplug calls.

iii) Andrew replaced lock_cpu_hotplug in the workqueue code with
workqueue_mutex, which solved the recursive hotplug problem between the
ondemand governor and the workqueue code.

This is how it stands in 2.6.19.
Is it safe? Not really.
The split locking can still hit a BUG_ON, as described in the post
http://lkml.org/lkml/2006/8/24/89, because the split locking opens up a
window which allows cpufreq to try changing the frequency on stale cpus.

Besides, having so many subsystems depend on a global lock may not be
such a good idea from a scalability perspective.

I did post an rcu-based scalable version of lock_cpu_hotplug but didn't
get much response.

Which is why we tried having per-subsystem hotplug locks, where
each subsystem can define its own mutex.

The read side will be something like:

	mutex_lock(&my_hotcpu_mutex);

	/* No hotplug can happen here */

	mutex_unlock(&my_hotcpu_mutex);

Then define a callback function, my_cpu_callback, which handles the
CPU_LOCK_ACQUIRE and CPU_LOCK_RELEASE events; these are sent out just
before a cpu-hotplug operation begins and just after it completes,
respectively.

my_cpu_callback(int event)
{
	switch (event) {
	case CPU_LOCK_ACQUIRE:
		mutex_lock(&my_hotcpu_mutex);
		break;

	/*
	 * The cpu-hotplug operation happens between the two events.
	 * Handle any other pre/post hotplug events here as well.
	 */

	case CPU_LOCK_RELEASE:
		mutex_unlock(&my_hotcpu_mutex);
		break;
	}
}

This would work fine if the interactions between the cpu-hotplug-aware
code across subsystems were well ordered.

But the ondemand-governor/workqueue interactions are causing the
circular-dependency problem, and in another case a recursive-locking
problem. Sigh!

Hope this explanation helped :-)

> Ingo

regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2006-11-30 11:03:57

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Gautham R Shenoy <[email protected]> wrote:

> a) cpufreq maintains its own cpumask in the variable
> policy->affected_cpus and says : If a frequency change is issued to
> any one of the cpu's in the affected_cpus mask, you change frequency
> on all cpus in the mask. So this needs to be consistent with
> cpu_online map and hence cpu hotplug aware. Furthermore, we don't want
> cpus in this mask to go down when we are trying to change frequencies
> on them. The function which drives the frequency change in
> cpufreq-core is cpufreq_driver_target and it needs cpu-hotplug
> protection.

couldnt this complexity be radically simplified by having new kernel
infrastructure that does something like:

" 'gather' all CPUs mentioned in <mask> via scheduling a separate
helper-kthread on every CPU that <mask> specifies, disable all
interrupts, and execute function <fn> once all CPUs have been
'gathered' - and release all CPUs once <fn> has executed on each of
them."

?

This would be done totally serialized and while holding the hotplug
lock, so no CPU could go away or arrive while this operation is going
on.
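
i mean something like this - a rough sketch only, all names invented,
error handling omitted, and it assumes the calling CPU is not itself in
<mask> (or a preemptible kernel), otherwise the final wait could
livelock:

struct gather_data {
	void		(*fn)(void *info);
	void		*info;
	int		nr_cpus;
	atomic_t	arrived;
	atomic_t	done;
};

static int gather_thread(void *__gd)
{
	struct gather_data *gd = __gd;

	local_irq_disable();
	atomic_inc(&gd->arrived);
	/* rendezvous: spin until every helper kthread has arrived */
	while (atomic_read(&gd->arrived) < gd->nr_cpus)
		cpu_relax();

	gd->fn(gd->info);	/* runs with irqs off on every CPU in <mask> */

	atomic_inc(&gd->done);
	local_irq_enable();
	return 0;
}

static void gather_and_run(cpumask_t mask, void (*fn)(void *), void *info)
{
	struct gather_data gd = {
		.fn		= fn,
		.info		= info,
		.nr_cpus	= cpus_weight(mask),
		.arrived	= ATOMIC_INIT(0),
		.done		= ATOMIC_INIT(0),
	};
	struct task_struct *p;
	int cpu;

	for_each_cpu_mask(cpu, mask) {
		p = kthread_create(gather_thread, &gd, "gather/%d", cpu);
		kthread_bind(p, cpu);
		wake_up_process(p);
	}
	/* release all CPUs only once fn has executed on each of them */
	while (atomic_read(&gd.done) < gd.nr_cpus)
		cpu_relax();
}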

Ingo

2006-11-30 11:22:41

by Andrew Morton

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, 30 Nov 2006 12:03:15 +0100
Ingo Molnar <[email protected]> wrote:

>
> * Gautham R Shenoy <[email protected]> wrote:
>
> > a) cpufreq maintains its own cpumask in the variable
> > policy->affected_cpus and says : If a frequency change is issued to
> > any one of the cpu's in the affected_cpus mask, you change frequency
> > on all cpus in the mask. So this needs to be consistent with
> > cpu_online map and hence cpu hotplug aware. Furthermore, we don't want
> > cpus in this mask to go down when we are trying to change frequencies
> > on them. The function which drives the frequency change in
> > cpufreq-core is cpufreq_driver_target and it needs cpu-hotplug
> > protection.
>
> couldnt this complexity be radically simplified by having new kernel
> infrastructure that does something like:
>
> " 'gather' all CPUs mentioned in <mask> via scheduling a separate
> helper-kthread on every CPU that <mask> specifies, disable all
> interrupts, and execute function <fn> once all CPUs have been
> 'gathered' - and release all CPUs once <fn> has executed on each of
> them."
>
> ?

How does this differ from stop_machine_run(), which hot-unplug presently uses?

> This would be done totally serialized and while holding the hotplug
> lock, so no CPU could go away or arrive while this operation is going
> on.

You said "the hotplug lock". That is the problem.

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, Nov 30, 2006 at 12:03:15PM +0100, Ingo Molnar wrote:
>
> * Gautham R Shenoy <[email protected]> wrote:
>
> > a) cpufreq maintains its own cpumask in the variable
> > policy->affected_cpus and says : If a frequency change is issued to
> > any one of the cpu's in the affected_cpus mask, you change frequency
> > on all cpus in the mask. So this needs to be consistent with
> > cpu_online map and hence cpu hotplug aware. Furthermore, we don't want
> > cpus in this mask to go down when we are trying to change frequencies
> > on them. The function which drives the frequency change in
> > cpufreq-core is cpufreq_driver_target and it needs cpu-hotplug
> > protection.
>
> couldnt this complexity be radically simplified by having new kernel
> infrastructure that does something like:
>
> " 'gather' all CPUs mentioned in <mask> via scheduling a separate
> helper-kthread on every CPU that <mask> specifies, disable all
> interrupts, and execute function <fn> once all CPUs have been
> 'gathered' - and release all CPUs once <fn> has executed on each of
> them."
>
> ?

This is what is currently being done by cpufreq:

a) get_some_cpu_hotplug_protection() [use either some global mechanism
   or a per-subsystem mutex]

b) actual_freq_change_driver_function(mask)
   /* You can check out cpufreq_p4_target() in
    * arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
    */
   {
	for_each_cpu_mask(i, mask) {
		cpumask_t this_cpu = cpumask_of_cpu(i);

		set_cpus_allowed(current, this_cpu);
		function_to_change_frequency();
	}
   }

c) release_whatever_cpu_hotplug_protection()

>
> This would be done totally serialized and while holding the hotplug
> lock, so no CPU could go away or arrive while this operation is going
> on.
>

Isn't the above the same as what you are suggesting? Or have I missed
anything?

> Ingo

regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2006-11-30 11:47:00

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Andrew Morton <[email protected]> wrote:

> > This would be done totally serialized and while holding the hotplug
> > lock, so no CPU could go away or arrive while this operation is
> > going on.
>
> You said "the hotplug lock". That is the problem.

maybe i'm too dense today but i still dont see the fundamental problem.

Even with complex inter-subsystem interactions, hotplugging could be
effectively and scalably controlled via a self-recursive per-CPU mutex,
and a pointer to it embedded in task_struct:

struct task_struct {
	...
	int hotplug_depth;
	struct mutex *hotplug_lock;
};
...

DEFINE_PER_CPU(struct mutex, hotplug_lock);

void cpu_hotplug_lock(void)
{
	int cpu = raw_smp_processor_id();

	/*
	 * Interrupts/softirqs are hotplug-safe:
	 */
	if (in_interrupt())
		return;
	if (current->hotplug_depth++)
		return;
	current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
	mutex_lock(current->hotplug_lock);
}

void cpu_hotplug_unlock(void)
{
	if (in_interrupt())
		return;
	if (DEBUG_LOCKS_WARN_ON(!current->hotplug_depth))
		return;
	if (--current->hotplug_depth)
		return;

	mutex_unlock(current->hotplug_lock);
	current->hotplug_lock = NULL;
}

...

void do_exit(void)
{
	...
	DEBUG_LOCKS_WARN_ON(current->hotplug_depth);
	...
}
...
copy_process(void)
{
	...
	p->hotplug_depth = 0;
	p->hotplug_lock = NULL;
	...
}

50 lines of code at most. The only rule is to not use cpu_hotplug_lock()
in process-context non-preemptible code [interrupt contexts are
automatically ignored]. What am i missing?

Ingo

2006-11-30 11:53:37

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Gautham R Shenoy <[email protected]> wrote:

> This is what is currently being done by cpufreq:

ok!

> a) get_some_cpu_hotplug_protection() [use either some global mechanism
> or a per-subsystem mutex]

this bit is wrong i think. Any reason why it's not a per-CPU (but
otherwise global) array of mutexes that controls CPU hotplug - as per my
previous mail?

that would flatten the whole locking. Only one kind of lock taken,
recursive and scalable.

Then the mechanism that changes CPU frequency should take all these
hotplug locks on all (online) CPUs, and then first stop all processing
on all CPUs, and then do the frequency change, atomically. This is with
interrupts disabled everywhere /first/, and /without any additional
locking/. That would prevent any sort of interaction from other CPUs -
they'd all be sitting still with interrupts disabled.
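
i.e. something like this (a sketch only; it assumes the per-CPU
hotplug_lock array from my previous mail, and lockdep would want
nesting annotations for the loop, glossed over here):

void cpu_hotplug_lock_all(void)
{
	int cpu;

	/* fixed (ascending-cpu) order, so this cannot self-deadlock */
	for_each_online_cpu(cpu)
		mutex_lock(&per_cpu(hotplug_lock, cpu));
}

void cpu_hotplug_unlock_all(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		mutex_unlock(&per_cpu(hotplug_lock, cpu));
}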

Ingo

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, Nov 30, 2006 at 12:53:27PM +0100, Ingo Molnar wrote:
>
> * Gautham R Shenoy <[email protected]> wrote:
>
> > This is what is currently being done by cpufreq:
>
> ok!
>
> > a) get_some_cpu_hotplug_protection() [use either some global mechanism
> > or a per-subsystem mutex]
>
> this bit is wrong i think. Any reason why it's not a per-CPU (but
> otherwise global) array of mutexes that controls CPU hotplug - as per my
> previous mail?
>
> that would flatten the whole locking. Only one kind of lock taken,
> recursive and scalable.

I had posted one such recursive scalable version, which can be found here:
http://lkml.org/lkml/2006/10/26/73

I remember cc'ing you.

Yeah, it looks complicated and big, but then I did not want to add
another field to the task struct, as one such attempt had already been
frowned upon (I think Ashok posted one long back).

So I ended up writing the whole read/write lock/unlock code myself.

It's an rcu-based lock, extremely light on the read side but costly for the
writers, since it does a synchronize_sched.

And yeah, it's partial towards the readers, but with an additional field
in the task struct we can have a fair implementation.

Besides, an unfair cpu_hotplug_lock won't work, since a process doing
sched_getaffinity in a forever-while loop can prevent any hotplug from
happening.
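
(Concretely, the starving caller need be nothing fancier than this
userspace loop, since sched_getaffinity takes the read side on every
call:)

	#define _GNU_SOURCE
	#include <sched.h>

	int main(void)
	{
		cpu_set_t mask;

		for (;;)	/* keeps the read side busy forever */
			sched_getaffinity(0, sizeof(mask), &mask);
	}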

>
> Then the mechanism that changes CPU frequency should take all these
> hotplug locks on all (online) CPUs, and then first stop all processing
> on all CPUs, and then do the frequency change, atomically. This is with
> interrupts disabled everywhere /first/, and /without any additional
> locking/. That would prevent any sort of interaction from other CPUs -
> they'd all be sitting still with interrupts disabled.
>

Yup.

> Ingo

regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, Nov 30, 2006 at 12:46:17PM +0100, Ingo Molnar wrote:
>
> struct task_struct {
> 	...
> 	int hotplug_depth;
> 	struct mutex *hotplug_lock;
> };
> ...
>
> DEFINE_PER_CPU(struct mutex, hotplug_lock);
>
> void cpu_hotplug_lock(void)
> {
> 	int cpu = raw_smp_processor_id();
>
> 	/*
> 	 * Interrupts/softirqs are hotplug-safe:
> 	 */
> 	if (in_interrupt())
> 		return;
> 	if (current->hotplug_depth++)
> 		return;
> 	current->hotplug_lock = &per_cpu(hotplug_lock, cpu);
> 	mutex_lock(current->hotplug_lock);
> }
>
> void cpu_hotplug_unlock(void)
> {
> 	if (in_interrupt())
> 		return;
> 	if (DEBUG_LOCKS_WARN_ON(!current->hotplug_depth))
> 		return;
> 	if (--current->hotplug_depth)
> 		return;
>
> 	mutex_unlock(current->hotplug_lock);
> 	current->hotplug_lock = NULL;
> }
>

In process-context, preemptible code, let's say we are currently
running on processor i:

	cpu_hotplug_lock();   /* does mutex_lock(&per_cpu(hotplug_lock, i)) */

	/* do some operation, which might sleep */
	/* ...the task migrates to cpu j... */

	cpu_hotplug_unlock(); /* does mutex_unlock(&per_cpu(hotplug_lock, i))
			       * while running on cpu j */

This would cause cacheline ping-pong, no?

>
> Ingo

regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2006-11-30 14:35:37

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Gautham R Shenoy <[email protected]> wrote:

> In process-context, preemptible code, let's say we are currently
> running on processor i:
>
> 	cpu_hotplug_lock();   /* does mutex_lock(&per_cpu(hotplug_lock, i)) */
>
> 	/* do some operation, which might sleep */
> 	/* ...the task migrates to cpu j... */
>
> 	cpu_hotplug_unlock(); /* does mutex_unlock(&per_cpu(hotplug_lock, i))
> 			       * while running on cpu j */
>
> This would cause cacheline ping-pong, no?

that would be attached to a very cache-inefficient thing: migrating a
task from one CPU to another. This is not the kind of ping-pong we are
normally worried about. (nor does it happen all that often)

Ingo

2006-11-30 19:40:35

by Andrew Morton

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, 30 Nov 2006 12:46:17 +0100
Ingo Molnar <[email protected]> wrote:

> * Andrew Morton <[email protected]> wrote:
>
> > > This would be done totally serialized and while holding the hotplug
> > > lock, so no CPU could go away or arrive while this operation is
> > > going on.
> >
> > You said "the hotplug lock". That is the problem.
>
> maybe i'm too dense today but i still dont see the fundamental problem.
>
> Even with complex inter-subsystem interactions, hotplugging could be
> effectively and scalably controlled via a self-recursive per-CPU mutex,
> and a pointer to it embedded in task_struct:

Yes, I suppose we could come up with some new lock type which fixes the
problem. But first, let us review the problem.

Someone went through cpufreq sprinkling lock_cpu_hotplug() everywhere as if
it had magical properties. For the past year or more, others have been
picking through the resulting bugs, trying to make things better and
treating that initial magic-pixie-dust as if it were inviolate and nobody
had a chance of understanding it.

IOW, cpufreq is screwed up, and we keep on trying to come up with more
complexity to unscrew it.

So what I would propose is that rather than going ahead and piling more
complexity on top of the existing poo-pile in an attempt to make it seem to
work, we should simply rip all the cpu-hotplug locking out of cpufreq
(there's a davej patch for that in -mm) and then just redo it all in an
organised fashion.

If, as a result of that exercise, we decide that the existing cpu-hotplug
serialisation mechanisms (ie: per-subsystem notification callbacks and
preempt_disable()) are insufficient then sure, that is the time to think
about more sophisticated locking primitives.

But please, let's stop asking "how do we make cpufreq's hotplug locking
work?". Let's instead ask "how do we make cpufreq safe wrt hotplug?"

To do the latter is a matter of

- identify the per-cpu resources which need locking

- decide what mechanism is to be used to protect each such resource

- design the locking and its hierarchy

- implement, test.

In all this time, Gautham's emails from yesterday were the first occasion
on which anybody has taken the time to sit down and get us started on this
quite ordinary design & devel process.


Let me re-re-re-re-reiterate: this is a cpufreq problem. It has not yet
been demonstrated (AFAICS) that it is a general problem. I am unaware of
any other subsystems having got themselves into such a mess over hotplug
locking. Now, maybe that's because cpufreq really is especially hard. Or
maybe it's because it's just messed up. I don't believe we know which of
those is true yet.

2006-11-30 20:24:48

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Andrew Morton <[email protected]> wrote:

> > Even with complex inter-subsystem interactions, hotplugging could be
> > effectively and scalably controlled via a self-recursive per-CPU
> > mutex, and a pointer to it embedded in task_struct:

> So what I would propose is that rather than going ahead and piling
> more complexity on top of the existing poo-pile in an attempt to make
> it seem to work, we should simply rip all the cpu-hotplug locking out
> of cpufreq (there's a davej patch for that in -mm) and then just redo
> it all in an organised fashion.

actually, that's precisely what i'm suggesting too, i wrote it to
Gautham in one of the previous mails:

|| that would flatten the whole locking. Only one kind of lock taken,
|| recursive and scalable.
||
|| Then the mechanism that changes CPU frequency should take all these
|| hotplug locks on all (online) CPUs, and then first stop all
|| processing on all CPUs, and then do the frequency change, atomically.
|| This is with interrupts disabled everywhere /first/, and /without any
|| additional locking/. That would prevent any sort of interaction from
|| other CPUs - they'd all be sitting still with interrupts disabled.

no other locking, only the CPU hotplug lock and the (existing) ability
to 'do stuff' with nothing else running on any other CPU.

Ingo

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

On Thu, Nov 30, 2006 at 09:29:34AM +0100, Ingo Molnar wrote:
> what lockdep does is it observes actual locking dependencies as they
> happen individually in various contexts, and then 'completes' the
> dependency graph by combining all the possible scenarios how contexts
> might preempt each other. So if lockdep sees independent dependencies
> and concludes that they are circular, there's nothing that saves us from
> the deadlock.

Ingo,

Consider a case where we have three locks: A, B and C.
We have a very clear locking rule inside the kernel that lock A *should*
be acquired before acquiring either lock B or lock C.

At runtime, lockdep detects the two dependency chains

A --> B --> C

and

A --> C --> B.

Does lockdep issue a circular-dependency warning for this?
It's quite clear from the locking rule that we cannot have a
circular deadlock, since A acts as a mutex for the B->C / C->B callpaths.

Just curious :-) [ Well, I might encounter such a scenario in an attempt
to make cpufreq cpu-hotplug safe! ]
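
In code, the pattern I'm asking about is something like this (toy
locks, purely for illustration):

static DEFINE_MUTEX(A);
static DEFINE_MUTEX(B);
static DEFINE_MUTEX(C);

void path_one(void)
{
	mutex_lock(&A);
	mutex_lock(&B);
	mutex_lock(&C);		/* lockdep records B --> C */
	mutex_unlock(&C);
	mutex_unlock(&B);
	mutex_unlock(&A);
}

void path_two(void)
{
	mutex_lock(&A);
	mutex_lock(&C);
	mutex_lock(&B);		/* lockdep records C --> B */
	mutex_unlock(&B);
	mutex_unlock(&C);
	mutex_unlock(&A);
}

Both orders occur, but always under A, so the two paths can never
actually deadlock against each other.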

> Ingo

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2006-12-01 08:55:20

by Ingo Molnar

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency


* Gautham R Shenoy <[email protected]> wrote:

> Consider a case where we have three locks A, B and C.
> We have very clear locking rule inside the kernel that lock A *should*
> be acquired before acquiring either lock B or lock C.
>
> At runtime lockdep detects the two dependency chains,
> A --> B --> C
>
> and
>
> A --> C --> B.
>
> Does lockdep issue a circular dependency warning for this ?
> It's quite clear from the locking rule that we cannot have a
> circular deadlock, since A acts as a mutex for B->C / C->B callpath.
>
> Just curious :-) [ Well, I might encounter such a scenario in an attempt
> to make cpufreq cpu-hotplug safe! ]

yes, lockdep will warn about this. Will you /ever/ have a B->C or C->B
dependency without A being taken first?

if not then my suggestion would be to just lump 'B' and 'C' into a
single lock - or to get rid of them altogether. There's little reason to
keep them separate. /If/ you want to keep them separate (because they
protect different data structures) then it's the cleanest design to have
an ordering between them. The taking of 'A' might go away anytime - such
a construct is really, really fragile and is only asking for trouble.

Ingo

2006-12-06 18:38:00

by Pallipadi, Venkatesh

Subject: RE: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency



>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of
>Gautham R Shenoy
>Sent: Thursday, November 30, 2006 3:44 AM
>To: Ingo Molnar
>Cc: Gautham R Shenoy; [email protected];
>[email protected]; [email protected];
>[email protected]; [email protected]; [email protected]
>Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency
>
>On Thu, Nov 30, 2006 at 12:03:15PM +0100, Ingo Molnar wrote:
>>
>> * Gautham R Shenoy <[email protected]> wrote:
>>
>> > a) cpufreq maintains its own cpumask in the variable
>> > policy->affected_cpus and says: if a frequency change is issued to
>> > any one of the cpus in the affected_cpus mask, you change frequency
>> > on all cpus in the mask. So this needs to be consistent with the
>> > cpu_online map and hence cpu-hotplug aware. Furthermore, we don't want
>> > cpus in this mask to go down when we are trying to change frequencies
>> > on them. The function which drives the frequency change in the
>> > cpufreq core is cpufreq_driver_target and it needs cpu-hotplug
>> > protection.
>>
>> couldnt this complexity be radically simplified by having new kernel
>> infrastructure that does something like:
>>
>> " 'gather' all CPUs mentioned in <mask> via scheduling a separate
>> helper-kthread on every CPU that <mask> specifies, disable all
>> interrupts, and execute function <fn> once all CPUs have been
>> 'gathered' - and release all CPUs once <fn> has executed
>on each of
>> them."
>>
>> ?
>
>This is what is currently being done by cpufreq:
>
>a) get_some_cpu_hotplug_protection() [use either some global mechanism
>   or a per-subsystem mutex]
>
>b) actual_freq_change_driver_function(mask)
>   /* You can check out cpufreq_p4_target() in
>    * arch/i386/kernel/cpu/cpufreq/p4-clockmod.c
>    */
>   {
>	for_each_cpu_mask(i, mask) {
>		cpumask_t this_cpu = cpumask_of_cpu(i);
>
>		set_cpus_allowed(current, this_cpu);
>		function_to_change_frequency();
>	}
>   }
>

As there are many options being discussed here, let me propose one
more option that can eliminate the need for the hotplug lock in the
cpufreq_driver_target() path.

As Gautham clearly explained before, today cpufreq calls
cpufreq_driver_target() in each driver to change the CPU frequency,
and the driver internally uses set_cpus_allowed to reschedule onto the
different affected_cpus and change the frequency. That is the main
reason why we need to disable hotplug in this path.

But if we make cpufreq more affected_cpus-aware, have a per-cpu
target() call by moving set_cpus_allowed() from the driver into the
cpufreq core, and define the target function to be atomic/non-sleeping,
then we really don't need a hotplug lock for the driver any more. The
driver can use a get_cpu/put_cpu pair to disable preemption and then
change the frequency.

This means a lot of changes, as we need new interfaces in cpufreq and
a rewrite of a bunch of drivers. But this looks to me like the least
complicated solution.
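
To make that concrete, the interface change would look roughly like
this (a hypothetical sketch - no such target_cpu() callback exists
today, and the names are made up):

/* hypothetical new driver callback: called already bound to @cpu,
 * preemption disabled; must not sleep */
struct cpufreq_driver {
	/* ... existing members ... */
	int (*target_cpu)(struct cpufreq_policy *policy, unsigned int cpu,
			  unsigned int target_freq, unsigned int relation);
};

/* the cpufreq core becomes the only place doing set_cpus_allowed() games */
static int cpufreq_target_on_cpu(struct cpufreq_policy *policy,
				 unsigned int cpu, unsigned int target_freq,
				 unsigned int relation)
{
	int ret;

	set_cpus_allowed(current, cpumask_of_cpu(cpu));
	if (get_cpu() != cpu) {
		/* the cpu went offline under us; nothing can be done */
		put_cpu();
		return 0;
	}
	ret = cpufreq_driver->target_cpu(policy, cpu, target_freq, relation);
	put_cpu();
	return ret;
}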

Thanks,
Venki

Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency

Hi Venki,
On Wed, Dec 06, 2006 at 10:27:01AM -0800, Pallipadi, Venkatesh wrote:

> But, if we make cpufreq more affected_cpus aware and have a per-cpu
> target() call by moving set_cpus_allowed() from the driver into the
> cpufreq core and define the target function to be atomic/non-sleeping
> type, then we really don't need a hotplug lock for the driver any
> more. The driver can have a get_cpu/put_cpu pair to disable preemption
> and then change the frequency.

Well, we would still need to keep the affected_cpus map in sync with the
cpu_online map. That would still require hotplug protection, right?

Besides, I would love to see a way of implementing the target function as
an atomic/non-sleeping type. But as of now, the target functions call
cpufreq_notify_transition, which might sleep.

That's not the last of my worries. The ondemand/workqueue interaction
in the cpu-hotplug callback path can cause a deadlock if we go for
per-subsystem hotcpu mutexes. Can you think of a way by which we can
avoid destroying the kondemand workqueue from the cpu-hotplug callback
path? Please see http://lkml.org/lkml/2006/11/30/9 for the
culprit callpath.

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2006-12-07 12:50:50

by Pallipadi, Venkatesh

Subject: RE: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency



>-----Original Message-----
>From: [email protected]
>[mailto:[email protected]] On Behalf Of
>Gautham R Shenoy
>Sent: Wednesday, December 06, 2006 11:07 PM
>To: Pallipadi, Venkatesh
>Cc: [email protected]; Ingo Molnar; [email protected];
>[email protected]; [email protected];
>[email protected]; [email protected]; [email protected]
>Subject: Re: CPUFREQ-CPUHOTPLUG: Possible circular locking dependency
>
>Hi Venki,
>On Wed, Dec 06, 2006 at 10:27:01AM -0800, Pallipadi, Venkatesh wrote:
>
>> But, if we make cpufreq more affected_cpus aware and have a per-cpu
>> target() call by moving set_cpus_allowed() from the driver into the
>> cpufreq core and define the target function to be atomic/non-sleeping
>> type, then we really don't need a hotplug lock for the driver any
>> more. The driver can have a get_cpu/put_cpu pair to disable preemption
>> and then change the frequency.
>
>Well, we would still need to keep the affected_cpus map in sync with the
>cpu_online map. That would still require hotplug protection, right?

Not really. Cpufreq can look at all the CPUs in affected_cpus and call
new_target() only for the CPUs that are online. Or it can call it for
every CPU, and the actual implementation in the driver can do something
like:

	set_cpus_allowed(current, requested_processor_mask);
	if (get_cpu() != requested_cpu) {
		/* CPU offline and nothing can be done */
		put_cpu();
		return 0;
	}

This is what I did for the new cpufreq interface I added for getavg().
It was easy to ensure the atomic driver call, as only one driver is
using it today :-)


>Besides, I would love to see a way of implementing the target function
>as an atomic/non-sleeping type. But as of now, the target functions
>call cpufreq_notify_transition, which might sleep.

That is the reason I don't have a patch for this now.. :-) There are
more than 10 or so drivers that need to change for the new interface.
I haven't checked whether it is possible to do this without sleeping in
all those drivers.


>That's not the last of my worries. The ondemand/workqueue interaction
>in the cpu-hotplug callback path can cause a deadlock if we go for
>per-subsystem hotcpu mutexes. Can you think of a way by which we can
>avoid destroying the kondemand workqueue from the cpu-hotplug callback
>path? Please see http://lkml.org/lkml/2006/11/30/9 for the
>culprit callpath.
>

Yes, I was thinking about it. One possible solution is to move the
create_workqueue()/destroy_workqueue() calls, as in the attached patch.

Thanks,
Venki

Not fully tested at the moment.

Remove the workqueue create/destroy calls from governor start and stop,
and move them to module init and exit time.

Signed-off-by: Venkatesh Pallipadi <[email protected]>

Index: linux-2.6.19-rc-mm/drivers/cpufreq/cpufreq_ondemand.c
===================================================================
--- linux-2.6.19-rc-mm.orig/drivers/cpufreq/cpufreq_ondemand.c
+++ linux-2.6.19-rc-mm/drivers/cpufreq/cpufreq_ondemand.c
@@ -514,7 +514,6 @@ static inline void dbs_timer_exit(struct
 {
 	dbs_info->enable = 0;
 	cancel_delayed_work(&dbs_info->work);
-	flush_workqueue(kondemand_wq);
 }
 
 static int cpufreq_governor_dbs(struct cpufreq_policy *policy,
@@ -543,16 +542,6 @@ static int cpufreq_governor_dbs(struct c
 
 		mutex_lock(&dbs_mutex);
 		dbs_enable++;
-		if (dbs_enable == 1) {
-			kondemand_wq = create_workqueue("kondemand");
-			if (!kondemand_wq) {
-				printk(KERN_ERR
-				       "Creation of kondemand failed\n");
-				dbs_enable--;
-				mutex_unlock(&dbs_mutex);
-				return -ENOSPC;
-			}
-		}
 
 		rc = sysfs_create_group(&policy->kobj, &dbs_attr_group);
 		if (rc) {
@@ -601,8 +590,6 @@ static int cpufreq_governor_dbs(struct c
 		dbs_timer_exit(this_dbs_info);
 		sysfs_remove_group(&policy->kobj, &dbs_attr_group);
 		dbs_enable--;
-		if (dbs_enable == 0)
-			destroy_workqueue(kondemand_wq);
 
 		mutex_unlock(&dbs_mutex);
 
@@ -632,12 +619,18 @@ static struct cpufreq_governor cpufreq_g
 
 static int __init cpufreq_gov_dbs_init(void)
 {
+	kondemand_wq = create_workqueue("kondemand");
+	if (!kondemand_wq) {
+		printk(KERN_ERR "Creation of kondemand failed\n");
+		return -EFAULT;
+	}
 	return cpufreq_register_governor(&cpufreq_gov_dbs);
 }
 
 static void __exit cpufreq_gov_dbs_exit(void)
 {
 	cpufreq_unregister_governor(&cpufreq_gov_dbs);
+	destroy_workqueue(kondemand_wq);
 }


Attachments:
ondemand_recursive_locking_fix.patch (1.79 kB)