2010-01-29 04:43:09

by Benjamin Herrenschmidt

Subject: [2.6.33-rc5] Weird deadlock when shutting down

Hi Ingo !

Johannes and I see this on our quad G5s... it -could- be similar to
one reported a short while ago by Xiaotian Feng <[email protected]>
under the subject [2.6.33-rc4] sysfs lockdep warnings on cpu hotplug.

Basically, the machine deadlocks right after printing the following
when doing a shutdown:

halt/4071 is trying to acquire lock:
(s_active){++++.+}, at: [<c0000000001ef868>] .sysfs_addrm_finish+0x58/0xc0

but task is already holding lock:
(&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>] .lock_policy_rwsem_write+0x84/0xf4

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

<nothing else ... machine deadlocked here>
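
For readers unfamiliar with the "which lock already depends on the new lock" wording, the check can be sketched in userspace C. This is a toy model and not the kernel's lockdep: the names lock_graph, record_acquire and reaches are hypothetical, and the assumed scenario (one path takes the policy rwsem while holding s_active, the shutdown path does the reverse) is an illustration of the report above, not a confirmed trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_LOCKS 8

/* dep[a][b] is true when lock b was once acquired while lock a was held. */
struct lock_graph {
	bool dep[MAX_LOCKS][MAX_LOCKS];
};

static void graph_init(struct lock_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Depth-first search: is 'to' reachable from 'from' via recorded edges? */
static bool reaches(const struct lock_graph *g, int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int next = 0; next < MAX_LOCKS; next++)
		if (g->dep[from][next] && !seen[next] &&
		    reaches(g, next, to, seen))
			return true;
	return false;
}

/*
 * Record "acquired new_lock while holding held".
 * Returns false, and records nothing, when new_lock already leads back
 * to held, i.e. when the new edge would close a cycle.
 */
static bool record_acquire(struct lock_graph *g, int held, int new_lock)
{
	bool seen[MAX_LOCKS] = { false };

	assert(held >= 0 && held < MAX_LOCKS);
	assert(new_lock >= 0 && new_lock < MAX_LOCKS);

	if (reaches(g, new_lock, held, seen))
		return false;
	g->dep[held][new_lock] = true;
	return true;
}
```

With lock 0 standing for s_active and lock 1 for the policy rwsem, recording (0, 1) succeeds and a later (1, 0) is rejected, which is the situation the warning describes.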

Any idea ?

Cheers,
Ben.


2010-02-18 09:36:30

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Fri, 2010-01-29 at 15:41 +1100, Benjamin Herrenschmidt wrote:
> Hi Ingo !
>
> Johannes and I see this on our quad G5s... it -could- be similar to
> one reported a short while ago by Xiaotian Feng <[email protected]>
> under the subject [2.6.33-rc4] sysfs lockdep warnings on cpu hotplug.
>
> Basically, the machine deadlocks right after printing the following
> when doing a shutdown:
>
> halt/4071 is trying to acquire lock:
> (s_active){++++.+}, at: [<c0000000001ef868>]
> .sysfs_addrm_finish+0x58/0xc0
>
> but task is already holding lock:
> (&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>]
> .lock_policy_rwsem_write+0x84/0xf4
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:

This is still happening with -rc8. Any news?

johannes

2010-02-18 16:32:31

by Linus Torvalds

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down



On Thu, 18 Feb 2010, Johannes Berg wrote:
> >
> > Basically, the machine deadlocks right after printing the following
> > when doing a shutdown:
> >
> > halt/4071 is trying to acquire lock:
> > (s_active){++++.+}, at: [<c0000000001ef868>]
> > .sysfs_addrm_finish+0x58/0xc0
> >
> > but task is already holding lock:
> > (&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>]
> > .lock_policy_rwsem_write+0x84/0xf4
> >
> > which lock already depends on the new lock.

You don't have a full backtrace for these things?

We've had lots of trouble with the cpu governors, and I suspect the
problem isn't new, but the lockdep warning is likely new (see commit
846f99749ab68bbc7f75c74fec305de675b1a1bf: "sysfs: Add lockdep annotations
for the sysfs active reference").

So it is likely to be an old issue that (a) now gets warned about and (b)
might have had timing changes enough to trigger it.

I suspect it is G5-specific (or specific to whatever CPU frequency code
that gets used there), since I think we'd have had lots of reports if this
happened on x86.

Linus

2010-02-18 18:45:52

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Thu, 2010-02-18 at 08:31 -0800, Linus Torvalds wrote:

> > > halt/4071 is trying to acquire lock:
> > > (s_active){++++.+}, at: [<c0000000001ef868>]
> > > .sysfs_addrm_finish+0x58/0xc0
> > >
> > > but task is already holding lock:
> > > (&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>]
> > > .lock_policy_rwsem_write+0x84/0xf4
> > >
> > > which lock already depends on the new lock.
>
> You don't have a full backtrace for these things?

No, it deadlocks right there, unfortunately.

> We've had lots of trouble with the cpu governors, and I suspect the
> problem isn't new, but the lockdep warning is likely new (see commit
> 846f99749ab68bbc7f75c74fec305de675b1a1bf: "sysfs: Add lockdep annotations
> for the sysfs active reference").
>
> So it is likely to be an old issue that (a) now gets warned about and
> (b) might have had timing changes enough to trigger it.

Well, it used to not deadlock and actually shut down the machine :) So
in that sense it's definitely new. It might have printed a lockdep
warning before, which you wouldn't normally see since the machine turns
off right after this.

> I suspect it is G5-specific (or specific to whatever CPU frequency code
> that gets used there), since I think we'd have had lots of reports if this
> happened on x86.

Yeah, that's puzzling me as well.

johannes



2010-02-20 04:46:51

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Fri, Feb 19, 2010 at 12:31 AM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Thu, 18 Feb 2010, Johannes Berg wrote:
>> >
>> > Basically, the machine deadlocks right after printing the following
>> > when doing a shutdown:
>> >
>> > halt/4071 is trying to acquire lock:
>> >  (s_active){++++.+}, at: [<c0000000001ef868>]
>> > .sysfs_addrm_finish+0x58/0xc0
>> >
>> > but task is already holding lock:
>> >  (&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>]
>> > .lock_policy_rwsem_write+0x84/0xf4
>> >
>> > which lock already depends on the new lock.
>
> You don't have a full backtrace for these things?
>
> We've had lots of trouble with the cpu governors, and I suspect the
> problem isn't new, but the lockdep warning is likely new (see commit
> 846f99749ab68bbc7f75c74fec305de675b1a1bf: "sysfs: Add lockdep annotations
> for the sysfs active reference").
>
> So it is likely to be an old issue that (a) now gets warned about and (b)
> might have had timing changes enough to trigger it.
>

Right.

This is a real deadlock, found by the lockdep annotations added to s_active.

The problem is that we did kobject_put(&data->kobj) while holding policy_rwsem,
which is used to protect 'data'. It is not so easy to fix this; we probably
need to do more work on the cpufreq code.
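
The ordering problem described above can be sketched in plain C. This is a userspace model under stated assumptions, not the cpufreq code and not the attached patch: policy_model, model_put, remove_holding_lock and remove_after_unlock are hypothetical names, the rwsem is reduced to a flag, and the final put simply records whether the lock was held when teardown ran.

```c
#include <assert.h>
#include <stdbool.h>

struct policy_model {
	bool rwsem_held;           /* models policy_rwsem held for write */
	int  refcount;             /* models the kobject reference count */
	bool released;             /* teardown has run */
	bool released_under_rwsem; /* teardown ran with the rwsem held */
};

static void model_init(struct policy_model *p)
{
	p->rwsem_held = false;
	p->refcount = 1;
	p->released = false;
	p->released_under_rwsem = false;
}

/* Drop one reference; the last put runs teardown in the caller's context. */
static void model_put(struct policy_model *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0) {
		p->released = true;
		p->released_under_rwsem = p->rwsem_held;
	}
}

/* Problematic ordering: the final put happens while the lock is held. */
static void remove_holding_lock(struct policy_model *p)
{
	p->rwsem_held = true;
	model_put(p);
	p->rwsem_held = false;
}

/* Alternative ordering: finish the locked work, unlock, then put. */
static void remove_after_unlock(struct policy_model *p)
{
	p->rwsem_held = true;
	/* ... detach the policy from shared state while locked ... */
	p->rwsem_held = false;
	model_put(p);
}
```

In the model, remove_holding_lock() leaves released_under_rwsem set and remove_after_unlock() leaves it clear. Whether the second ordering is safe in the real driver depends on what else policy_rwsem protects, which is the "more work" mentioned above.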

Thanks.

2010-02-20 07:13:53

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Fri, Feb 19, 2010 at 2:45 AM, Johannes Berg
<[email protected]> wrote:
> On Thu, 2010-02-18 at 08:31 -0800, Linus Torvalds wrote:
>
>> > > halt/4071 is trying to acquire lock:
>> > >  (s_active){++++.+}, at: [<c0000000001ef868>]
>> > > .sysfs_addrm_finish+0x58/0xc0
>> > >
>> > > but task is already holding lock:
>> > >  (&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>]
>> > > .lock_policy_rwsem_write+0x84/0xf4
>> > >
>> > > which lock already depends on the new lock.
>>
>> You don't have a full backtrace for these things?
>
> No, it deadlocks right there, unfortunately.
>
>> We've had lots of trouble with the cpu governors, and I suspect the
>> problem isn't new, but the lockdep warning is likely new (see commit
>> 846f99749ab68bbc7f75c74fec305de675b1a1bf: "sysfs: Add lockdep annotations
>> for the sysfs active reference").
>>
>> So it is likely to be an old issue that (a) now gets warned about and
>> (b) might have had timing changes enough to trigger it.
>
> Well, it used to not deadlock and actually shut down the machine :) So
> in that sense it's definitely new. It might have printed a lockdep
> warning before, which you wouldn't normally see since the machine turns
> off right after this.
>
>> I suspect it is G5-specific (or specific to whatever CPU frequency code
>> that gets used there), since I think we'd have had lots of reports if this
>> happened on x86.
>
> Yeah, that's puzzling me as well.
>

Does my following untested patch help?

Signed-off-by: WANG Cong <[email protected]>

---------


Attachments:
drivers-cpufreq-fix-deadlock.diff (1.20 kB)

2010-02-20 07:42:58

by Dave Young

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Fri, Feb 19, 2010 at 2:45 AM, Johannes Berg
<[email protected]> wrote:
> On Thu, 2010-02-18 at 08:31 -0800, Linus Torvalds wrote:
>
>> > > halt/4071 is trying to acquire lock:
>> > >  (s_active){++++.+}, at: [<c0000000001ef868>]
>> > > .sysfs_addrm_finish+0x58/0xc0
>> > >
>> > > but task is already holding lock:
>> > >  (&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>]
>> > > .lock_policy_rwsem_write+0x84/0xf4
>> > >
>> > > which lock already depends on the new lock.
>>
>> You don't have a full backtrace for these things?
>
> No, it deadlocks right there, unfortunately.
>
>> We've had lots of trouble with the cpu governors, and I suspect the
>> problem isn't new, but the lockdep warning is likely new (see commit
>> 846f99749ab68bbc7f75c74fec305de675b1a1bf: "sysfs: Add lockdep annotations
>> for the sysfs active reference").
>>
>> So it is likely to be an old issue that (a) now gets warned about and
>> (b) might have had timing changes enough to trigger it.
>
> Well, it used to not deadlock and actually shut down the machine :) So
> in that sense it's definitely new. It might have printed a lockdep
> warning before, which you wouldn't normally see since the machine turns
> off right after this.

Before shutdown, you can run:
echo N > /proc/sys/kernel/printk_delay
to delay each printk message so it stays readable; N is 0-10000, in milliseconds.

>
>> I suspect it is G5-specific (or specific to whatever CPU frequency code
>> that gets used there), since I think we'd have had lots of reports if this
>> happened on x86.
>
> Yeah, that's puzzling me as well.
>
> johannes
>



--
Regards
dave

2010-02-20 08:45:46

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 12:46 +0800, Américo Wang wrote:

> This is a real deadlock case found by lockdep added to s_active.
>
> The problem is that we did kobject_put(&data->kobj) while holding policy_rwsem
> which is used to protect 'data'. It is not so easy to fix this,
> probably we need to do more work on cpufreq code.

But it doesn't make sense that it's just an existing real deadlock that
is now found -- it never occurred previously!

Anyway, I'll try your patch, thanks for that.

johannes



2010-02-20 08:46:59

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 15:42 +0800, Dave Young wrote:

> > Well, it used to not deadlock and actually shut down the machine :) So
> > in that sense it's definitely new. It might have printed a lockdep
> > warning before, which you wouldn't normally see since the machine turns
> > off right after this.
>
> before shutdown, you can:
> echo N > /proc/sys/kernel/printk_delay
> to see the printk messages, N is 0-10000 in milliseconds

Well if I understand Américo correctly then it won't have printed
anything (even if it were to deadlock) before adding the lockdep
annotations to s_active, so I guess that theory is out.

johannes



2010-02-20 08:57:10

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 15:13 +0800, Américo Wang wrote:

> Does my following untested patch help?

Sorry, no. I'll hook up a screen to the box after I return from the
fresh market.

johannes



2010-02-20 09:06:44

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, Feb 20, 2010 at 4:56 PM, Johannes Berg
<[email protected]> wrote:
> On Sat, 2010-02-20 at 15:13 +0800, Américo Wang wrote:
>
>> Does my following untested patch help?
>
> Sorry, no. I'll hook up a screen to the box after I return from the
> fresh market.
>

Are you sure there is no difference? :-/

Also, could you please also apply the 4 patches from Eric?

You can get them here:
http://lkml.org/lkml/2010/2/11/334

Thanks much for your testing!

2010-02-20 09:31:11

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Thu, Feb 18, 2010 at 5:36 PM, Johannes Berg
<[email protected]> wrote:
> On Fri, 2010-01-29 at 15:41 +1100, Benjamin Herrenschmidt wrote:
>> Hi Ingo !
>>
>> Johannes and I see this on our quad G5s... it -could- be similar to
>> one reported a short while ago by Xiaotian Feng <[email protected]>
>> under the subject [2.6.33-rc4] sysfs lockdep warnings on cpu hotplug.
>>
>> Basically, the machine deadlocks right after printing the following
>> when doing a shutdown:
>>
>> halt/4071 is trying to acquire lock:
>>  (s_active){++++.+}, at: [<c0000000001ef868>]
>> .sysfs_addrm_finish+0x58/0xc0
>>
>> but task is already holding lock:
>>  (&per_cpu(cpu_policy_rwsem, cpu)){+.+.+.}, at: [<c0000000004cd6ac>]
>> .lock_policy_rwsem_write+0x84/0xf4
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>
> This is still happening with -rc8. Any news?
>

Hey, johannes

Not sure if you made a mistake here: the one you reported here [1]
is _not_ the same as the one reported by Benjamin.

Please make sure what you are talking about here is the same one.

Thanks.

1. http://lkml.org/lkml/2010/2/18/33

2010-02-20 10:52:47

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 17:30 +0800, Américo Wang wrote:

> Hey, johannes
>
> Not sure if you made some mistake here, the one you report here [1]
> is _not_ the same with this one reported by Benjamin.
>
> Please make sure what you are talking about here is the same one.
>
> Thanks.
>
> 1. http://lkml.org/lkml/2010/2/18/33

I'm talking about the problem Ben reported -- that one is completely
different. Was your patch supposed to address _that_ one?

johannes



2010-02-20 10:53:38

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 17:06 +0800, Américo Wang wrote:

> >> Does my following untested patch help?
> >
> > Sorry, no. I'll hook up a screen to the box after I return from the
> > fresh market.

> Are you sure there is no difference? :-/

No ... could be a different deadlock now :) Not sure how likely that is
though.

> Also, could you please also apply the 4 patches from Eric?
>
> You can get them here:
> http://lkml.org/lkml/2010/2/11/334

Will do.

johannes



2010-02-20 11:29:00

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 17:06 +0800, Américo Wang wrote:
> On Sat, Feb 20, 2010 at 4:56 PM, Johannes Berg
> <[email protected]> wrote:
> > On Sat, 2010-02-20 at 15:13 +0800, Américo Wang wrote:
> >
> >> Does my following untested patch help?
> >
> > Sorry, no. I'll hook up a screen to the box after I return from the
> > fresh market.
> >
>
> Are you sure there is no difference? :-/

It deadlocks after

Disabling non-boot CPUs ...

Not sure if that counts as a difference...

johannes



2010-02-20 12:07:42

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 12:28 +0100, Johannes Berg wrote:

> It deadlocks after
>
> Disabling non-boot CPUs ...

I suspect the BUG: key not in data! thing I get now disables lockdep (it
seems to be mostly due to module loading btw) and then I don't get any
output here.

Seems it's all busted.

johannes



2010-02-20 13:42:11

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, Feb 20, 2010 at 12:28:51PM +0100, Johannes Berg wrote:
>On Sat, 2010-02-20 at 17:06 +0800, Américo Wang wrote:
>> On Sat, Feb 20, 2010 at 4:56 PM, Johannes Berg
>> <[email protected]> wrote:
>> > On Sat, 2010-02-20 at 15:13 +0800, Américo Wang wrote:
>> >
>> >> Does my following untested patch help?
>> >
>> > Sorry, no. I'll hook up a screen to the box after I return from the
>> > fresh market.
>> >
>>
>> Are you sure there is no difference? :-/
>
>It deadlocks after
>
>Disabling non-boot CPUs ...
>
>Not sure if that counts as a difference...
>

I am not sure either...

That message is displayed before shutting down the devices.

To verify, you can add some printk() in the end of
__cpufreq_remove_dev(), or enable CONFIG_CPU_FREQ_DEBUG.

Thanks!

2010-02-20 13:58:00

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, 2010-02-20 at 21:44 +0800, Américo Wang wrote:

> That message is displayed before shutting down the devices.
>
> To verify, you can add some printk() in the end of
> __cpufreq_remove_dev(), or enable CONFIG_CPU_FREQ_DEBUG.

That is already enabled.

johannes

2010-02-21 09:51:17

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sat, Feb 20, 2010 at 9:57 PM, Johannes Berg
<[email protected]> wrote:
> On Sat, 2010-02-20 at 21:44 +0800, Américo Wang wrote:
>
>> That message is displayed before shutting down the devices.
>>
>> To verify, you can add some printk() in the end of
>> __cpufreq_remove_dev(), or enable CONFIG_CPU_FREQ_DEBUG.
>
> That is already enabled.
>

Ok, I got it.

Could you test the patch below? Thanks!

----------------->

Signed-off-by: WANG Cong <[email protected]>


Attachments:
drivers-cpufreq-fix-deadlock.diff (2.03 kB)

2010-02-21 10:17:56

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, 2010-02-21 at 17:51 +0800, Américo Wang wrote:
> On Sat, Feb 20, 2010 at 9:57 PM, Johannes Berg
> <[email protected]> wrote:
> > On Sat, 2010-02-20 at 21:44 +0800, Américo Wang wrote:
> >
> >> That message is displayed before shutting down the devices.
> >>
> >> To verify, you can add some printk() in the end of
> >> __cpufreq_remove_dev(), or enable CONFIG_CPU_FREQ_DEBUG.
> >
> > That is already enabled.
> >
>
> Ok, I got it.
>
> Could you test the patch below? Thanks!

No change, sorry, still hangs right after "Disabling non-boot CPUs ..."
before the machine turns off.

johannes



2010-02-21 10:43:57

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

Incidentally, the machine also freezes hard without any output at all if
I "echo 0 > /sys/.../cpu1/online".

johannes



2010-02-21 10:55:54

by Xiaotian Feng

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, Feb 21, 2010 at 6:43 PM, Johannes Berg
<[email protected]> wrote:
> Incidentally, the machine also freezes hard without any output at all if
> I "echo 0 > /sys/.../cpu1/online".

It might have nothing to do with cpufreq. I think something is going
wrong in the _cpu_down path.
Put more debug printks into _cpu_down(); if we can find where in
_cpu_down the kernel is stuck, that would be helpful.

>
> johannes
>

2010-02-21 11:12:49

by Xiaotian Feng

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, Feb 21, 2010 at 6:55 PM, Xiaotian Feng <[email protected]> wrote:
> On Sun, Feb 21, 2010 at 6:43 PM, Johannes Berg
> <[email protected]> wrote:
>> Incidentally, the machine also freezes hard without any output at all if
>> I "echo 0 > /sys/.../cpu1/online".
>
> It might be nothing related with cpufreq. I think there's something
> wrong during the _cpu_down path.
> put more debug printks into _cpu_down(), if we can find kernel is
> stuck in which place in _cpu_down, it would be helpful.

and it looks like this breakage is only seen on powerMac G5, so it
might be arch specific. Maybe some commit in powermac breaks G5's
cpu_down, just a guess ;-)

>
>>
>> johannes
>>
>

2010-02-21 11:14:51

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, 2010-02-21 at 19:12 +0800, Xiaotian Feng wrote:
> On Sun, Feb 21, 2010 at 6:55 PM, Xiaotian Feng <[email protected]>
> wrote:
> > On Sun, Feb 21, 2010 at 6:43 PM, Johannes Berg
> > <[email protected]> wrote:
> >> Incidentally, the machine also freezes hard without any output at all if
> >> I "echo 0 > /sys/.../cpu1/online".
> >
> > It might be nothing related with cpufreq. I think there's something
> > wrong during the _cpu_down path.
> > put more debug printks into _cpu_down(), if we can find kernel is
> > stuck in which place in _cpu_down, it would be helpful.
>
> and it looks like this breakage is only seen on powerMac G5, so it
> might be arch specific. Maybe some commit in powermac breaks G5's
> cpu_down, just a guess ;-)

Hmm, not sure ... it seems to be in __stop_machine(), in this code:

        printk("got cpu\n");
        for_each_online_cpu(i) {
                sm_work = per_cpu_ptr(stop_machine_work, i);
                INIT_WORK(sm_work, stop_cpu);
                queue_work_on(i, stop_machine_wq, sm_work);
        }
        /* This will release the thread on our CPU. */
        put_cpu();
        printk("put cpu\n");


which is weird... the "got cpu" printk is the last thing I see.

johannes



2010-02-21 11:22:47

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, 2010-02-21 at 12:14 +0100, Johannes Berg wrote:

>         printk("got cpu\n");
>         for_each_online_cpu(i) {
>                 sm_work = per_cpu_ptr(stop_machine_work, i);
>                 INIT_WORK(sm_work, stop_cpu);
>                 queue_work_on(i, stop_machine_wq, sm_work);
>         }
>         /* This will release the thread on our CPU. */
>         put_cpu();
>         printk("put cpu\n");

As odd as that may be, it hangs in put_cpu() here.

johannes

2010-02-22 08:19:37

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, Feb 21, 2010 at 6:17 PM, Johannes Berg
<[email protected]> wrote:
> On Sun, 2010-02-21 at 17:51 +0800, Américo Wang wrote:
>> On Sat, Feb 20, 2010 at 9:57 PM, Johannes Berg
>> <[email protected]> wrote:
>> > On Sat, 2010-02-20 at 21:44 +0800, Américo Wang wrote:
>> >
>> >> That message is displayed before shutting down the devices.
>> >>
>> >> To verify, you can add some printk() in the end of
>> >> __cpufreq_remove_dev(), or enable CONFIG_CPU_FREQ_DEBUG.
>> >
>> > That is already enabled.
>> >
>>
>> Ok, I got it.
>>
>> Could you test the patch below? Thanks!
>
> No change, sorry, still hangs right after "Disabling non-boot CPUs ..."
> before the machine turns off.
>

Oh, I see, then this must be another problem.

My previous patch was meant to fix the cpufreq lockdep warning mentioned
in Benjamin's report, so this hang should be caused by some other problem,
not the cpufreq lockdep problem.

Thanks.

2010-02-22 08:23:01

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, Feb 21, 2010 at 7:12 PM, Xiaotian Feng <[email protected]> wrote:
> On Sun, Feb 21, 2010 at 6:55 PM, Xiaotian Feng <[email protected]> wrote:
>> On Sun, Feb 21, 2010 at 6:43 PM, Johannes Berg
>> <[email protected]> wrote:
>>> Incidentally, the machine also freezes hard without any output at all if
>>> I "echo 0 > /sys/.../cpu1/online".
>>
>> It might be nothing related with cpufreq. I think there's something
>> wrong during the _cpu_down path.
>> put more debug printks into _cpu_down(), if we can find kernel is
>> stuck in which place in _cpu_down, it would be helpful.
>
> and it looks like this breakage is only seen on powerMac G5, so it
> might be arch specific. Maybe some commit in powermac breaks G5's
> cpu_down, just a guess ;-)
>

Maybe not, there could be some generic code bug that will only be
exposed on a specific arch.

Thanks.

2010-02-22 08:34:42

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Sun, Feb 21, 2010 at 7:22 PM, Johannes Berg
<[email protected]> wrote:
> On Sun, 2010-02-21 at 12:14 +0100, Johannes Berg wrote:
>
>>         printk("got cpu\n");
>>         for_each_online_cpu(i) {
>>                 sm_work = per_cpu_ptr(stop_machine_work, i);
>>                 INIT_WORK(sm_work, stop_cpu);
>>                 queue_work_on(i, stop_machine_wq, sm_work);
>>         }
>>         /* This will release the thread on our CPU. */
>>         put_cpu();
>>         printk("put cpu\n");
>
> As odd as that may be, it hangs in put_cpu() here.
>

Hmm, does adding synchronize_sched() in _cpu_down() help?

Something like this:

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 677f253..681f5c5 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -228,6 +228,7 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
 	cpumask_copy(old_allowed, &current->cpus_allowed);
 	set_cpus_allowed_ptr(current, cpu_active_mask);
 
+	synchronize_sched();
 	err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
 	if (err) {
 		set_cpu_active(cpu, true);

2010-02-22 08:40:15

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Mon, 2010-02-22 at 16:19 +0800, Américo Wang wrote:

> >> Could you test the patch below? Thanks!
> >
> > No change, sorry, still hangs right after "Disabling non-boot CPUs ..."
> > before the machine turns off.
> >
>
> Oh, I see, then this will be another problem.
>
> My previous patch is to fix the cpufreq lockdep warning mentioned
> in Benjamin's report, so this hang should be caused by other problem,
> not the cpufreq lockdep problem.

Right, sounds like -- and I haven't seen that lockdep report during
shutdown any more.

johannes



2010-02-22 09:05:01

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Mon, 2010-02-22 at 16:34 +0800, Américo Wang wrote:
> On Sun, Feb 21, 2010 at 7:22 PM, Johannes Berg
> <[email protected]> wrote:
> > On Sun, 2010-02-21 at 12:14 +0100, Johannes Berg wrote:
> >
> >>         printk("got cpu\n");
> >>         for_each_online_cpu(i) {
> >>                 sm_work = per_cpu_ptr(stop_machine_work, i);
> >>                 INIT_WORK(sm_work, stop_cpu);
> >>                 queue_work_on(i, stop_machine_wq, sm_work);
> >>         }
> >>         /* This will release the thread on our CPU. */
> >>         put_cpu();
> >>         printk("put cpu\n");
> >
> > As odd as that may be, it hangs in put_cpu() here.
> >
>
> Hmm, does adding synchronize_sched() in _cpu_down() help?

No luck.

johannes

2010-02-22 09:12:15

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Mon, Feb 22, 2010 at 5:04 PM, Johannes Berg
<[email protected]> wrote:
> On Mon, 2010-02-22 at 16:34 +0800, Américo Wang wrote:
>> On Sun, Feb 21, 2010 at 7:22 PM, Johannes Berg
>> <[email protected]> wrote:
>> > On Sun, 2010-02-21 at 12:14 +0100, Johannes Berg wrote:
>> >
>> >>         printk("got cpu\n");
>> >>         for_each_online_cpu(i) {
>> >>                 sm_work = per_cpu_ptr(stop_machine_work, i);
>> >>                 INIT_WORK(sm_work, stop_cpu);
>> >>                 queue_work_on(i, stop_machine_wq, sm_work);
>> >>         }
>> >>         /* This will release the thread on our CPU. */
>> >>         put_cpu();
>> >>         printk("put cpu\n");
>> >
>> > As odd as that may be, it hangs in put_cpu() here.
>> >
>>
>> Hmm, does adding synchronize_sched() in _cpu_down() help?
>
> No luck.
>

Ok, thanks.

Since it hangs in put_cpu(), which is just preempt_enable(), I began to
suspect that we need a synchronize_sched(), or perhaps some barrier.
I am not sure at all.
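
One possible reading of the "got cpu" observation, going by the comment in the quoted snippet ("This will release the thread on our CPU"): work queued for the current CPU cannot run until preemption is re-enabled, so it may run before put_cpu() returns, and a hang in that work would look like a hang in put_cpu(). The toy model below illustrates only that ordering. It is a userspace sketch and an assumption, not the kernel's implementation: cpu_model and the model_* functions are hypothetical, and the separate worker thread is collapsed into a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct cpu_model {
	int preempt_count;                   /* > 0: preemption disabled */
	void (*pending)(struct cpu_model *); /* work queued for this CPU */
	char trace[16];                      /* event order, one char each */
	size_t trace_len;
};

static void model_init(struct cpu_model *c)
{
	memset(c, 0, sizeof(*c));
}

/* Append one event character, keeping the trace NUL-terminated. */
static void trace_event(struct cpu_model *c, char ev)
{
	if (c->trace_len + 1 < sizeof(c->trace)) {
		c->trace[c->trace_len++] = ev;
		c->trace[c->trace_len] = '\0';
	}
}

static void model_get_cpu(struct cpu_model *c)
{
	c->preempt_count++;
}

/* Re-enable preemption; pending work runs here, before we return. */
static void model_put_cpu(struct cpu_model *c)
{
	assert(c->preempt_count > 0);
	if (--c->preempt_count == 0 && c->pending) {
		void (*fn)(struct cpu_model *) = c->pending;

		c->pending = NULL;
		fn(c);
	}
}

static void stop_work(struct cpu_model *c)
{
	trace_event(c, 'W'); /* a hang here would hide the 'P' event */
}

static void model_stop_machine(struct cpu_model *c)
{
	model_get_cpu(c);
	trace_event(c, 'G'); /* corresponds to "got cpu" */
	c->pending = stop_work;
	model_put_cpu(c);
	trace_event(c, 'P'); /* corresponds to "put cpu" */
}
```

After model_stop_machine() the trace reads G, W, P: the queued work runs between the two messages. If that reading applies here, the place to look would be the queued stop_cpu work rather than put_cpu() itself.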

Before other experts look at this, I think doing a bisect would be very
useful.

Again, thanks for your testing!

2010-02-22 09:14:27

by Johannes Berg

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Mon, 2010-02-22 at 17:12 +0800, Américo Wang wrote:

> Since it hangs in put_cpu() which is just preempt_enable(), so I began
> to suspect if we need a synchronize_sched(), or some barrier perhaps.
> I am not sure at all.

Right.

> Before other experts look at this, I think doing a bisect would be
> very useful.

I was afraid you'd say that, it'll take forever though since I need to
walk over to it after every shutdown, see if it turned off and then turn
it on again (and possibly off).... I guess I'll get started on that.

johannes

2010-02-22 09:21:13

by Cong Wang

Subject: Re: [2.6.33-rc5] Weird deadlock when shutting down

On Mon, Feb 22, 2010 at 5:14 PM, Johannes Berg
<[email protected]> wrote:
> On Mon, 2010-02-22 at 17:12 +0800, Américo Wang wrote:
>> Before other experts look at this, I think doing a bisect would be
>> very useful.
>
> I was afraid you'd say that, it'll take forever though since I need to
> walk over to it after every shutdown, see if it turned off and then turn
> it on again (and possibly off).... I guess I'll get started on that.
>

No problem.

Feel free to wait for experts like Linus to take this up, and leave it as it is. ;)

Thanks.