2009-12-20 10:04:17

by Sachin Sant

Subject: CPU offline/online related hang with latest git

With 2.6.33-rc1-git1 (dd59f6c..) I am having trouble with
CPU hotplug on multiple architectures. Trying to offline
a CPU results in a machine hang.

Sample output from a 4-way x86_64 box:

#:/sys/devices/system/cpu/cpu3 # cat online
1
#:/sys/devices/system/cpu/cpu3 # echo 0 > online
Stack:
Call Trace:
<IRQ>
<EOI>
Code: 45 f0 48 89 45 b8 48 8d 45 d0 4c 89 4d f8 c7 45 b0 10 00 00 00 48 89 45 c0 e8 5a ff ff ff c9 c3 89 f0 b9 40 00 00 00 55 99 f7 f9 <48> 89 e5 48 89 f9 48 83 ec 08 31 d2 41 89 c0 eb 12 48 8b 01 48
Stack:
Call Trace:
Code: 20 00 00 00 00 48 89 3e 49 8d 74 24 28 e8 45 75 1d 00 49 c7 44 24 50 00 00 00 00 48 8b 9b 60 01 00 00 48 85 db 0f 85 25 ff ff ff <5b> 41 5c c9 c3 55 48 89 e5 41 57 41 56 41 55 41 54 53 48 81 ec
Stack:
Call Trace:
Code: 48 0f 45 d8 eb 1a 44 0f a3 20 19 c0 48 c7 c3 10 b5 a0 81 85 c0 48 c7 c0 30 b5 a0 81 48 0f 44 d8 31 c0 f3 90 44 8b 25 82 35 96 00 <41> 39 c4 74 46 41 83 fc 02 74 08 41 83 fc 03 75 12 eb 03 fa eb
Stack:
Call Trace:
Code: 48 c7 c0 30 b5 a0 81 48 0f 45 d8 eb 1a 44 0f a3 20 19 c0 48 c7 c3 10 b5 a0 81 85 c0 48 c7 c0 30 b5 a0 81 48 0f 44 d8 31 c0 f3 90 <44> 8b 25 82 35 96 00 41 39 c4 74 46 41 83 fc 02 74 08 41 83 fc
Uhhuh. NMI received for unknown reason 25 on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue

The above messages are repeated after this. I observed similar hangs
on other architectures as well (s390x, PowerPC, x86_32).
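
For anyone who wants to script the reproduction step above rather than type it at a shell, here is a minimal C sketch. It assumes the usual sysfs layout shown in the transcript (`/sys/devices/system/cpu/cpuN/online`); the helper names `read_cpu_online` and `write_cpu_online` are made up for this illustration and are not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a sysfs-style "online" file.
 * Returns 0 or 1 on success, -1 if the file is unreadable or malformed.
 */
static int read_cpu_online(const char *path)
{
    FILE *f;
    int state = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &state) != 1)
        state = -1;
    fclose(f);
    return (state == 0 || state == 1) ? state : -1;
}

/*
 * Write 0 (offline) or 1 (online) to a sysfs-style "online" file.
 * Returns 0 on success, -1 on failure.
 */
static int write_cpu_online(const char *path, int online)
{
    FILE *f;
    int ok;

    assert(path != NULL);
    f = fopen(path, "w");
    if (!f)
        return -1;
    ok = fprintf(f, "%d\n", online ? 1 : 0) > 0;
    /* The buffered write only reaches the file on flush, so check fclose() too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Run as root, `write_cpu_online("/sys/devices/system/cpu/cpu3/online", 0)` would attempt the same offline operation as the `echo 0 > online` above. On an affected kernel the call would presumably not return, since the machine hangs.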

2.6.33-rc1 worked fine. I haven't bisected yet; I will do that
first thing tomorrow morning.

Thanks
-Sachin

--

---------------------------------
Sachin Sant
IBM Linux Technology Center
India Systems and Technology Labs
Bangalore, India
---------------------------------


2009-12-20 13:32:16

by Peter Zijlstra

Subject: Re: CPU offline/online related hang with latest git

On Sun, 2009-12-20 at 15:34 +0530, Sachin Sant wrote:
> With 2.6.33-rc1-git1 (dd59f6c..) i am having trouble with
> CPU hotplug on multiple architectures. Trying to offline
> a CPU results in a machine hang.

damnit, you're right.. it's getting stuck during hotplug on stop machine
some place.

didn't notice it because it didn't crash..

sorry for that :-(

2009-12-20 14:51:54

by Jens Axboe

Subject: Re: CPU offline/online related hang with latest git

On Sun, Dec 20 2009, Peter Zijlstra wrote:
> On Sun, 2009-12-20 at 15:34 +0530, Sachin Sant wrote:
> > With 2.6.33-rc1-git1 (dd59f6c..) i am having trouble with
> > CPU hotplug on multiple architectures. Trying to offline
> > a CPU results in a machine hang.
>
> damnit, you're right.. it's getting stuck during hotplug on stop machine
> some place.
>
> didn't notice it because it didn't crash..
>
> sorry for that :-(

Perhaps this is also why suspend doesn't work on current -git... -rc1 is
fine.

--
Jens Axboe

2009-12-20 14:55:33

by Peter Zijlstra

Subject: Re: CPU offline/online related hang with latest git

On Sun, 2009-12-20 at 15:51 +0100, Jens Axboe wrote:
> On Sun, Dec 20 2009, Peter Zijlstra wrote:
> > On Sun, 2009-12-20 at 15:34 +0530, Sachin Sant wrote:
> > > With 2.6.33-rc1-git1 (dd59f6c..) i am having trouble with
> > > CPU hotplug on multiple architectures. Trying to offline
> > > a CPU results in a machine hang.
> >
> > damnit, you're right.. it's getting stuck during hotplug on stop machine
> > some place.
> >
> > didn't notice it because it didn't crash..
> >
> > sorry for that :-(
>
> Perhaps this is also why suspend doesn't work on current -git... -rc1 is
> fine.

Yep, suspend relies on hotplug.

2009-12-20 14:57:18

by Jens Axboe

Subject: Re: CPU offline/online related hang with latest git

On Sun, Dec 20 2009, Peter Zijlstra wrote:
> On Sun, 2009-12-20 at 15:51 +0100, Jens Axboe wrote:
> > On Sun, Dec 20 2009, Peter Zijlstra wrote:
> > > On Sun, 2009-12-20 at 15:34 +0530, Sachin Sant wrote:
> > > > With 2.6.33-rc1-git1 (dd59f6c..) i am having trouble with
> > > > CPU hotplug on multiple architectures. Trying to offline
> > > > a CPU results in a machine hang.
> > >
> > > damnit, you're right.. it's getting stuck during hotplug on stop machine
> > > some place.
> > >
> > > didn't notice it because it didn't crash..
> > >
> > > sorry for that :-(
> >
> > Perhaps this is also why suspend doesn't work on current -git... -rc1 is
> > fine.
>
> Yep, suspend relies on hotplug.

Exactly, hence the connection. Shall I bisect, or do you already know
what the problem is?

--
Jens Axboe

2009-12-20 15:01:26

by Peter Zijlstra

Subject: Re: CPU offline/online related hang with latest git

On Sun, 2009-12-20 at 15:57 +0100, Jens Axboe wrote:

> Exactly, hence the connection. Shall I bisect, or do you already know
> what the problem is?

I have a definite suspect alright, let me prod at this a bit more.

2009-12-20 16:37:11

by Peter Zijlstra

Subject: [PATCH] sched: Fix hotplug

The hot-unplug kstopmachine usage does a wakeup after deactivating the
cpu, hence we cannot use cpu_active() here but must rely on the good
olde online.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 720df10..0ac4fa5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2348,7 +2348,7 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 	 * not worry about this generic constraint ]
 	 */
 	if (unlikely(!cpumask_test_cpu(cpu, &p->cpus_allowed) ||
-		     !cpu_active(cpu)))
+		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
 	return cpu;
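
To illustrate why the choice of mask matters, here is a small userspace model in C. It is only a sketch under stated assumptions: `struct cpu_masks`, `mask_test`, `fallback_cpu` and `pick_cpu` are invented for this example, the masks are plain bitfields standing in for the kernel's online and active CPU masks, and the fallback is far simpler than the kernel's `select_fallback_rq()`. The model assumes that a CPU being unplugged is cleared from the active mask first while it is still online, and that a thread pinned to that CPU is woken in between.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Toy stand-ins for the kernel's online and active CPU masks. */
struct cpu_masks {
    unsigned int online;
    unsigned int active;
};

static bool mask_test(unsigned int mask, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return (mask >> cpu) & 1u;
}

/*
 * Hypothetical fallback: the lowest-numbered CPU that is both allowed
 * and active, or -1 if there is none. The real kernel does more here.
 */
static int fallback_cpu(const struct cpu_masks *m, unsigned int allowed)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (mask_test(allowed, cpu) && mask_test(m->active, cpu))
            return cpu;
    return -1;
}

/*
 * Model of the check touched by the patch: keep the chosen CPU only if
 * the task is allowed there and the CPU passes the active test (old
 * behaviour) or the online test (patched behaviour).
 */
static int pick_cpu(const struct cpu_masks *m, unsigned int allowed,
                    int cpu, bool require_active)
{
    unsigned int usable = require_active ? m->active : m->online;

    if (!mask_test(allowed, cpu) || !mask_test(usable, cpu))
        return fallback_cpu(m, allowed);
    return cpu;
}
```

With `online = 0xF` and `active = 0x7` (CPU 3 deactivated but not yet offline), a thread allowed only on CPU 3 is sent to the fallback path when `require_active` is true, but stays on CPU 3 when the online mask is tested instead. That matches the situation the patch description names: a wakeup that happens after the CPU has been deactivated.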

2009-12-20 22:14:12

by Jens Axboe

Subject: Re: [PATCH] sched: Fix hotplug

On Sun, Dec 20 2009, Peter Zijlstra wrote:
> The hot-unplug kstopmachine usage does a wakeup after deactivating the
> cpu, hence we cannot use cpu_active() here but must rely on the good
> olde online.

Yep, this works for me!

Tested-by: Jens Axboe <[email protected]>

>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/sched.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 720df10..0ac4fa5 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2348,7 +2348,7 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
>  	 * not worry about this generic constraint ]
>  	 */
>  	if (unlikely(!cpumask_test_cpu(cpu, &p->cpus_allowed) ||
> -		     !cpu_active(cpu)))
> +		     !cpu_online(cpu)))
>  		cpu = select_fallback_rq(task_cpu(p), p);
>
>  	return cpu;
>
>

--
Jens Axboe

2009-12-20 22:34:47

by Peter Zijlstra

Subject: [tip:sched/urgent] sched: Fix hotplug hang

Commit-ID: 70f1120527797adb31c68bdc6f1b45e182c342c7
Gitweb: http://git.kernel.org/tip/70f1120527797adb31c68bdc6f1b45e182c342c7
Author: Peter Zijlstra <[email protected]>
AuthorDate: Sun, 20 Dec 2009 17:36:27 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 20 Dec 2009 23:31:23 +0100

sched: Fix hotplug hang

The hot-unplug kstopmachine usage does a wakeup after
deactivating the cpu, hence we cannot use cpu_active()
here but must rely on the good olde online.

Reported-by: Sachin Sant <[email protected]>
Reported-by: Jens Axboe <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Tested-by: Jens Axboe <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Benjamin Herrenschmidt <[email protected]>
LKML-Reference: <1261326987.4314.24.camel@laptop>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7ffde2a..87f1f47 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2346,7 +2346,7 @@ int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
 	 * not worry about this generic constraint ]
 	 */
 	if (unlikely(!cpumask_test_cpu(cpu, &p->cpus_allowed) ||
-		     !cpu_active(cpu)))
+		     !cpu_online(cpu)))
 		cpu = select_fallback_rq(task_cpu(p), p);
 
 	return cpu;