2004-04-18 17:05:31

by Srivatsa Vaddagiri

Subject: CPU Hotplug broken -mm5 onwards

Hi,
I found that I can't boot with CONFIG_HOTPLUG_CPU defined in both
-mm5 and -mm6. Debugging this revealed that the exec path can now require the
cpu hotplug sem (sched_migrate_task), and this has led to a deadlock between
flush_workqueue and __call_usermodehelper.

flush_workqueue takes the cpu hotplug sem and blocks until the workqueue is
flushed. __call_usermodehelper, one of the queued work functions, blocks
because it also needs the cpu hotplug sem during exec. As a result, exec does
not make progress and the system does not boot.
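
In call-graph form, the two sides of the deadlock look roughly like this
(purely illustrative sketch; side_a()/side_b() are made-up names, not the
actual kernel functions):

void side_a(void)                       /* flush_workqueue() path */
{
        lock_cpu_hotplug();             /* takes the cpu hotplug sem */
        /* ... block until every queued work function, including
         * __call_usermodehelper, has completed ... */
        unlock_cpu_hotplug();
}

void side_b(void)                       /* one of the queued work functions */
{
        /* __call_usermodehelper -> exec -> sched_migrate_task: */
        lock_cpu_hotplug();             /* blocks forever: side_a holds the
                                         * sem and won't release it until
                                         * this work function completes */
        /* never reached */
        unlock_cpu_hotplug();
}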

I feel we can fix this by converting cpucontrol to a reader-writer semaphore
or a big-reader lock(?). One problem with a reader-writer semaphore is that
there does not seem to be any down_write_interruptible, which cpu_down/cpu_up
need.
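
Roughly what I have in mind (untested sketch; the real cpucontrol and
lock_cpu_hotplug()/unlock_cpu_hotplug() live in kernel/cpu.c and would need
more care than this):

static DECLARE_RWSEM(cpucontrol);

/* readers: paths like sched_migrate_task() that only need the cpu map
 * to stay stable while they run */
void lock_cpu_hotplug(void)
{
        down_read(&cpucontrol);
}

void unlock_cpu_hotplug(void)
{
        up_read(&cpucontrol);
}

/* writers: cpu_up()/cpu_down() would use down_write(&cpucontrol) /
 * up_write(&cpucontrol), but there is no down_write_interruptible()
 * to replace the down_interruptible() they do today. */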

Comments?

BTW, I think a cpu_is_offline check is needed in sched_migrate_task, since
dest_cpu could have been brought down by the time the semaphore is acquired.
In that case, we could end up adding the task to a dead cpu's runqueue.
An alternative would be to put the same check in __migrate_task.
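
For the __migrate_task variant, I am thinking of nothing more than an early
bail-out at the top of the function, something like (sketch):

        /* dest_cpu may have gone offline after it was chosen;
         * silently drop the request rather than touch a dead runqueue */
        if (unlikely(cpu_is_offline(dest_cpu)))
                return;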



--


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017


2004-04-19 03:34:23

by Nick Piggin

Subject: Re: CPU Hotplug broken -mm5 onwards

Srivatsa Vaddagiri wrote:
> Hi,
> I found that I can't boot with CONFIG_HOTPLUG_CPU defined in both
> mm5 and mm6. Debugging this revealed it to be because exec path can now require
> cpu hotplug sem (sched_migrate_task) and this has lead to a deadlock between
> flush_workqueue and __call_usermodehelper.
>
> flush_workqueue takes the cpu hotplug sem and blocks until the workqueue is
> flushed. __call_usermodehelper, one of the queued work functions, blocks
> because it also needs the cpu hotplug sem during exec. As a result, exec does
> not make progress and the system does not boot.
>
> I feel we can fix this by converting cpucontrol to a reader-writer semaphore
> or a big-reader lock(?). One problem with a reader-writer semaphore is that
> there does not seem to be any down_write_interruptible, which cpu_down/cpu_up
> need.
>
> Comments?
>

You are right, but it wasn't introduced by the -mm or sched-domains
patches. However, one of Ingo's recent patches does balance on
exec for SMP, not just NUMA, so it will make this more common.

So, Rusty has to fix it ;)

I think a rwsem might be a good idea anyway, because
sched_migrate_task can end up being called pretty often with
balance on exec and balance on clone. The semaphore could easily
place undue serialisation on that path.

> BTW, I think a cpu_is_offline check is needed in sched_migrate_task, since
> dest_cpu could have been brought down by the time the semaphore is acquired.
> In that case, we could end up adding the task to a dead cpu's runqueue.
> An alternative would be to put the same check in __migrate_task.
>

Yes you are correct.

Can we arrange some of these checks to disappear when HOTPLUG_CPU
is not set? For example, make cpu_is_offline only valid to call for
CPUs that have been online sometime, and can evaluate to 0 if
HOTPLUG_CPU is not set?
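
Something along these lines is what I have in mind (hypothetical sketch
only, not what include/linux/cpu.h actually contains):

#ifdef CONFIG_HOTPLUG_CPU
#define cpu_is_offline(cpu)     unlikely(!cpu_online(cpu))
#else
/* cpus never go away, so the check compiles down to nothing */
#define cpu_is_offline(cpu)     0
#endif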

2004-04-19 12:57:59

by Srivatsa Vaddagiri

Subject: Re: [lhcs-devel] Re: CPU Hotplug broken -mm5 onwards

On Mon, Apr 19, 2004 at 01:34:14PM +1000, Nick Piggin wrote:
> I think a rwsem might be a good idea anyway, because
> sched_migrate_task can end up being called pretty often with
> balance on exec and balance on clone. The semaphore could easily
> place undue serialisation on that path.

I found that an r/w sem does not help here; it can still lead to deadlocks.
One example I hit is:

cpu_up takes the write lock and sends out the CPU_UP_PREPARE notification.
As part of it, many of the notified subsystems do kthread_create, which uses
a workqueue. The work function is never processed because keventd is blocked
on a previous work function, waiting for the hotplug sem in the exec path.

So, as Rusty said, I think we really need to consider removing
lock_cpu_hotplug from sched_migrate_task. AFAICS that lock was needed to
prevent adding tasks to dead cpus. The same can be accomplished by dropping
the lock there and adding a cpu_is_offline check in __migrate_task instead.
This will eliminate all the deadlocks I have been hitting.


> Can we arrange some of these checks to disappear when HOTPLUG_CPU
> is not set? For example, make cpu_is_offline only valid to call for
> CPUs that have been online sometime, and can evaluate to 0 if
> HOTPLUG_CPU is not set?

I think this is already being done in include/linux/cpu.h



--


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2004-04-19 22:55:29

by Nick Piggin

Subject: Re: [lhcs-devel] Re: CPU Hotplug broken -mm5 onwards

Srivatsa Vaddagiri wrote:
> On Mon, Apr 19, 2004 at 01:34:14PM +1000, Nick Piggin wrote:
>
>>I think a rwsem might be a good idea anyway, because
>>sched_migrate_task can end up being called pretty often with
>>balance on exec and balance on clone. The semaphore could easily
>>place undue serialisation on that path.
>
>
> I found that an r/w sem does not help here; it can still lead to deadlocks.
> One example I hit is:
>
> cpu_up takes the write lock and sends out the CPU_UP_PREPARE notification.
> As part of it, many of the notified subsystems do kthread_create, which uses
> a workqueue. The work function is never processed because keventd is blocked
> on a previous work function, waiting for the hotplug sem in the exec path.
>
> So, as Rusty said, I think we really need to consider removing
> lock_cpu_hotplug from sched_migrate_task. AFAICS that lock was needed to
> prevent adding tasks to dead cpus. The same can be accomplished by dropping
> the lock there and adding a cpu_is_offline check in __migrate_task instead.
> This will eliminate all the deadlocks I have been hitting.
>

Yes this would be a better idea. Care to send Andrew a patch
against -mm?

>
>
>>Can we arrange some of these checks to disappear when HOTPLUG_CPU
>>is not set? For example, make cpu_is_offline only valid to call for
>>CPUs that have been online sometime, and can evaluate to 0 if
>>HOTPLUG_CPU is not set?
>
>
> I think this is already being done in include/linux/cpu.h
>

Yes I see. I didn't realise the first one was under an ifdef :P
Sorry.

2004-04-19 23:08:42

by Rusty Russell

Subject: Re: [lhcs-devel] Re: CPU Hotplug broken -mm5 onwards

On Tue, 2004-04-20 at 08:55, Nick Piggin wrote:
> > So, as Rusty said, I think we really need to consider removing
> > lock_cpu_hotplug from sched_migrate_task. AFAICS that lock was needed to
> > prevent adding tasks to dead cpus. The same can be accomplished by dropping
> > the lock there and adding a cpu_is_offline check in __migrate_task instead.
> > This will eliminate all the deadlocks I have been hitting.
> >
>
> Yes this would be a better idea. Care to send Andrew a patch
> against -mm?

What surprises me is that this is a regression. The original hotplug
code on top of Nicksched(TM) removed that lock as part of the "don't
change cpus_allowed to migrate on exec" fix. When we pushed the patch
straight into Linus' tree, we had to do lock_cpu_hotplug because we
didn't have that code.

Obviously, it escaped somewhere...

Cheers,
Rusty.
--
Anyone who quotes me in their signature is an idiot -- Rusty Russell

2004-04-21 16:55:39

by Srivatsa Vaddagiri

Subject: Re: CPU Hotplug broken -mm5 onwards

On Wed, Apr 21, 2004 at 03:29:39PM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Apr 21, 2004 at 02:36:50AM -0700, Andrew Morton wrote:
> >
> > Not sure who to aim this at, but: poke.
>
> I found that removing the lock in the exec path has other consequences,
> which I am fixing/testing. I will be sending an updated patch later today
> (against 2.6.6-rc1-mm1) to fix all those issues.

I found that lockless migration of an execing task is _extremely_ racy.
The races I hit are described below, along with probable solutions.

Task migrations done elsewhere should be safe (?) since they either
hold the lock (sys_sched_setaffinity) or are done entirely with preemption
disabled (load_balance).


sched_balance_exec does:

a. disables preemption
b. finds new_cpu for current
c. enables preemption
d. calls sched_migrate_task to migrate current to new_cpu

and sched_migrate_task does:

e. task_rq_lock(p)
f. migrate_task(p, dest_cpu ..)
(if we have to wait for migration thread)
g. task_rq_unlock()
h. wake_up_process(rq->migration_thread)
i. wait_for_completion()


Several things can happen here:

1. new_cpu can go down after h and before the migration thread has
gotten around to handling the request

===> We need to add a cpu_is_offline check in __migrate_task

2. new_cpu can go down between c and d, or before f.

===> Even though this case is automatically handled by the above
change (migrate_task, called on a running task like current,
will delegate migration to the migration thread), it would be
good practice to avoid calling migrate_task in the first place
when dest_cpu is offline. This means adding another
cpu_is_offline check after e in sched_migrate_task.

3. The 'current' task can get preempted _immediately_ after
g, and by the time it runs again, task_cpu(p) can be dead. In
that case, it is invalid to do a wake_up on a non-existent migration
thread (rq->migration_thread can be NULL).

===> We should disable preemption across g and h

4. The cpu the migration thread runs on can go dead before the thread
gets around to handling the request. This leaves unhandled migration
requests queued on the dead cpu.

===> We need to wake up the sleeping requestors (if any) in the
CPU_DEAD notification.

I really wonder if we can get rid of these issues by avoiding balancing at
exec time altogether and letting load_balance take care of it instead.
Alternatively, if exec-time balancing is valuable and we want to retain it,
I think we still need to consider a read/write sem, with sched_migrate_task
doing down_read_trylock. That may eliminate the deadlock I hit between
cpu_up and the CPU_UP_PREPARE notification, which is what had forced me away
from the r/w sem.
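
If we go the trylock route, sched_migrate_task could do something like this
(untested sketch, assuming cpucontrol becomes an rwsem; on contention we
simply skip the exec-time balance and let the task stay where it is):

        /* don't sleep on the hotplug lock in the exec path; if a cpu is
         * being brought up or down right now, just skip balancing */
        if (!down_read_trylock(&cpucontrol))
                return;

        /* ... pick dest_cpu and migrate as before ... */

        up_read(&cpucontrol);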

Anyway, the patch below addresses the above races. It's against 2.6.6-rc2-mm1
and has been tested on a 4-way Intel Pentium SMP machine. I have added a
cpu_is_offline check in migration_thread(); if the cpu is offline, the
migration thread stops processing requests and just waits for kthread_stop.

Let me know what you think!



---

linux-2.6.6-rc2-mm1-root/kernel/sched.c | 40 ++++++++++++++++++++++++++++----
1 files changed, 36 insertions(+), 4 deletions(-)

diff -puN kernel/sched.c~remove_hotplug_lock_in_sched_migrate_task kernel/sched.c
--- linux-2.6.6-rc2-mm1/kernel/sched.c~remove_hotplug_lock_in_sched_migrate_task 2004-04-20 20:42:03.000000000 +0530
+++ linux-2.6.6-rc2-mm1-root/kernel/sched.c 2004-04-20 22:22:16.527816944 +0530
@@ -1420,16 +1420,18 @@ static void sched_migrate_task(task_t *p
runqueue_t *rq;
unsigned long flags;

- lock_cpu_hotplug();
rq = task_rq_lock(p, &flags);
- if (!cpu_isset(dest_cpu, p->cpus_allowed))
+ if (!cpu_isset(dest_cpu, p->cpus_allowed) ||
+ unlikely(cpu_is_offline(dest_cpu)))
goto out;

/* force the process onto the specified CPU */
if (migrate_task(p, dest_cpu, &req)) {
/* Need to wait for migration thread. */
+ preempt_disable();
task_rq_unlock(rq, &flags);
wake_up_process(rq->migration_thread);
+ preempt_enable();
wait_for_completion(&req.done);

/*
@@ -1438,13 +1440,11 @@ static void sched_migrate_task(task_t *p
* the migration.
*/
tlb_migrate_prepare(current->mm);
- unlock_cpu_hotplug();

return;
}
out:
task_rq_unlock(rq, &flags);
- unlock_cpu_hotplug();
}

/*
@@ -3485,6 +3485,9 @@ static void __migrate_task(struct task_s
{
runqueue_t *rq_dest;

+ if (unlikely(cpu_is_offline(dest_cpu)))
+ return;
+
rq_dest = cpu_rq(dest_cpu);

double_rq_lock(this_rq(), rq_dest);
@@ -3540,6 +3543,8 @@ static int migration_thread(void * data)
if (list_empty(head)) {
spin_unlock_irq(&rq->lock);
schedule();
+ if (unlikely(cpu_is_offline(cpu)))
+ goto wait_to_die;
continue;
}
req = list_entry(head->next, migration_req_t, list);
@@ -3560,6 +3565,15 @@ static int migration_thread(void * data)
complete(&req->done);
}
return 0;
+wait_to_die:
+ /* Wait for kthread_stop */
+ set_current_state(TASK_INTERRUPTIBLE);
+ while (!kthread_should_stop()) {
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ }
+ __set_current_state(TASK_RUNNING);
+ return 0;
}

#ifdef CONFIG_HOTPLUG_CPU
@@ -3630,6 +3644,8 @@ static int migration_call(struct notifie
struct task_struct *p;
struct runqueue *rq;
unsigned long flags;
+ struct list_head *head;
+ migration_req_t *req, *tmp;

switch (action) {
case CPU_UP_PREPARE:
@@ -3652,6 +3668,22 @@ static int migration_call(struct notifie
/* Unbind it from offline cpu so it can run. Fall thru. */
kthread_bind(cpu_rq(cpu)->migration_thread,smp_processor_id());
case CPU_DEAD:
+ rq = cpu_rq(cpu);
+ head = &rq->migration_queue;
+ spin_lock_irq(&rq->lock);
+ list_for_each_entry_safe(req, tmp, head, list) {
+
+ BUG_ON(req->type != REQ_MOVE_TASK);
+
+ list_del_init(&req->list);
+
+ /* No need to migrate the task in the request. It would
+ * have already been migrated (maybe to a different
+ * CPU). Just wake up the requestor.
+ */
+ complete(&req->done);
+ }
+ spin_unlock_irq(&rq->lock);
kthread_stop(cpu_rq(cpu)->migration_thread);
cpu_rq(cpu)->migration_thread = NULL;
BUG_ON(cpu_rq(cpu)->nr_running != 0);

_



The above patch has been running fine against 2.6.6-rc2-mm1 for the last 3 hours.

The same patch, when I tried it against 2.6.6-rc1-mm1, led to
this BUG in include/asm-i386/mmu_context.h after running for 1 hour.

BUG_ON(cpu_tlbstate[cpu].active_mm != next);

Stack trace was:

------------[ cut here ]------------
kernel BUG at include/asm/mmu_context.h:53!
invalid operand: 0000 [#1]
PREEMPT SMP
CPU: 0
EIP: 0060:[<c03c94a2>] Not tainted VLI
EFLAGS: 00010087 (2.6.6-rc1-mm1)
EIP is at schedule+0x482/0x62d
eax: 00000000 ebx: d34a0a00 ecx: 00000000 edx: f73f1a80
esi: d01b2b70 edi: c170bca0 ebp: ebcbbfb0 esp: ebcbbf5c
ds: 007b es: 007b ss: 0068
Process kstopmachine (pid: 19332, threadinfo=ebcba000 task=d34a0850)
Stack: d34a0850 c170bca0 00000001 ebcbbf8c f73f1a80 d01b2b70 00000003 00000000
cad73ee8 ebcba000 cad73ee0 00000292 d01b2b70 00002fcb c04c53d9 0000037b
d34a0850 d34a0a00 ebcba000 fffffff0 cad73ed8 c013bd3f c013bd7b 00000000
Call Trace:
[<c013bd3f>] do_stop+0x0/0x6f
[<c013bd7b>] do_stop+0x3c/0x6f
[<c0133d39>] kthread+0xb7/0xbd
[<c0133c82>] kthread+0x0/0xbd
[<c01023a1>] kernel_thread_helper+0x5/0xb


Don't know if it is related to something that is fixed in 2.6.6-rc2-mm1.

Will update this list tomorrow with how my tests are faring on 2.6.6-rc2-mm1.




--


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2004-04-22 07:03:54

by Andrew Morton

Subject: Re: CPU Hotplug broken -mm5 onwards

Srivatsa Vaddagiri <[email protected]> wrote:
>
> Anyway, the patch below addresses the above races. It's against 2.6.6-rc2-mm1
> and has been tested on a 4-way Intel Pentium SMP machine. I have added a
> cpu_is_offline check in migration_thread(); if the cpu is offline, the
> migration thread stops processing requests and just waits for kthread_stop.

I'll take the stunned silence as agreement ;)

It needed a warning fixup:

kernel/sched.c: In function `migration_call':
kernel/sched.c:3648: warning: unused variable `tmp'
kernel/sched.c:3648: warning: unused variable `req'
kernel/sched.c:3647: warning: unused variable `head'

---

25-akpm/kernel/sched.c | 44 ++++++++++++++++++++++++--------------------
1 files changed, 24 insertions(+), 20 deletions(-)

diff -puN kernel/sched.c~sched-remove_hotplug_lock_in_sched_migrate_task-warnings kernel/sched.c
--- 25/kernel/sched.c~sched-remove_hotplug_lock_in_sched_migrate_task-warnings 2004-04-21 23:56:43.736745848 -0700
+++ 25-akpm/kernel/sched.c 2004-04-21 23:58:42.628671528 -0700
@@ -3339,8 +3339,6 @@ static int migration_call(struct notifie
struct task_struct *p;
struct runqueue *rq;
unsigned long flags;
- struct list_head *head;
- migration_req_t *req, *tmp;

switch (action) {
case CPU_UP_PREPARE:
@@ -3363,25 +3361,31 @@ static int migration_call(struct notifie
/* Unbind it from offline cpu so it can run. Fall thru. */
kthread_bind(cpu_rq(cpu)->migration_thread,smp_processor_id());
case CPU_DEAD:
- rq = cpu_rq(cpu);
- head = &rq->migration_queue;
- spin_lock_irq(&rq->lock);
- list_for_each_entry_safe(req, tmp, head, list) {
-
- BUG_ON(req->type != REQ_MOVE_TASK);
-
- list_del_init(&req->list);
-
- /* No need to migrate the task in the request. It would
- * have already been migrated (maybe to a different
- * CPU). Just wake up the requestor.
- */
- complete(&req->done);
+ {
+ struct list_head *head;
+ migration_req_t *req, *tmp;
+
+ rq = cpu_rq(cpu);
+ head = &rq->migration_queue;
+ spin_lock_irq(&rq->lock);
+ list_for_each_entry_safe(req, tmp, head, list) {
+
+ BUG_ON(req->type != REQ_MOVE_TASK);
+
+ list_del_init(&req->list);
+
+ /*
+ * No need to migrate the task in the request.
+ * It would have already been migrated (maybe to
+ * a different CPU). Just wake up the requestor
+ */
+ complete(&req->done);
+ }
+ spin_unlock_irq(&rq->lock);
+ kthread_stop(cpu_rq(cpu)->migration_thread);
+ cpu_rq(cpu)->migration_thread = NULL;
+ BUG_ON(cpu_rq(cpu)->nr_running != 0);
}
- spin_unlock_irq(&rq->lock);
- kthread_stop(cpu_rq(cpu)->migration_thread);
- cpu_rq(cpu)->migration_thread = NULL;
- BUG_ON(cpu_rq(cpu)->nr_running != 0);
break;
#endif
}

_