2004-06-01 20:09:59

by Bjorn Helgaas

Subject: [PATCH] active_load_balance() deadlock

active_load_balance() looks susceptible to deadlock when busiest==rq.
Without the following patch, my 128-way box deadlocks consistently
during boot-time driver init.

===== kernel/sched.c 1.304 vs edited =====
--- 1.304/kernel/sched.c 2004-05-27 02:26:55 -06:00
+++ edited/kernel/sched.c 2004-05-31 12:03:32 -06:00
@@ -1847,6 +1847,8 @@
 		}
 
 		rq = cpu_rq(push_cpu);
+		if (busiest == rq)
+			goto next_group;
 		double_lock_balance(busiest, rq);
 		move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
 		spin_unlock(&rq->lock);
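
For context, the deadlock mechanism: double_lock_balance() assumes the
caller already holds busiest->lock. Reconstructed from memory of the
2.6 code of this era (details may differ from the exact tree patched
above), it looks roughly like:

	/*
	 * Sketch of 2.6-era double_lock_balance(); the caller already
	 * holds this_rq->lock.
	 */
	static inline void double_lock_balance(runqueue_t *this_rq,
					       runqueue_t *busiest)
	{
		if (unlikely(!spin_trylock(&busiest->lock))) {
			if (busiest < this_rq) {
				/* take both locks in address order to
				 * avoid ABBA deadlock between CPUs */
				spin_unlock(&this_rq->lock);
				spin_lock(&busiest->lock);
				spin_lock(&this_rq->lock);
			} else
				/* if busiest == this_rq, we spin here
				 * on a lock this CPU already holds */
				spin_lock(&busiest->lock);
		}
	}

When busiest == rq, the trylock fails (the lock is already held), the
address comparison is false, and the else branch spins forever on the
runqueue's own lock - hence the check added above.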





2004-06-01 20:20:03

by Linus Torvalds

Subject: Re: [PATCH] active_load_balance() deadlock



On Tue, 1 Jun 2004, Bjorn Helgaas wrote:
>
> active_load_balance() looks susceptible to deadlock when busiest==rq.
> Without the following patch, my 128-way box deadlocks consistently
> during boot-time driver init.

Makes sense. The regular "load_balance()" already has that test, although
it also makes it a WARN_ON() for some unexplained reason (I assume
find_busiest_group() isn't supposed to find the local group, although it
doesn't seem to be documented anywhere).
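
(For reference, the test in load_balance() reads roughly like the
sketch below - reconstructed from memory of the 2.6 source, not quoted
from the tree:

	busiest = find_busiest_queue(group);
	if (!busiest)
		goto out_balanced;

	/* "impossible" if find_busiest_group() never returns the
	 * local group - hence the warning */
	if (unlikely(busiest == this_rq)) {
		WARN_ON(1);
		goto out_balanced;
	}

active_load_balance() simply lacked the equivalent check.)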

Ingo, Andrew?

Linus

2004-06-01 20:44:53

by Ingo Molnar

Subject: Re: [PATCH] active_load_balance() deadlock


* Linus Torvalds <[email protected]> wrote:

> On Tue, 1 Jun 2004, Bjorn Helgaas wrote:
> >
> > active_load_balance() looks susceptible to deadlock when busiest==rq.
> > Without the following patch, my 128-way box deadlocks consistently
> > during boot-time driver init.
>
> Makes sense. The regular "load_balance()" already has that test,
> although it also makes it a WARN_ON() for some unexplained reason (I
> assume find_busiest_group() isn't supposed to find the local group,
> although it doesn't seem to be documented anywhere).
>
> Ingo, Andrew?

looks good to me. The condition is 'impossible', but the whole balancing
code is (intentionally) a bit racy:

	cpus_and(tmp, group->cpumask, cpu_online_map);
	if (!cpus_weight(tmp))
		goto next_group;

	for_each_cpu_mask(i, tmp) {
		if (!idle_cpu(i))
			goto next_group;
		push_cpu = i;
	}

	rq = cpu_rq(push_cpu);
	double_lock_balance(busiest, rq);
	move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);

in the for_each_cpu_mask() loop we specifically check that each CPU in
the target group is idle - so push_cpu's runqueue == busiest [==
current runqueue] should never be true, because the current CPU is not
idle: we are running in the migration thread ... But this is not a
real problem: we do load-balancing in a racy way to reduce overhead
[and it's all statistics anyway, so absolute accuracy is impossible],
and active balancing itself is somewhat racy because the
migration-thread wakeup (and the active_balance flag) happens outside
the runqueue locks [for similar reasons].
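
to illustrate, with Bjorn's check applied the tail of the loop reads
roughly like this (a sketch only; the comments mark the racy part):

	/* every CPU in the group looked idle just now - but this scan
	 * runs without the target runqueue locks, so the observation
	 * can already be stale */
	for_each_cpu_mask(i, tmp) {
		if (!idle_cpu(i))
			goto next_group;
		push_cpu = i;
	}

	rq = cpu_rq(push_cpu);
	/* a stale observation can leave busiest == rq; without this
	 * test, double_lock_balance() would spin forever on
	 * busiest->lock, which this CPU already holds */
	if (busiest == rq)
		goto next_group;
	double_lock_balance(busiest, rq);
	move_tasks(rq, push_cpu, busiest, 1, sd, IDLE);
	spin_unlock(&rq->lock);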

so it all looks quite plausible - normal SMP boxes don't trigger it,
but Bjorn's 128-CPU setup with a non-trivial domain hierarchy does.

Signed-off-by: Ingo Molnar <[email protected]>

Ingo