Hi Ingo,
the patch eliminating the frozen_lock which went into 2.5.22 requires
macros for prepare_arch_schedule(), etc... On IA64 I think that we need
to use the "complex" macros, otherwise there is a potential deadlock.
Unfortunately the "complex" macros seem to have the problem that the "prev"
task can be stolen right before the context switch. In your description of
the patch you wrote:
> architectures that need to unlock the runqueue before doing the switch can
> do the following:
>
> #define prepare_arch_schedule(prev) task_lock(prev)
> #define finish_arch_schedule(prev) task_unlock(prev)
> #define prepare_arch_switch(rq) spin_unlock(&(rq)->lock)
> #define finish_arch_switch(rq) __sti()
>
> this way the task-lock makes sure that a task is not scheduled on some
> other CPU before the switch-out finishes, but the runqueue lock is
> dropped.
Suppose we have
cpu1: idle1
cpu2: prev2 -> next2 (in the switch)
I don't understand how task_lock(prev2), taken on cpu2, can prevent cpu1
from scheduling prev2, which it stole right after the RQ#2 lock was
released. cpu1 will just task_lock(idle1), which will succeed.
The problems I have are with a small and ugly test program which spawns
128 threads which all just do system("exit"); Running this program
continuously sooner or later leads to attempts to switch to tasks which
have already exited. Inserting a small udelay after prepare_arch_switch
exposes the problem sooner.
I also tried inserting "while (spin_is_locked(&(next)->alloc_lock));"
but it didn't help.
Any idea how to fix this problem?
Thanks a lot in advance,
best regards,
Erich
PS: Below is the relevant part of schedule():
	prepare_arch_schedule(prev);
	...
	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		prepare_arch_switch(rq);
		// while (spin_is_locked(&(next)->alloc_lock));
		prev = context_switch(prev, next);
		barrier();
		rq = this_rq();
		finish_arch_switch(rq);
	} else
		spin_unlock_irq(&rq->lock);
	finish_arch_schedule(prev);
PPS: The potential deadlock on IA64 for the "simple" macros comes from
the fact that we sometimes wrap the context counter and need to
read_lock(tasklist_lock). Doing this without releasing the RQ can lead
to a deadlock with sys_wait4, where a write_lock(&tasklist_lock) is
needed, inside of which a __wake_up needs the RQ lock :-(
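Spelled out, the two lock orders collide roughly like this (pseudocode only, not the actual ia64 code paths):

```
/* CPU 1: schedule() with the "simple" macros, rq->lock held */
spin_lock(&rq->lock);
/* ... context counter wraps ... */
read_lock(&tasklist_lock);   /* blocks: CPU 2 holds it for write */

/* CPU 2: sys_wait4() */
write_lock(&tasklist_lock);
__wake_up(...);              /* needs rq->lock -> blocks on CPU 1 */
```

A classic ABBA deadlock, which is why the RQ lock has to be dropped first.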
On Tue, 9 Jul 2002, Erich Focht wrote:
> Suppose we have
> cpu1: idle1
> cpu2: prev2 -> next2 (in the switch)
>
> I don't understand how task_lock(prev2), taken on cpu2, can prevent cpu1
> from scheduling prev2, which it stole right after the RQ#2 lock was
> released. cpu1 will just task_lock(idle1), which will succeed.
you are right - the 'complex' macros also need to lock the 'next' task,
not only the 'previous' task - but to do that deadlock-free, they need to
drop the runqueue lock ...
Ingo
the best solution might be to just lock the 'next' task - this needs a new
per-task irq-safe spinlock, to avoid deadlocks. This way whenever a task
is in the middle of a context-switch it cannot be scheduled on another
CPU.
in fact this solution simplifies things - only two per-arch macros are
needed. I've done this in my current 2.5.25 tree:
http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A4
check out the sparc64 changes for the 'complex' locking scenario - it's
untested, please give it a go on ia64, does that solve your problems? x86
is tested and works just fine.
Ingo
Hi Ingo,
thanks for the quick response!
> the best solution might be to just lock the 'next' task - this needs a new
> per-task irq-safe spinlock, to avoid deadlocks. This way whenever a task
> is in the middle of a context-switch it cannot be scheduled on another
> CPU.
We tested this and it looked good. But inserting a udelay(100) like:
...
prepare_arch_switch(rq, next);
udelay(100);
prev = context_switch(prev, next);
...
leads to a crash after 10 minutes. Again this looks like accessing an
empty page.
Does anything speak against such a test? It is there just to quickly
expose problems which would otherwise show up only after hours of running.
Regards,
Erich
On Wed, 10 Jul 2002, Erich Focht wrote:
> > the best solution might be to just lock the 'next' task - this needs a new
> > per-task irq-safe spinlock, to avoid deadlocks. This way whenever a task
> > is in the middle of a context-switch it cannot be scheduled on another
> > CPU.
>
> We tested this and it looked good. But inserting a udelay(100) like:
> ...
> prepare_arch_switch(rq, next);
> udelay(100);
> prev = context_switch(prev, next);
> ...
> leads to a crash after 10 minutes. Again this looks like accessing an
> empty page.
there is one more detail - wait_task_inactive() needs to consider the
->switch_lock as well - otherwise exit() might end up freeing the
pagetables earlier than the context-switch has truly finished. The
udelay(100) test should trigger this race.
i've fixed this and uploaded the -A8 patch:
http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A8
does this fix the ia64 crashes? you need to define an ia64-specific
task_running(rq, p) macro, which should be something like:
#define task_running(rq, p) \
((rq)->curr == (p)) && !spin_is_locked(&(p)->switch_lock)
a number of other places needed to be updated to use the task_running()
macro. For load_balance() and set_cpus_allowed() it's technically not
necessary, but i've added it to make things cleaner and safer for the time
being.
the default locking is still as lightweight as it used to be.
> Does anything speak against such a test? It is there just to quickly
> expose problems which would otherwise show up only after hours of
> running.
the udelay() test should be fine otherwise. (as long as the ia64 udelay
doesn't do anything weird.)
Ingo
> #define task_running(rq, p) \
> ((rq)->curr == (p)) && !spin_is_locked(&(p)->switch_lock)
one more implementation note: the above test is not 'sharp', in the sense
that on SMP it's only correct (the test has no barriers) if the runqueue
lock is held. This is true for all the critical task_running() uses in
sched.c - and the cases that use it outside the runqueue lock are
optimizations, so they don't need an exact test.
Ingo
the correct macro is:
#define task_running(rq, p) \
(((rq)->curr == (p)) || spin_is_locked(&(p)->switch_lock))
Ingo
Hi Ingo,
> there is one more detail - wait_task_inactive() needs to consider the
> ->switch_lock as well - otherwise exit() might end up freeing the
> pagetables earlier than the context-switch has truly finished. The
> udelay(100) test should trigger this race.
>
> i've fixed this and uploaded the -A8 patch:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-2.5.25-A8
>
> does this fix the ia64 crashes? you need to define an ia64-specific
looks good! Rock solid despite the udelay. Though a bit slower than before...
Thanks a lot for the help!
Erich
Ingo,
the last fix survived one night of testing with included udelay(100), so
I would consider it stable. Thanks again!
The performance with the new "complex" macros is, as expected, worse than
with the simple ones; we see a 10% slowdown in some cases, which hurts a lot.
Would you please consider moving the switch_lock from the end of the
task_struct to the hot area near the scheduler-related variables? The
effect may vary depending on the cache line size, but having it behind
*array or sleep_timestamp would probably save us a cache miss here.
Regards,
Erich
Hi!
> > #define task_running(rq, p) \
> > ((rq)->curr == (p)) && !spin_is_locked(&(p)->switch_lock)
>
> one more implementation note: the above test is not 'sharp', in the sense
> that on SMP it's only correct (the test has no barriers) if the runqueue
> lock is held. This is true for all the critical task_running() uses in
> sched.c - and the cases that use it outside the runqueue lock are
> optimizations, so they don't need an exact test.
I believe this is worth a *big fat* comment.
Pavel
--
Worst form of spam? Adding advertisement signatures à la sourceforge.net.
What goes next? Inserting advertisements *into* email?