the attached patch, against BK-curr, fixes an SMP window where the kernel
could fail to handle a signal promptly, increasing signal delivery latency
by up to 200 msecs. Sun has reported to Ulrich that their JVM sees
occasional unexpected signal delays under Linux; the more CPUs, the more
delays.
the cause of the problem is that the current signal wakeup implementation
is racy in kernel/signal.c:signal_wake_up():
	if (t->state == TASK_RUNNING)
		kick_if_running(t);
	...
	if (t->state & mask) {
		wake_up_process(t);
		return;
	}
if thread (or process) 't' is woken up on another CPU right after the
TASK_RUNNING check and starts to run, then the wake_up_process() here will
do nothing, and the signal stays pending until the thread next enters the
kernel - which can be up to 200 msecs later.
the solution is to do the 'kicking' of a running thread on a remote CPU
atomically with the wakeup. For this i've added wake_up_process_kick().
There is no slowdown for the other wakeup codepaths: the new flag to
try_to_wake_up() is compiled out for them. Some other subsystems might
want to use this wakeup facility as well in the future (eg. AIO).
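in condensed form - a sketch only, the full diff is below - the locked
path with the new 'kick' flag looks like this:

	static int try_to_wake_up(task_t * p, unsigned int state, int sync, int kick)
	{
		unsigned long flags;
		runqueue_t *rq;
		int success = 0;

		rq = task_rq_lock(p, &flags);		/* pins p's runqueue and state */
		if (p->state & state) {
			if (!p->array) {
				activate_task(p, rq);	/* normal wakeup path */
				success = 1;
			} else if (unlikely(kick) && task_running(rq, p))
				resched_task(rq->curr);	/* IPI the CPU p is running on */
			p->state = TASK_RUNNING;
		}
		task_rq_unlock(rq, &flags);
		return success;
	}

ie. the 'is it running already' check and the resched-kick now happen
under the same runqueue lock as the wakeup itself, so the window after
the old lockless TASK_RUNNING check is gone.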
in fact this race triggers quite often under Volanomark runs; with this
change added, Volanomark performance is up from 500-800 to 2000-3000, on a
4-way x86 box.
please apply,
Ingo
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -530,6 +530,7 @@ extern void do_timer(struct pt_regs *);
extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
extern int FASTCALL(wake_up_process(struct task_struct * tsk));
+extern int FASTCALL(wake_up_process_kick(struct task_struct * tsk));
extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
extern void FASTCALL(sched_exit(task_t * p));
@@ -643,7 +644,6 @@ extern void wait_task_inactive(task_t *
#else
#define wait_task_inactive(p) do { } while (0)
#endif
-extern void kick_if_running(task_t * p);
#define remove_parent(p) list_del_init(&(p)->sibling)
#define add_parent(p, parent) list_add_tail(&(p)->sibling,&(parent)->children)
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -454,27 +454,12 @@ repeat:
}
#endif
-/*
- * kick_if_running - kick the remote CPU if the task is running currently.
- *
- * This code is used by the signal code to signal tasks
- * which are in user-mode, as quickly as possible.
- *
- * (Note that we do this lockless - if the task does anything
- * while the message is in flight then it will notice the
- * sigpending condition anyway.)
- */
-void kick_if_running(task_t * p)
-{
- if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
- resched_task(p);
-}
-
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
* @state: the mask of task states that can be woken
* @sync: do a synchronous wakeup?
+ * @kick: kick the CPU if the task is already running?
*
* Put it on the run-queue if it's not already there. The "current"
* thread is always on the run-queue (except when the actual
@@ -484,7 +469,7 @@ void kick_if_running(task_t * p)
*
* returns failure only if the task is already active.
*/
-static int try_to_wake_up(task_t * p, unsigned int state, int sync)
+static int try_to_wake_up(task_t * p, unsigned int state, int sync, int kick)
{
int success = 0, requeue_waker = 0;
unsigned long flags;
@@ -518,7 +503,9 @@ repeat_lock_task:
resched_task(rq->curr);
}
success = 1;
- }
+ } else
+ if (unlikely(kick) && task_running(rq, p))
+ resched_task(rq->curr);
p->state = TASK_RUNNING;
}
task_rq_unlock(rq, &flags);
@@ -543,12 +530,17 @@ repeat_lock_task:
int wake_up_process(task_t * p)
{
- return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
+ return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0, 0);
+}
+
+int wake_up_process_kick(task_t * p)
+{
+ return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0, 1);
}
int wake_up_state(task_t *p, unsigned int state)
{
- return try_to_wake_up(p, state, 0);
+ return try_to_wake_up(p, state, 0, 0);
}
/*
@@ -1389,7 +1381,7 @@ need_resched:
int default_wake_function(wait_queue_t *curr, unsigned mode, int sync)
{
task_t *p = curr->task;
- return try_to_wake_up(p, mode, sync);
+ return try_to_wake_up(p, mode, sync, 0);
}
/*
--- linux/kernel/signal.c.orig
+++ linux/kernel/signal.c
@@ -521,18 +521,6 @@ inline void signal_wake_up(struct task_s
set_tsk_thread_flag(t,TIF_SIGPENDING);
/*
- * If the task is running on a different CPU
- * force a reschedule on the other CPU to make
- * it notice the new signal quickly.
- *
- * The code below is a tad loose and might occasionally
- * kick the wrong CPU if we catch the process in the
- * process of changing - but no harm is done by that
- * other than doing an extra (lightweight) IPI interrupt.
- */
- if (t->state == TASK_RUNNING)
- kick_if_running(t);
- /*
* If resume is set, we want to wake it up in the TASK_STOPPED case.
* We don't check for TASK_STOPPED because there is a race with it
* executing another processor and just now entering stopped state.
@@ -543,7 +531,7 @@ inline void signal_wake_up(struct task_s
if (resume)
mask |= TASK_STOPPED;
if (t->state & mask) {
- wake_up_process(t);
+ wake_up_process_kick(t);
return;
}
}
> the attached patch, against BK-curr, fixes an SMP window where the
> kernel could fail to handle a signal promptly, increasing signal
> delivery latency by up to 200 msecs.
This has shown some good results on our testing, but it yielded a hang
under some circumstances. I think I've located the problem.
When you decide to call resched_task(), you're holding a runqueue lock.
I think this gets us into the same mess the mm stuff did a couple of
months ago with IPIs.
    proc A                              proc B

    grabs rq lock                       tries to grab rq lock
    rescheds a task on that rq          spins with interrupts disabled
    sends IPI to all processors         <hangs>
    <hangs>
    releases lock
Holding the lock does two things: it allows us to set p's state to
RUNNING, and it ensures our task_running() check is valid for more than
an instant. In the case of the call to resched_task(), the small window
between unlocking the runqueue and calling resched_task() should be ok.
If p is no longer running, there's no real harm done that I can see.
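In sketch form - condensed from the diff below - the branch in question
ends up looking like this:

	} else {
		if (unlikely(kick) && task_running(rq, p)) {
			task_rq_unlock(rq, &flags);	/* drop rq->lock first ... */
			resched_task(rq->curr);		/* ... then generate the IPI */
		} else {
			p->state = TASK_RUNNING;
			task_rq_unlock(rq, &flags);
		}
	}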
Here's a patch which releases the runqueue lock before calling
resched_task(). It could probably be prettied up a bit (there are a lot of
unlocks now, one per if branch), but it compiles and gets the job done.
I've tested it to make sure it doesn't break anything, but have so far been
unable to test it on the machine which exhibits the hang.
Rick
diff -rup linux-2.5.69/include/linux/sched.h linux-2.5.69-sl/include/linux/sched.h
--- linux-2.5.69/include/linux/sched.h Sun May 4 16:53:02 2003
+++ linux-2.5.69-sl/include/linux/sched.h Tue May 20 17:29:48 2003
@@ -525,6 +525,7 @@ extern void do_timer(struct pt_regs *);
extern int FASTCALL(wake_up_state(struct task_struct * tsk, unsigned int state));
extern int FASTCALL(wake_up_process(struct task_struct * tsk));
+extern int FASTCALL(wake_up_process_kick(struct task_struct * tsk));
extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
extern void FASTCALL(sched_exit(task_t * p));
@@ -638,7 +639,6 @@ extern void wait_task_inactive(task_t *
#else
#define wait_task_inactive(p) do { } while (0)
#endif
-extern void kick_if_running(task_t * p);
#define remove_parent(p) list_del_init(&(p)->sibling)
#define add_parent(p, parent) list_add_tail(&(p)->sibling,&(parent)->children)
diff -rup linux-2.5.69/kernel/sched.c linux-2.5.69-sl/kernel/sched.c
--- linux-2.5.69/kernel/sched.c Sun May 4 16:53:37 2003
+++ linux-2.5.69-sl/kernel/sched.c Tue May 20 17:35:44 2003
@@ -454,27 +454,12 @@ repeat:
}
#endif
-/*
- * kick_if_running - kick the remote CPU if the task is running currently.
- *
- * This code is used by the signal code to signal tasks
- * which are in user-mode, as quickly as possible.
- *
- * (Note that we do this lockless - if the task does anything
- * while the message is in flight then it will notice the
- * sigpending condition anyway.)
- */
-void kick_if_running(task_t * p)
-{
- if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
- resched_task(p);
-}
-
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
* @state: the mask of task states that can be woken
* @sync: do a synchronous wakeup?
+ * @kick: kick the CPU if the task is already running?
*
* Put it on the run-queue if it's not already there. The "current"
* thread is always on the run-queue (except when the actual
@@ -484,7 +469,7 @@ void kick_if_running(task_t * p)
*
* returns failure only if the task is already active.
*/
-static int try_to_wake_up(task_t * p, unsigned int state, int sync)
+static int try_to_wake_up(task_t * p, unsigned int state, int sync, int kick)
{
int success = 0, requeue_waker = 0;
unsigned long flags;
@@ -518,10 +503,20 @@ repeat_lock_task:
resched_task(rq->curr);
}
success = 1;
+ p->state = TASK_RUNNING;
+ task_rq_unlock(rq, &flags);
+ } else {
+ if (unlikely(kick) && task_running(rq, p)) {
+ task_rq_unlock(rq, &flags);
+ resched_task(rq->curr);
+ } else {
+ p->state = TASK_RUNNING;
+ task_rq_unlock(rq, &flags);
+ }
}
- p->state = TASK_RUNNING;
- }
- task_rq_unlock(rq, &flags);
+ } else
+ task_rq_unlock(rq, &flags);
+
/*
* We have to do this outside the other spinlock, the two
@@ -543,12 +538,17 @@ repeat_lock_task:
int wake_up_process(task_t * p)
{
- return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
+ return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0, 0);
+}
+
+int wake_up_process_kick(task_t * p)
+{
+ return try_to_wake_up(p, TASK_STOPPED | TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0, 1);
}
int wake_up_state(task_t *p, unsigned int state)
{
- return try_to_wake_up(p, state, 0);
+ return try_to_wake_up(p, state, 0, 0);
}
/*
@@ -1389,7 +1389,7 @@ need_resched:
int default_wake_function(wait_queue_t *curr, unsigned mode, int sync)
{
task_t *p = curr->task;
- return try_to_wake_up(p, mode, sync);
+ return try_to_wake_up(p, mode, sync, 0);
}
/*
diff -rup linux-2.5.69/kernel/signal.c linux-2.5.69-sl/kernel/signal.c
--- linux-2.5.69/kernel/signal.c Sat Apr 19 19:49:11 2003
+++ linux-2.5.69-sl/kernel/signal.c Tue May 20 17:29:48 2003
@@ -521,18 +521,6 @@ inline void signal_wake_up(struct task_s
set_tsk_thread_flag(t,TIF_SIGPENDING);
/*
- * If the task is running on a different CPU
- * force a reschedule on the other CPU to make
- * it notice the new signal quickly.
- *
- * The code below is a tad loose and might occasionally
- * kick the wrong CPU if we catch the process in the
- * process of changing - but no harm is done by that
- * other than doing an extra (lightweight) IPI interrupt.
- */
- if (t->state == TASK_RUNNING)
- kick_if_running(t);
- /*
* If resume is set, we want to wake it up in the TASK_STOPPED case.
* We don't check for TASK_STOPPED because there is a race with it
* executing another processor and just now entering stopped state.
@@ -543,7 +531,7 @@ inline void signal_wake_up(struct task_s
if (resume)
mask |= TASK_STOPPED;
if (t->state & mask) {
- wake_up_process(t);
+ wake_up_process_kick(t);
return;
}
}
On Wed, 21 May 2003, Rick Lindsley wrote:
> This has shown some good results on our testing [...]
(do you have any numeric results?)
> [...] but it yielded a hang under some circumstances. I think I've
> located the problem.
>
> When you decide to call resched_task(), you're holding a runqueue lock.
> I think this gets us into the same mess the mm stuff did a couple of
> months ago with IPIs.
It's perfectly safe to _generate_ an IPI from under the runqueue lock. We
do it in a number of cases, and did it for years. What precise hardware
did you run this on?
>     proc A                              proc B
>
>     grabs rq lock                       tries to grab rq lock
>     rescheds a task on that rq          spins with interrupts disabled
>     sends IPI to all processors         <hangs>
>     <hangs>
>     releases lock
firstly, the IPI is not sent to all processors, it's sent to a single
target CPU. Secondly, where and why would the hang happen?
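for reference, resched_task() in kernel/sched.c looks roughly like this
(paraphrased from memory, details may differ slightly) - it marks the
task and generates at most one IPI, to the single CPU the task is running
on, and it never waits for the target CPU to react:

	static inline void resched_task(task_t *p)
	{
		int need_resched, nrpolling;

		preempt_disable();
		/* avoid IPI-ing an idle CPU that is polling the resched flag */
		nrpolling = test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);
		need_resched = test_and_set_tsk_thread_flag(p, TIF_NEED_RESCHED);
		nrpolling |= test_tsk_thread_flag(p, TIF_POLLING_NRFLAG);

		if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id()))
			smp_send_reschedule(task_cpu(p));	/* single-target IPI */
		preempt_enable();
	}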
we do loop in one place in the x86 APIC code, apic_wait_icr_idle(). I
believe this should never loop indefinitely, unless the hw is broken.
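that loop just spins on the delivery-status bit in the local APIC's ICR,
waiting for the previously written IPI command to be sent - roughly
(reconstructed from arch/i386/kernel/smp.c, wording from memory):

	static inline void apic_wait_icr_idle(void)
	{
		/* wait until the previous ICR command has left the local APIC */
		while (apic_read(APIC_ICR) & APIC_ICR_BUSY)
			cpu_relax();
	}

ie. it waits for the local APIC to finish sending, not for the target CPU
to handle the interrupt.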
> Holding the lock does two things: it allows us to set p's state to
> RUNNING, and insures our task_running() check is valid for more than an
> instant. In the case of the call to resched_task(), the small window
> between unlocking the runqueue and calling resched_task() should be ok.
> If p is no longer running, there's no real harm done that I can see.
i don't disagree in theory with offloading the SMP-message sending outside
the critical section (it makes things more parallel, and the APIC message
itself goes asynchronously anyway), but i'd like to understand the precise
reason for the hang first. The current code should not hang.
Ingo
> > This has shown some good results on our testing [...]
> (do you have any numeric results?)
Sure. We were running a web server benchmark with which we've had
latency problems in the past. With your patch applied, it showed
about an 8% improvement, but it reliably produced the hang upon shutdown
of the benchmark.
> It's perfectly safe to _generate_ an IPI from under the runqueue
> lock. We do it in a number of cases, and did it for years. What
> precise hardware did you run this on?
This was x86 -- since we have a variety of configurations I'll have to
check back with the person who reported the problem and tested the patch
for me to get more precise info.
True, generation isn't a problem. What *could* be a problem is if we're
waiting for any kind of response or status change from a processor as a
result of that IPI (or perhaps, in this case, as a prerequisite to sending
it). If the message is at all bidirectional, we may have problems.
> firstly, the IPI is not sent to all processors, it's sent to a single
> target CPU. Secondly, where and why would the hang happen?
You're right, I misspoke about it going to all processors, confusing it
with the TLB flush code above it. But that doesn't really change my premise.
I initially believed that one processor was unable to respond to the IPI
because it had blocked interrupts. So my thought was that one process was
blocked on the runqueue lock (with interrupts disabled), while another
was blocked sending the IPI to that processor. You've introduced some
legitimate questions about that being the problem, despite the patch
making the hang disappear.
> we do loop in one place in the x86 APIC code, apic_wait_icr_idle(). I
> believe this should never loop indefinitely, unless the hw is broken.
This is probably the premier gray area for me now. I don't know enough
about APICs to tell under what conditions APIC_ICR_BUSY will be true.
Is it enough that you've disabled interrupts -- will the APIC show
busy then?
> but i'd like to understand the precise reason for the hang first.
I'm all for that.
We've been unsuccessful in getting much information from the hangs
themselves. nmi_watchdog has given us a couple of stack traces, but it's
always hard to tell the culprits from the victims. We had this nasty
interaction between runqueue locks and IPIs before with the TLB flush code,
and although the evidence wasn't nearly so clear, I thought it might be
worth trying the patch. So far it has worked, but it wouldn't be the
first time new code just shifted windows around rather than closing
them completely.
Rick