wait_task_inactive() must return success only when/if the task leaves
the runqueue. It doesn't work correctly when !SMP && PREEMPT, because
in that case we use the "dummy" version which doesn't check .on_rq.
Change the "#if" around the full-blown version from SMP to SMP || PREEMPT.
The patch looks monstrous because it moves the (unchanged) definition
of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
trivial.
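For reference, the !SMP "dummy" version in include/linux/sched.h is roughly
the stub below; it reports success unconditionally and never looks at .on_rq
(sketch for illustration, see the sched.h hunk further down for the exact
context):

	#ifdef CONFIG_SMP
	extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
	#else
	static inline unsigned long wait_task_inactive(struct task_struct *p,
						       long match_state)
	{
		return 1;	/* never checks p->se.on_rq */
	}
	#endif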
Signed-off-by: Oleg Nesterov <[email protected]>
include/linux/sched.h | 2
kernel/sched.c | 210 +++++++++++++++++++++++++-------------------------
2 files changed, 107 insertions(+), 105 deletions(-)
--- 27/include/linux/sched.h~3_WTI_PREEMPT 2008-07-30 20:28:31.000000000 +0400
+++ 27/include/linux/sched.h 2008-07-30 20:47:46.000000000 +0400
@@ -1874,7 +1874,7 @@ struct task_struct *fork_idle(int);
extern void set_task_comm(struct task_struct *tsk, char *from);
extern char *get_task_comm(char *to, struct task_struct *tsk);
-#ifdef CONFIG_SMP
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
extern unsigned long wait_task_inactive(struct task_struct *, long match_state);
#else
static inline unsigned long wait_task_inactive(struct task_struct *p,
--- 27/kernel/sched.c~3_WTI_PREEMPT 2008-07-30 19:47:56.000000000 +0400
+++ 27/kernel/sched.c 2008-07-30 20:43:41.000000000 +0400
@@ -1864,110 +1864,6 @@ migrate_task(struct task_struct *p, int
return 1;
}
-/*
- * wait_task_inactive - wait for a thread to unschedule.
- *
- * If @match_state is nonzero, it's the @p->state value just checked and
- * not expected to change. If it changes, i.e. @p might have woken up,
- * then return zero. When we succeed in waiting for @p to be off its CPU,
- * we return a positive number (its total switch count). If a second call
- * a short while later returns the same number, the caller can be sure that
- * @p has remained unscheduled the whole time.
- *
- * The caller must ensure that the task *will* unschedule sometime soon,
- * else this function might spin for a *long* time. This function can't
- * be called with interrupts off, or it may introduce deadlock with
- * smp_call_function() if an IPI is sent by the same process we are
- * waiting to become inactive.
- */
-unsigned long wait_task_inactive(struct task_struct *p, long match_state)
-{
- unsigned long flags;
- int running, on_rq;
- unsigned long ncsw;
- struct rq *rq;
-
- for (;;) {
- /*
- * We do the initial early heuristics without holding
- * any task-queue locks at all. We'll only try to get
- * the runqueue lock when things look like they will
- * work out!
- */
- rq = task_rq(p);
-
- /*
- * If the task is actively running on another CPU
- * still, just relax and busy-wait without holding
- * any locks.
- *
- * NOTE! Since we don't hold any locks, it's not
- * even sure that "rq" stays as the right runqueue!
- * But we don't care, since "task_running()" will
- * return false if the runqueue has changed and p
- * is actually now running somewhere else!
- */
- while (task_running(rq, p)) {
- if (match_state && unlikely(p->state != match_state))
- return 0;
- cpu_relax();
- }
-
- /*
- * Ok, time to look more closely! We need the rq
- * lock now, to be *sure*. If we're wrong, we'll
- * just go back and repeat.
- */
- rq = task_rq_lock(p, &flags);
- running = task_running(rq, p);
- on_rq = p->se.on_rq;
- ncsw = 0;
- if (!match_state || p->state == match_state)
- ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
- task_rq_unlock(rq, &flags);
-
- /*
- * If it changed from the expected state, bail out now.
- */
- if (unlikely(!ncsw))
- break;
-
- /*
- * Was it really running after all now that we
- * checked with the proper locks actually held?
- *
- * Oops. Go back and try again..
- */
- if (unlikely(running)) {
- cpu_relax();
- continue;
- }
-
- /*
- * It's not enough that it's not actively running,
- * it must be off the runqueue _entirely_, and not
- * preempted!
- *
- * So if it was still runnable (but just not actively
- * running right now), it's preempted, and we should
- * yield - it could be a while.
- */
- if (unlikely(on_rq)) {
- schedule_timeout_uninterruptible(1);
- continue;
- }
-
- /*
- * Ahh, all good. It wasn't running, and it wasn't
- * runnable, which means that it will never become
- * running in the future either. We're all done!
- */
- break;
- }
-
- return ncsw;
-}
-
/***
* kick_process - kick a running thread to enter/exit the kernel
* @p: the to-be-kicked thread
@@ -2176,6 +2072,112 @@ static int sched_balance_self(int cpu, i
#endif /* CONFIG_SMP */
+#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
+/*
+ * wait_task_inactive - wait for a thread to unschedule.
+ *
+ * If @match_state is nonzero, it's the @p->state value just checked and
+ * not expected to change. If it changes, i.e. @p might have woken up,
+ * then return zero. When we succeed in waiting for @p to be off its CPU,
+ * we return a positive number (its total switch count). If a second call
+ * a short while later returns the same number, the caller can be sure that
+ * @p has remained unscheduled the whole time.
+ *
+ * The caller must ensure that the task *will* unschedule sometime soon,
+ * else this function might spin for a *long* time. This function can't
+ * be called with interrupts off, or it may introduce deadlock with
+ * smp_call_function() if an IPI is sent by the same process we are
+ * waiting to become inactive.
+ */
+unsigned long wait_task_inactive(struct task_struct *p, long match_state)
+{
+ unsigned long flags;
+ int running, on_rq;
+ unsigned long ncsw;
+ struct rq *rq;
+
+ for (;;) {
+ /*
+ * We do the initial early heuristics without holding
+ * any task-queue locks at all. We'll only try to get
+ * the runqueue lock when things look like they will
+ * work out!
+ */
+ rq = task_rq(p);
+
+ /*
+ * If the task is actively running on another CPU
+ * still, just relax and busy-wait without holding
+ * any locks.
+ *
+ * NOTE! Since we don't hold any locks, it's not
+ * even sure that "rq" stays as the right runqueue!
+ * But we don't care, since "task_running()" will
+ * return false if the runqueue has changed and p
+ * is actually now running somewhere else!
+ */
+ while (task_running(rq, p)) {
+ if (match_state && unlikely(p->state != match_state))
+ return 0;
+ cpu_relax();
+ }
+
+ /*
+ * Ok, time to look more closely! We need the rq
+ * lock now, to be *sure*. If we're wrong, we'll
+ * just go back and repeat.
+ */
+ rq = task_rq_lock(p, &flags);
+ running = task_running(rq, p);
+ on_rq = p->se.on_rq;
+ ncsw = 0;
+ if (!match_state || p->state == match_state)
+ ncsw = p->nvcsw | LONG_MIN; /* sets MSB */
+ task_rq_unlock(rq, &flags);
+
+ /*
+ * If it changed from the expected state, bail out now.
+ */
+ if (unlikely(!ncsw))
+ break;
+
+ /*
+ * Was it really running after all now that we
+ * checked with the proper locks actually held?
+ *
+ * Oops. Go back and try again..
+ */
+ if (unlikely(running)) {
+ cpu_relax();
+ continue;
+ }
+
+ /*
+ * It's not enough that it's not actively running,
+ * it must be off the runqueue _entirely_, and not
+ * preempted!
+ *
+ * So if it was still runnable (but just not actively
+ * running right now), it's preempted, and we should
+ * yield - it could be a while.
+ */
+ if (unlikely(on_rq)) {
+ schedule_timeout_uninterruptible(1);
+ continue;
+ }
+
+ /*
+ * Ahh, all good. It wasn't running, and it wasn't
+ * runnable, which means that it will never become
+ * running in the future either. We're all done!
+ */
+ break;
+ }
+
+ return ncsw;
+}
+#endif
+
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
On Wed, 30 Jul 2008, Oleg Nesterov wrote:
>
> The patch looks monstrous because it moves the (unchanged) definition
> of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
> trivial.
Hmm. Doesn't this just deadlock in UP (PREEMPT) if wait_task_inactive()
is ever called from a no-preempt context?
And if that's never the case, the comment should be updated to reflect
that (right now it says that it's only invalid to call it with interrupts
disabled to avoid cross-IPI deadlocks).
Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?
Inquiring minds want to know. That function was very much expressly
designed for SMP, not for preemption, and I want to understand why it's
ok (_if_ it's ok).
Linus
On 07/30, Linus Torvalds wrote:
>
> On Wed, 30 Jul 2008, Oleg Nesterov wrote:
> >
> > The patch looks monstrous because it moves the (unchanged) definition
> > of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
> > trivial.
>
> Hmm. Doesn't this just deadlock in UP (PREEMPT) if wait_task_inactive()
> is ever called from a no-preempt context?
Given that it calls schedule_timeout_uninterruptible(), it can't be used
from a no-preempt context,
> And if that's never the case, the comment should be updated to reflect
> that (right now it says that it's only invalid to call it with interrupts
> disabled to avoid cross-IPI deadlocks).
Yes, I think this function should do might_sleep(),
> Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?
I _think_ that rq->curr must be == current without CONFIG_SMP, but
> and I want to understand why it's
> ok (_if_ it's ok).
me too.
Hopefully Ingo can ack/nack.
Oleg.
2008/7/30 Linus Torvalds <[email protected]>:
>
>
> On Wed, 30 Jul 2008, Oleg Nesterov wrote:
>>
>> The patch looks monstrous because it moves the (unchanged) definition
>> of wait_task_inactive() outside of "#ifdef CONFIG_SMP", but it is quite
>> trivial.
>
> Hmm. Doesn't this just deadlock in UP (PREEMPT) if wait_task_inactive()
> is ever called from a no-preempt context?
>
> And if that's never the case, the comment should be updated to reflect
> that (right now it says that it's only invalid to call it with interrupts
> disabled to avoid cross-IPI deadlocks).
>
> Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?
This part could have been skipped for UP. task_running(rq, p) just
can't give 'true' for UP (otherwise it's a bug). The only relevant
part is "on_rq = p->se.on_rq".
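(For reference: on !SMP, kernel/sched.c defines task_running() purely in
terms of rq->curr, roughly as below, so it can only be true when 'p' is the
caller itself:)

	static inline int task_running(struct rq *rq, struct task_struct *p)
	{
		return task_current(rq, p);	/* i.e. rq->curr == p */
	}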
>
> Inquiring minds want to know. That function was very much expressly
> designed for SMP,
It looks so. Otherwise its behavior is not symmetric, and I think
either [1] it shouldn't be a "nop" for !SMP or [2] there shouldn't be
a version for !SMP at all --> so no one can make false assumptions.
(if [1], then I think a separate function for PREEMPT would look
better, i.e. without the task_running() parts)
e.g. consider this code from kthread_bind():
/* Must have done schedule() in kthread() before we set_task_cpu */
wait_task_inactive(k, 0);
set_task_cpu(k, cpu);
k->cpus_allowed = cpumask_of_cpu(cpu);
k->rt.nr_cpus_allowed = 1;
k->flags |= PF_THREAD_BOUND;
set_task_cpu(k, cpu) is safe _only_ if 'k' is not on the run-queue
(and can't be placed onto it behind our back -- heh, a bit subtle).
Now, for !SMP + PREEMPT that's not the case. set_task_cpu() may be called
while 'k' is still on the run-queue (more precisely, preempted in
kthread() between complete(&create->started); and schedule();).
Yes, set_task_cpu() is a "nop" for UP, so that's ok in this particular
case. But suppose another use-case were introduced with 'false'
assumptions, causing troubles for !SMP.
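(The UP "nop" mentioned here is roughly just an empty inline, something like:)

	#ifdef CONFIG_SMP
	extern void set_task_cpu(struct task_struct *p, unsigned int cpu);
	#else
	static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)
	{
		/* UP: there is only one run-queue, nothing to record */
	}
	#endif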
> not for preemption, and I want to understand why it's
> ok (_if_ it's ok).
>
> Linus
--
Best regards,
Dmitry Adamushko
I'm not convinced we need this at all (as per prior thread
"Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT").
Let's not get hung up on what the explanation of the function's semantics
supposedly is. We can change the comments to match the reality. We can
rename the function to better reflect its true guarantee, if you like.
There are two kinds of uses of wait_task_inactive. One is kthread_bind.
We have established that a nop is fine for !SMP there.
The other uses are for tracing. Presently those are the traditional use in
ptrace, and task_current_syscall.
I've described that the requirement of the traditional ptrace call is quite
weak. A nop is fine for !SMP. Preemption is not an issue because all that
matters is that no access to pt_regs/thread_struct can race with the actual
context switch (which by definition has to be on another CPU to be simultaneous).
For task_current_syscall, it also doesn't matter whether it has left the run
queue. It just matters that the call before looking at pt_regs and the
call after indicate whether it stayed switched out for the duration. As
per the prior thread on this, I think that's covered by using nivcsw+nvcsw
as I did originally rather than nvcsw alone. So a short inline is still
fine.
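A rough illustration of that check-before/check-after model (a hypothetical
helper, not the actual task_current_syscall code):

	static int read_target_regs(struct task_struct *target, struct pt_regs *regs)
	{
		unsigned long ncsw;

		ncsw = wait_task_inactive(target, 0);	/* call before: snapshot */
		*regs = *task_pt_regs(target);		/* read while it is (hopefully) switched out */
		if (wait_task_inactive(target, 0) != ncsw)
			return -EAGAIN;			/* it ran in between, regs are stale */
		return 0;
	}

On !SMP the inline only needs to return a comparable switch count
(nivcsw+nvcsw) for this pattern to work; it never has to wait for anything.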
The other uses I anticipate (e.g. utrace) follow the same model as
task_current_syscall (check before and after) and have only the same
requirements.
Thanks,
Roland
On 07/30, Dmitry Adamushko wrote:
>
> 2008/7/30 Linus Torvalds <[email protected]>:
> >
> > Oh, and shouldn't it do a "yield()" instead of a cpu_relax() on UP?
>
> This part could have been skipped for UP. task_running(rq, p) just
> can't give 'true' for UP (otherwise it's a bug). The only relevant
> part is "on_rq = p->se.on_rq".
That was my understanding, thanks for your confirmation.
> (if [1], then I think a separate function for PREEMPT would look
> better, i.e. without the task_running() parts)
in that case we can also eliminate task_rq_lock() afaics.
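For illustration, such a PREEMPT-only variant (no task_running(), no
task_rq_lock()) could look roughly like this untested sketch:

	unsigned long wait_task_inactive(struct task_struct *p, long match_state)
	{
		unsigned long ncsw;

		for (;;) {
			if (match_state && unlikely(p->state != match_state))
				return 0;
			ncsw = p->nvcsw | LONG_MIN;	/* sets MSB, as in the SMP version */
			if (!p->se.on_rq)
				break;
			/* 'p' was preempted while runnable: sleep until it runs off the rq */
			schedule_timeout_uninterruptible(1);
		}
		return ncsw;
	}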
> e.g. consider this code from kthread_bind():
>
> /* Must have done schedule() in kthread() before we set_task_cpu */
> wait_task_inactive(k, 0);
>
> set_task_cpu(k, cpu);
> k->cpus_allowed = cpumask_of_cpu(cpu);
> k->rt.nr_cpus_allowed = 1;
> k->flags |= PF_THREAD_BOUND;
>
> set_task_cpu(k, cpu) is safe _only_ if 'k' is not on the run-queue
> (and can't be placed onto it behind our back -- heh, a bit subtle).
>
> Now, for !SMP + PREEMPT that's not the case. set_task_cpu() may be called
> while 'k' is still on the run-queue (more precisely, preempted in
> kthread() between complete(&create->started); and schedule();).
>
> Yes, set_task_cpu() is a "nop" for UP, so that's ok in this particular
> case. But suppose another use-case were introduced with 'false'
> assumptions, causing troubles for !SMP.
Completely agreed.
Currently the only user which can suffer on UP && PREEMPT is
task_current_syscall(). As Roland suggested, we can fix it if we change
the !SMP version to return ->nivcsw + ->nvcsw. But personally I don't
like the fact that we will have a subtle difference in behaviour
depending on CONFIG_SMP. We have to wait for .on_rq == 0 on SMP; imho
it is better to do the same on UP even if none of the current callers
needs this.
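(For concreteness, Roland's suggested !SMP inline would look something like
the sketch below; illustration only, not a tested patch:)

	static inline unsigned long wait_task_inactive(struct task_struct *p,
						       long match_state)
	{
		if (match_state && p->state != match_state)
			return 0;
		/* mirror the SMP version's "sets MSB" trick so the result is
		 * always nonzero and comparable across two calls */
		return (p->nvcsw + p->nivcsw) | LONG_MIN;
	}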
That said, I think this problem is minor and I don't (and can't) have
a strong opinion on how to fix it. I'd better wait for an authoritative
verdict.
Oleg.