2002-06-06 23:20:58

by Mike Kravetz

Subject: Scheduler Bug (set_cpus_allowed)

At my suggestion, the following optimization/fast path
was added to set_cpus_allowed():

	/*
	 * If the task is not on a runqueue, then it is sufficient
	 * to simply update the task's cpu field.
	 */
	if (!p->array) {
		p->thread_info->cpu = __ffs(p->cpus_allowed);
		task_rq_unlock(rq, &flags);
		goto out;
	}

Unfortunately, the assumption made here is not true.

Consider the case where a task is to give up the CPU
and schedule() is called. In such a case the current
task is removed from the runqueue (via deactivate_task).
Now, further assume that there are no runnable tasks
on the runqueue and we end up calling load_balance().
In load_balance, we potentially drop the runqueue lock
to obtain multiple runqueue locks in the proper order.
Now, when we drop the runqueue lock we will still
be running in the context of task p. However, p does
not reside on a runqueue. It is now possible for
set_cpus_allowed() to be called for p. We can get the
runqueue lock and take the fast path to simply update
the task's cpu field. If this happens, bad (very bad)
things will happen!
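
To make the interleaving concrete, here is a rough two-CPU timeline
of the scenario above (a paraphrase, not code from the tree):

	/*
	 * CPU0 (currently running p)        CPU1
	 * --------------------------------  --------------------------------
	 * schedule()
	 *   deactivate_task(p, rq0)
	 *   load_balance()
	 *     drops rq0->lock to take two
	 *     runqueue locks in order
	 *                                    set_cpus_allowed(p, new_mask)
	 *                                      task_rq_lock(p) succeeds
	 *                                      !p->array, so the fast path
	 *                                      sets p->thread_info->cpu to
	 *                                      another CPU and returns
	 *   p is still executing on CPU0,
	 *   but task_rq(p) now points at a
	 *   different runqueue
	 */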

My first thought was to simply add a check to the
above code to ensure that p was not currently running
(p != rq->curr). However, even with this change I
'think' the same problem exists later on in schedule()
where we drop the runqueue lock before doing a context
switch. At this point, p is not on the runqueue and
p != rq->curr, yet we are still running in the context
of p until we actually do the context switch. To tell
the truth, I'm not exactly sure what 'rq->frozen' lock
is for. Also, the routine 'schedule_tail' does not
appear to be used anywhere.

Comments?/Thoughts?/Suggestions?
--
Mike


2002-06-07 18:36:58

by Robert Love

Subject: Re: Scheduler Bug (set_cpus_allowed)

On Thu, 2002-06-06 at 16:20, Mike Kravetz wrote:

> Consider the case where a task is to give up the CPU
> and schedule() is called. In such a case the current
> task is removed from the runqueue (via deactivate_task).
> Now, further assume that there are no runnable tasks
> on the runqueue and we end up calling load_balance().
> In load_balance, we potentially drop the runqueue lock
> to obtain multiple runqueue locks in the proper order.
> Now, when we drop the runqueue lock we will still
> be running in the context of task p. However, p does
> not reside on a runqueue. It is now possible for
> set_cpus_allowed() to be called for p. We can get the
> runqueue lock and take the fast path to simply update
> the task's cpu field. If this happens, bad (very bad)
> things will happen!

Ugh I think you are right. This is an incredibly small race, though!

> My first thought was to simply add a check to the
> above code to ensure that p was not currently running
> (p != rq->curr). However, even with this change I
> 'think' the same problem exists later on in schedule()
> where we drop the runqueue lock before doing a context
> switch. At this point, p is not on the runqueue and
> p != rq->curr, yet we are still running in the context
> of p until we actually do the context switch. To tell
> the truth, I'm not exactly sure what 'rq->frozen' lock
> is for. Also, the routine 'schedule_tail' does not
> appear to be used anywhere.

I agree here, too.

Fyi, schedule_tail is called from assembly code on SMP machines. See
entry.S

rq->frozen is admittedly a bit confusing. Dave Miller added it - on
some architectures mm->page_table_lock is grabbed during switch_mm().
Additionally, swap_out() and others hold page_table_lock while doing
wakeups, which in turn grab rq->lock. Bad locking... Dave's solution was to make
another lock, the "frozen state lock", which is held around context
switches. This way we can protect the "not switched out yet" task
without grabbing the whole runqueue lock.
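
For reference, the tail of the 2.5.21 schedule() (reconstructed from the
diffs later in this thread) looks roughly like this; note that only the
CPU doing the switch ever touches rq->frozen:

	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		spin_lock(&rq->frozen);		/* the "frozen state lock" */
		spin_unlock(&rq->lock);

		context_switch(prev, next);

		/* this_rq() may have changed across the switch */
		mb();
		rq = this_rq();
		spin_unlock_irq(&rq->frozen);	/* schedule_tail() does this
						 * for freshly forked tasks */
	} else {
		spin_unlock_irq(&rq->lock);
	}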

Anyhow, with this issue, I guess we need to fix it... I'll send a patch
to Linus.

Robert Love

2002-06-07 19:13:45

by Mike Kravetz

Subject: Re: Scheduler Bug (set_cpus_allowed)

On Fri, Jun 07, 2002 at 11:36:46AM -0700, Robert Love wrote:
> On Thu, 2002-06-06 at 16:20, Mike Kravetz wrote:
>
> > Consider the case where a task is to give up the CPU
> > and schedule() is called. In such a case the current
> > task is removed from the runqueue (via deactivate_task).
> > Now, further assume that there are no runnable tasks
> > on the runqueue and we end up calling load_balance().
> > In load_balance, we potentially drop the runqueue lock
> > to obtain multiple runqueue locks in the proper order.
> > Now, when we drop the runqueue lock we will still
> > be running in the context of task p. However, p does
> > not reside on a runqueue. It is now possible for
> > set_cpus_allowed() to be called for p. We can get the
> > runqueue lock and take the fast path to simply update
> > the task's cpu field. If this happens, bad (very bad)
> > things will happen!
>
> Ugh I think you are right. This is an incredibly small race, though!

I agree that the race is small. I 'found' this while playing
with some experimental code that does the same thing as the
fast path in set_cpus_allowed: it sets the cpu field while
holding the rq lock if the task is not on the rq. This code
runs at a higher frequency (than would be expected of
set_cpus_allowed) and found this hole.

> > My first thought was to simply add a check to the
> > above code to ensure that p was not currently running
> > (p != rq->curr). However, even with this change I
> > 'think' the same problem exists later on in schedule()
> > where we drop the runqueue lock before doing a context
> > switch. At this point, p is not on the runqueue and
> > p != rq->curr, yet we are still running in the context
> > of p until we actually do the context switch. To tell
> > the truth, I'm not exactly sure what 'rq->frozen' lock
> > is for. Also, the routine 'schedule_tail' does not
> > appear to be used anywhere.
>
> I agree here, too.
>
> Fyi, schedule_tail is called from assembly code on SMP machines. See
> entry.S
>
> rq->frozen is admittedly a bit confusing. Dave Miller added it - on
> some architectures mm->page_table_lock is grabbed during switch_mm().
> Additionally, swap_out() and others grab page_table_lock during wakeups
> which also grab rq->lock. Bad locking... Dave's solution was to make
> another lock, the "frozen state lock", which is held around context
> switches. This way we can protect the "not switched out yet" task
> without grabbing the whole runqueue lock.
>
> Anyhow, with this issue, I guess we need to fix it... I'll send a patch
> to Linus.

What about the code in schedule() before calling context_switch()?
Consider the case where the task relinquishing the CPU is still
runnable. Before dropping the rq lock, we set rq->curr = next
(where next != current). After dropping the rq lock, can't we
race with the code in load_balance()? load_balance() will steal
'runnable' tasks as long as the task is not the one referenced by
rq->curr. Therefore, theoretically load_balance() could steal a
currently running task.
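
Roughly (paraphrasing the 2.5.21 code that the patches in this thread
touch), the window looks like this:

	rq->curr = next;		/* CAN_MIGRATE_TASK() on another CPU
					 * no longer sees prev as rq->curr   */
	spin_lock(&rq->frozen);
	spin_unlock(&rq->lock);		/* window opens: a racing CPU's
					 * load_balance() can lock rq and pull
					 * prev, which is still runnable and
					 * still executing right here        */
	context_switch(prev, next);	/* window closes only once prev has
					 * really been switched out          */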

Now, the window for this race is only from the time we drop the
rq lock to the time we do the context switch. During this time
we are running with interrupts disabled, so nothing should delay
us. In addition, the code path on a racing CPU would be much longer:
it has to move the task via load_balance() and schedule it on another CPU.

I believe there is a race, but it is a race we will never lose.
Does that make it a race at all? :)

In the old scheduler both tasks (prev and next) were marked as
'have_cpu' during the context switch. As such they were 'hands
off' to other CPUs. Should we implement something like this
here?

--
Mike

2002-06-07 23:20:11

by J.A. Magallon

Subject: Re: Scheduler Bug (set_cpus_allowed)


On 2002.06.07 Robert Love wrote:
>
>Anyhow, with this issue, I guess we need to fix it... I'll send a patch
>to Linus.
>

Plz, could you also post it to the list? -aa will need it too...

--
J.A. Magallon # Let the source be with you...
mailto:[email protected]
Mandrake Linux release 8.3 (Cooker) for i586
Linux werewolf 2.4.19-pre10-jam2 #1 SMP vie jun 7 17:04:23 CEST 2002 i686

2002-06-10 18:51:02

by Mike Kravetz

Subject: Re: Scheduler Bug (set_cpus_allowed)

Robert,

It looks like you got the set_cpus_allowed() optimization removed
in the 2.5.21 release. Thanks!

However, even with this code removed there are still some races in
the scheduler. Specifically, load_balance() can race with schedule()
in the code around the call to context_switch(). Below is a patch
that addresses these races. It may not be the most elegant solution,
but it seemed the most straightforward. It also reintroduces the
set_cpus_allowed() optimization, but my main concern is with the
other races. Whether or not the optimization goes in is no big deal.

--
Mike


diff -Naur linux-2.5.21/kernel/sched.c linux-2.5.21-fix/kernel/sched.c
--- linux-2.5.21/kernel/sched.c Sun Jun 9 05:28:13 2002
+++ linux-2.5.21-fix/kernel/sched.c Mon Jun 10 18:26:55 2002
@@ -138,7 +138,7 @@
spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
- task_t *curr, *idle;
+ task_t *prev, *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
task_t *migration_thread;
@@ -152,6 +152,7 @@
#define task_rq(p) cpu_rq((p)->thread_info->cpu)
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
+#define running_on_rq(p, rq) (((p) == (rq)->curr) || ((p) == (rq)->prev))

static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
@@ -284,12 +285,12 @@
repeat:
preempt_disable();
rq = task_rq(p);
- while (unlikely(rq->curr == p)) {
+ while (unlikely(running_on_rq(p, rq))) {
cpu_relax();
barrier();
}
rq = task_rq_lock(p, &flags);
- if (unlikely(rq->curr == p)) {
+ if (unlikely(running_on_rq(p, rq))) {
task_rq_unlock(rq, &flags);
preempt_enable();
goto repeat;
@@ -604,7 +605,7 @@

#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
- ((p) != (rq)->curr) && \
+ (!running_on_rq(p, rq)) && \
((p)->cpus_allowed & (1UL << (this_cpu))))

if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
@@ -827,6 +828,7 @@

if (likely(prev != next)) {
rq->nr_switches++;
+ rq->prev = prev;
rq->curr = next;
spin_lock(&rq->frozen);
spin_unlock(&rq->lock);
@@ -840,6 +842,7 @@
*/
mb();
rq = this_rq();
+ rq->prev = NULL;
spin_unlock_irq(&rq->frozen);
} else {
spin_unlock_irq(&rq->lock);
@@ -1687,6 +1690,16 @@
task_rq_unlock(rq, &flags);
goto out;
}
+
+ /*
+ * If the task is not on a runqueue (and not running), then
+ * it is sufficient to simply update the task's cpu field.
+ */
+ if (!p->array && !running_on_rq(p, rq)) {
+ p->thread_info->cpu = __ffs(p->cpus_allowed);
+ task_rq_unlock(rq, &flags);
+ goto out;
+ }

init_MUTEX_LOCKED(&req.sem);
req.task = p;

2002-06-10 20:04:44

by Ingo Molnar

Subject: Re: Scheduler Bug (set_cpus_allowed)


Mike,

On Mon, 10 Jun 2002, Mike Kravetz wrote:

> However, even with this code removed there are still some races in the
> scheduler. Specifically, load_balance() can race with schedule() in the
> code around the call to context_switch(). [...]

the fundamental problem is the dropping of the runqueue lock before doing
the context_switch(), which leaves the scheduler data structures in an
inconsistent state. Reintroducing rq->prev works, we had something like
that in 2.4, but i'd like to avoid such complexity as much as possible.

the fundamental problem is that the current rq->frozen code is broken,
unfortunately. It simply does not work, and in fact it never worked since
it was introduced in 2.5.7. Nothing synchronizes on rq->frozen, only the
rq-local CPU locks/unlocks it, so it has *zero* functional effect, it's an
expensive no-op.

I've removed it from the scheduler in the attached patch (against 2.5.21),
which in turn should also fix the race(s) Mike noticed.

the rq->frozen code was added for the sake of the Sparc port, which has a
legitimate reason: it wants to do some more complex things (like calling into
the scheduler ...) in context_switch(). But i really dislike the effects
of this: it makes the context switch non-atomic, which triggers a set of
races and complexities.

David, would it be possible to somehow not recurse back into the scheduler
code (like wakeup) from within the port-specific switch_to() code? If
something like that absolutely has to be done then a better solution would
be to e.g. introduce an arch-optional post-context-switch callback of
sorts, which would happen with the runqueue lock dropped. The overhead of
this solution can be controlled and it has no impact on the design of the
scheduler. No coding is needed on your side, i'll code it all up if it
solves your problems.
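
A minimal sketch of one shape such a hook could take (the names here are
hypothetical, nothing like this exists in 2.5.21):

#ifndef arch_post_context_switch
# define arch_post_context_switch(prev) do { } while (0)  /* default: no-op */
#endif

	...
	context_switch(prev, next);
	mb();
	rq = this_rq();
	spin_unlock_irq(&rq->lock);
	/*
	 * The runqueue lock is dropped at this point, so the arch callback
	 * may call back into the scheduler (e.g. do a wakeup) safely.
	 */
	arch_post_context_switch(prev);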

Ingo

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.493 -> 1.494
# kernel/sched.c 1.83 -> 1.84
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/06/10 [email protected] 1.494
# - get rid of rq->frozen, fix context switch races.
# --------------------------------------------
#
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Mon Jun 10 22:01:41 2002
+++ b/kernel/sched.c Mon Jun 10 22:01:41 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -403,7 +402,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +827,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -840,10 +836,8 @@
*/
mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
@@ -1599,7 +1593,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {

2002-06-10 20:15:16

by Ingo Molnar

Subject: Re: Scheduler Bug (set_cpus_allowed)


On Fri, 7 Jun 2002, Mike Kravetz wrote:

> > > Consider the case where a task is to give up the CPU
> > > and schedule() is called. In such a case the current
> > > task is removed from the runqueue (via deactivate_task).
> > > Now, further assume that there are no runnable tasks
> > > on the runqueue and we end up calling load_balance().
> > > In load_balance, we potentially drop the runqueue lock
> > > to obtain multiple runqueue locks in the proper order.
> > > Now, when we drop the runqueue lock we will still
> > > be running in the context of task p. However, p does
> > > not reside on a runqueue. It is now possible for
> > > set_cpus_allowed() to be called for p. We can get the
> > > runqueue lock and take the fast path to simply update
> > > the task's cpu field. If this happens, bad (very bad)
> > > things will happen!
> >
> > Ugh I think you are right. This is an incredibly small race, though!
>
> I agree that the race is small. I 'found' this while playing with some
> experimental code that does the same thing as the fast path in
> set_cpus_allowed: it sets the cpu field while holding the rq lock if the
> task is not on the rq. This code runs at a higher frequency (than
> would be expected of set_cpus_allowed) and found this hole.

i agree that this is a subtle issue. My first version of the migration
code did something like this - it's a natural desire to manually migrate
the target task (there are various levels of doing this - the very first
version of the code directly unlinked the thread from the current runqueue
and linked it into the target runqueue), instead of having to switch to a
migration-thread.

the fundamental reason for this fragility is the following: the room to
move the concept of migration into the O(1) design is very very small, if
the condition is to not increase the cost of the scheduler fastpath.

the only robust way i found was to use a highprio helper thread which
naturally unschedules the current task. Attempts to somehow unschedule a
to-be-migrated task without having to switch into the helper thread turned
out to be problematic.
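
For contrast, the slow path through the helper thread looks roughly like
this in set_cpus_allowed() (pieced together from the fragments visible in
the patches in this thread; the ordering is paraphrased):

	migration_req_t req;
	...
	/* queue a migration request on the task's current runqueue */
	init_MUTEX_LOCKED(&req.sem);
	req.task = p;
	list_add(&req.list, &rq->migration_queue);
	task_rq_unlock(rq, &flags);

	/* the highprio per-CPU migration thread performs the actual move */
	wake_up_process(rq->migration_thread);
	down(&req.sem);			/* wait until p has been migrated */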

Ingo

2002-06-10 20:58:06

by Mike Kravetz

Subject: Re: Scheduler Bug (set_cpus_allowed)

On Mon, Jun 10, 2002 at 10:12:46PM +0200, Ingo Molnar wrote:
>
> i agree that this is a subtle issue. My first version of the migration
> code did something like this - it's a natural desire to manually migrate
> the target task (there are various levels of doing this - the very first
> version of the code directly unlinked the thread from the current runqueue
> and linked it into the target runqueue), instead of having to switch to a
> migration-thread.
>
> the fundamental reason for this fragility is the following: the room to
> move the concept of migration into the O(1) design is very very small, if
> the condition is to not increase the cost of the scheduler fastpath.
>
> the only robust way i found was to use a highprio helper thread which
> naturally unschedules the current task. Attempts to somehow unschedule a
> to-be-migrated task without having to switch into the helper thread turned
> out to be problematic.
>

Ingo, I saw your patch to remove the frozen lock from the scheduler
and agree this is the best way to go. Once this change is made, I
think it is then safe to add a fast path for migration
(to set_cpus_allowed) as:

	/*
	 * If the task is not on a runqueue (and not running), then
	 * it is sufficient to simply update the task's cpu field.
	 */
	if (!p->array && (p != rq->curr)) {
		p->thread_info->cpu = __ffs(p->cpus_allowed);
		task_rq_unlock(rq, &flags);
		goto out;
	}

Would you agree that this is now safe? My concern is not so
much with the performance of set_cpus_allowed, but rather in
using the same concept to 'move' tasks in this state.

Consider the '__wake_up_sync' functionality that existed in the
old scheduler for pipes. One result of __wake_up_sync is that
the reader and writer of the pipe were scheduled on the same
CPU. This seemed to help with pipe bandwidth. Perhaps we could
add code something like the above to wake up a task on a specific
CPU. This could be used in VERY VERY specific cases (such as
blocking reader/writer on pipe) to increase performance.
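
A minimal sketch of the idea (a hypothetical fragment, not the patch that
follows; Ingo's patch further down folds this into try_to_wake_up() instead):

	/* inside a "sync" wakeup: pull a sleeping task onto the waker's CPU */
	rq = task_rq_lock(p, &flags);
	if (!p->array && (p != rq->curr) &&
	    (p->thread_info->cpu != smp_processor_id())) {
		/* safe for the same reason as the fast path above */
		p->thread_info->cpu = smp_processor_id();
		task_rq_unlock(rq, &flags);
		rq = task_rq_lock(p, &flags);	/* task_rq(p) has changed */
	}
	if (!p->array)
		activate_task(p, rq);	/* queued on the waker's runqueue */
	p->state = TASK_RUNNING;
	task_rq_unlock(rq, &flags);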

--
Mike

2002-06-10 22:37:55

by Ingo Molnar

Subject: Re: Scheduler Bug (set_cpus_allowed)


On Mon, 10 Jun 2002, Mike Kravetz wrote:

> Ingo, I saw your patch to remove the frozen lock from the scheduler and
> agree this is the best way to go. Once this change is made, I think it
> is then safe to add a fast path for migration (to set_cpus_allowed) as:
>
> 	/*
> 	 * If the task is not on a runqueue (and not running), then
> 	 * it is sufficient to simply update the task's cpu field.
> 	 */
> 	if (!p->array && (p != rq->curr)) {
> 		p->thread_info->cpu = __ffs(p->cpus_allowed);
> 		task_rq_unlock(rq, &flags);
> 		goto out;
> 	}

yes, if the task is absolutely not running and not even in any runqueue
then indeed we can do this - the scheduler doesn't even know that this task
exists.

The test for 'is this task running or runnable' is interesting. The
!p->array condition covers 99.9% of the cases. But the (p != rq->curr)
test is needed to shield against the following rare path: freshly
deactivated tasks that call into load_balance() might unlock the current
runqueue to acquire runqueue locks in order. We could get rid of the (p !=
rq->curr) test if we deactivated tasks *after* calling load_balance(), but
this is problematic because load_balance() needs to know whether the
current task is going to stay runnable or not.

> Would you agree that this is now safe? My concern is not so much with
> the performance of set_cpus_allowed, but rather in using the same
> concept to 'move' tasks in this state.

this code should be safe: we lock the 'potential' runqueue of the
task first, via task_rq_lock(), then we set p->thread_info->cpu to a
different CPU. The only requirement i can see is that this setting of
p->thread_info->cpu must be SMP-atomic - which it currently is.

> Consider the '__wake_up_sync' functionality that existed in the old
> scheduler for pipes. One result of __wake_up_sync is that the reader
> and writer of the pipe were scheduled on the same CPU. This seemed to
> help with pipe bandwidth. Perhaps we could add code something like the
> above to wakeup a task on a specific CPU. This could be used in VERY
> VERY specific cases (such as blocking reader/writer on pipe) to increase
> performance.

agreed. I removed the _sync code mainly because there was no
idle_resched() to migrate a task actively, and the migration bits i tried
were incomplete. But with your above conditional it should cover all the
practical cases we care about, in an elegant way.

i ported your sync wakeup resurrection patch to 2.5.21 (attached). I did
some modifications:

- wake_up() needs to check (rq->curr != p) as well, not only !p->array.

- make __wake_up_sync dependent on CONFIG_SMP

- export __wake_up_sync().

(the attached patch includes both the ->frozen change plus the sync wakeup
resurrection, it's against vanilla 2.5.21.)

appears to work for me just fine (compiles, boots and works under SMP & UP
alike), and does the trick for bw_pipe and lat_pipe. Comments?

Ingo

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.493 ->
# fs/pipe.c 1.12 -> 1.13
# kernel/ksyms.c 1.96 -> 1.97
# include/linux/sched.h 1.65 -> 1.66
# kernel/sched.c 1.83 -> 1.85
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/06/10 [email protected] 1.494
# - get rid of rq->frozen, fix context switch races.
# --------------------------------------------
# 02/06/11 [email protected] 1.495
# - put the sync wakeup feature back in, based on Mike Kravetz's patch.
# --------------------------------------------
#
diff -Nru a/fs/pipe.c b/fs/pipe.c
--- a/fs/pipe.c Tue Jun 11 00:34:25 2002
+++ b/fs/pipe.c Tue Jun 11 00:34:25 2002
@@ -119,7 +119,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
if (!PIPE_EMPTY(*inode))
BUG();
@@ -219,7 +219,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Jun 11 00:34:25 2002
+++ b/include/linux/sched.h Tue Jun 11 00:34:25 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Tue Jun 11 00:34:25 2002
+++ b/kernel/ksyms.c Tue Jun 11 00:34:25 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Jun 11 00:34:25 2002
+++ b/kernel/sched.c Tue Jun 11 00:34:25 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -322,31 +321,43 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
unsigned long flags;
int success = 0;
long old_state;
runqueue_t *rq;

+repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- p->state = TASK_RUNNING;
if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
+ p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
@@ -403,7 +414,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +839,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -840,10 +848,8 @@
*/
mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
@@ -880,7 +886,7 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
struct list_head *tmp;
unsigned int state;
@@ -891,7 +897,7 @@
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -905,17 +911,36 @@
return;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1599,7 +1624,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {

2002-06-10 23:16:39

by Mike Kravetz

Subject: Re: Scheduler Bug (set_cpus_allowed)

On Tue, Jun 11, 2002 at 12:35:38AM +0200, Ingo Molnar wrote:
>
> agreed. I removed the _sync code mainly because there was no
> idle_resched() to migrate a task actively, and the migration bits i tried
> were incomplete. But with your above conditional it should cover all the
> practical cases we care about, in an elegant way.
>
> i ported your sync wakeup resurrection patch to 2.5.21 (attached). I did
> some modifications:
>
> - wake_up() needs to check (rq->curr != p) as well, not only !p->array.
>
> - make __wake_up_sync dependent on CONFIG_SMP
>
> - export __wake_up_sync().
>
> (the attached patch includes both the ->frozen change plus the sync wakeup
> resurrection, it's against vanilla 2.5.21.)
>
> appears to work for me just fine (compiles, boots and works under SMP & UP
> alike), and does the trick for bw_pipe and lat_pipe. Comments?
>
> Ingo

Great! Thanks!

You might also consider adding the optimization/fast path to
set_cpus_allowed(). Once again, I don't expect this routine
(or this code path) to be used much, but I just hate to see
us schedule a migration task to set the cpu field when it is
safe to do it within the routine.

--
Mike

2002-06-10 23:26:18

by Ingo Molnar

Subject: Re: Scheduler Bug (set_cpus_allowed)


On Mon, 10 Jun 2002, Mike Kravetz wrote:

> You might also consider adding the optimization/fast path to
> set_cpus_allowed(). Once again, I don't expect this routine (or this
> code path) to be used much, but I just hate to see us schedule a
> migration task to set the cpu field when it is safe to do it within the
> routine.

sure, agreed. I've added it to my tree.

Ingo

2002-06-10 23:28:17

by Robert Love

Subject: Re: Scheduler Bug (set_cpus_allowed)

On Mon, 2002-06-10 at 16:24, Ingo Molnar wrote:

> sure, agreed. I've added it to my tree.

What do you think of this?

No more explicit preempt disables...

Robert Love

diff -urN linux-2.5.21/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.21/kernel/sched.c Sat Jun 8 22:28:13 2002
+++ linux/kernel/sched.c Sun Jun 9 13:01:32 2002
@@ -153,17 +153,22 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ */
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;

repeat_lock_task:
- preempt_disable();
+ local_irq_save(*flags);
rq = task_rq(p);
- spin_lock_irqsave(&rq->lock, *flags);
+ spin_lock(&rq->lock);
if (unlikely(rq != task_rq(p))) {
- spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+ spin_unlock(&rq->lock);
+ local_irq_restore(*flags);
goto repeat_lock_task;
}
return rq;
@@ -171,8 +176,25 @@

static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
- spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+ spin_unlock(&rq->lock);
+ local_irq_restore(*flags);
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+ local_irq_disable();
+ rq = this_rq();
+ spin_lock(&rq->lock);
+ return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+ spin_unlock(&rq->lock);
+ local_irq_enable();
}

/*
@@ -353,9 +375,7 @@
{
runqueue_t *rq;

- preempt_disable();
- rq = this_rq();
- spin_lock_irq(&rq->lock);
+ rq = rq_lock(rq);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -371,8 +391,7 @@
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);

- spin_unlock_irq(&rq->lock);
- preempt_enable();
+ rq_unlock(rq);
}

/*
@@ -1342,8 +1361,7 @@
runqueue_t *rq;
prio_array_t *array;

- preempt_disable();
- rq = this_rq();
+ rq = rq_lock(rq);

/*
* Decrease the yielding task's priority by one, to avoid
@@ -1353,7 +1371,6 @@
* If priority is already MAX_PRIO-1 then we still
* roundrobin the task within the runlist.
*/
- spin_lock_irq(&rq->lock);
array = current->array;
/*
* If the task has reached maximum priority (or is a RT task)
@@ -1371,7 +1388,6 @@
__set_bit(current->prio, array->bitmap);
}
spin_unlock(&rq->lock);
- preempt_enable_no_resched();

schedule();



2002-06-11 00:07:42

by Ingo Molnar

Subject: [patch] current scheduler bits, 2.5.21


On 10 Jun 2002, Robert Love wrote:

> > sure, agreed. I've added it to my tree.
>
> What do you think of this?
>
> No more explicit preempt disables...

Thanks. I've applied your patch with the following additional
improvements:

- The spin_unlock_irqrestore path does not have to be split up like the
spin_lock path: spin_unlock() + local_irq_restore() ==
spin_unlock_irqrestore() ... this is true both in task_rq_unlock() and
rq_unlock() code.

- in sys_sched_yield() you removed an optimization: the final spin_unlock
does not have to check for resched explicitly, we'll call into
schedule() anyway. I've introduced a new spin_unlock variant:
spin_unlock_no_resched(), which uses preempt_enable_no_resched().

otherwise it's looking good. The attached patch is my current tree which
includes the rq-lock/unlock optimization plus the previous patches
(race-fix and sync-wakeup), against 2.5.21-vanilla.

Ingo

diff -Nru a/fs/pipe.c b/fs/pipe.c
--- a/fs/pipe.c Tue Jun 11 02:02:56 2002
+++ b/fs/pipe.c Tue Jun 11 02:02:56 2002
@@ -119,7 +119,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
if (!PIPE_EMPTY(*inode))
BUG();
@@ -219,7 +219,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Jun 11 02:02:56 2002
+++ b/include/linux/sched.h Tue Jun 11 02:02:56 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -Nru a/include/linux/spinlock.h b/include/linux/spinlock.h
--- a/include/linux/spinlock.h Tue Jun 11 02:02:56 2002
+++ b/include/linux/spinlock.h Tue Jun 11 02:02:56 2002
@@ -157,6 +157,12 @@
preempt_enable(); \
} while (0)

+#define spin_unlock_no_resched(lock) \
+do { \
+ _raw_spin_unlock(lock); \
+ preempt_enable_no_resched(); \
+} while (0)
+
#define read_lock(lock) ({preempt_disable(); _raw_read_lock(lock);})
#define read_unlock(lock) ({_raw_read_unlock(lock); preempt_enable();})
#define write_lock(lock) ({preempt_disable(); _raw_write_lock(lock);})
@@ -166,20 +172,21 @@

#else

-#define preempt_get_count() (0)
-#define preempt_disable() do { } while (0)
+#define preempt_get_count() (0)
+#define preempt_disable() do { } while (0)
#define preempt_enable_no_resched() do {} while(0)
-#define preempt_enable() do { } while (0)
+#define preempt_enable() do { } while (0)

-#define spin_lock(lock) _raw_spin_lock(lock)
-#define spin_trylock(lock) _raw_spin_trylock(lock)
-#define spin_unlock(lock) _raw_spin_unlock(lock)
+#define spin_lock(lock) _raw_spin_lock(lock)
+#define spin_trylock(lock) _raw_spin_trylock(lock)
+#define spin_unlock(lock) _raw_spin_unlock(lock)
+#define spin_unlock_no_resched(lock) _raw_spin_unlock(lock)

-#define read_lock(lock) _raw_read_lock(lock)
-#define read_unlock(lock) _raw_read_unlock(lock)
-#define write_lock(lock) _raw_write_lock(lock)
-#define write_unlock(lock) _raw_write_unlock(lock)
-#define write_trylock(lock) _raw_write_trylock(lock)
+#define read_lock(lock) _raw_read_lock(lock)
+#define read_unlock(lock) _raw_read_unlock(lock)
+#define write_lock(lock) _raw_write_lock(lock)
+#define write_unlock(lock) _raw_write_unlock(lock)
+#define write_trylock(lock) _raw_write_trylock(lock)
#endif

/* "lock on reference count zero" */
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Tue Jun 11 02:02:56 2002
+++ b/kernel/ksyms.c Tue Jun 11 02:02:56 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Jun 11 02:02:56 2002
+++ b/kernel/sched.c Tue Jun 11 02:02:56 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -153,17 +152,21 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ */
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;

repeat_lock_task:
- preempt_disable();
+ local_irq_save(*flags);
rq = task_rq(p);
- spin_lock_irqsave(&rq->lock, *flags);
+ spin_lock(&rq->lock);
if (unlikely(rq != task_rq(p))) {
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
goto repeat_lock_task;
}
return rq;
@@ -172,7 +175,23 @@
static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+ local_irq_disable();
+ rq = this_rq();
+ spin_lock(&rq->lock);
+ return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+ spin_unlock(&rq->lock);
+ local_irq_enable();
}

/*
@@ -322,40 +341,50 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
unsigned long flags;
int success = 0;
long old_state;
runqueue_t *rq;

+repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- p->state = TASK_RUNNING;
if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
+ p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
{
runqueue_t *rq;

- preempt_disable();
- rq = this_rq();
- spin_lock_irq(&rq->lock);
+ rq = rq_lock(rq);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -371,8 +400,7 @@
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);

- spin_unlock_irq(&rq->lock);
- preempt_enable();
+ rq_unlock(rq);
}

/*
@@ -403,7 +431,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +856,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -840,10 +865,8 @@
*/
mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
@@ -880,7 +903,7 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
struct list_head *tmp;
unsigned int state;
@@ -891,7 +914,7 @@
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -905,17 +928,36 @@
return;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1342,8 +1384,7 @@
runqueue_t *rq;
prio_array_t *array;

- preempt_disable();
- rq = this_rq();
+ rq = rq_lock(rq);

/*
* Decrease the yielding task's priority by one, to avoid
@@ -1353,7 +1394,6 @@
* If priority is already MAX_PRIO-1 then we still
* roundrobin the task within the runlist.
*/
- spin_lock_irq(&rq->lock);
array = current->array;
/*
* If the task has reached maximum priority (or is a RT task)
@@ -1370,8 +1410,7 @@
list_add_tail(&current->run_list, array->queue + current->prio);
__set_bit(current->prio, array->bitmap);
}
- spin_unlock(&rq->lock);
- preempt_enable_no_resched();
+ spin_unlock_no_resched(&rq->lock);

schedule();

@@ -1599,7 +1638,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {

2002-06-11 00:27:38

by Robert Love

Subject: Re: [patch] current scheduler bits, 2.5.21

On Mon, 2002-06-10 at 17:05, Ingo Molnar wrote:

> Thanks. I've applied your patch with the following additional
> improvements:
>
> - The spin_unlock_irqrestore path does not have to be split up like the
> spin_lock path: spin_unlock() + local_irq_restore() ==
> spin_unlock_irqrestore() ... this is true both in task_rq_unlock() and
> rq_unlock() code.

I know. The reason I split them up is to maintain consistency in
the way we lock vs. unlock and enable vs. disable interrupts. Partly for
style, partly in case we ever decide to hook the different calls in a
different manner.

I do not see this in your patch, though ...

> - in sys_sched_yield() you removed an optimization: the final spin_unlock
> does not have to check for resched explicitly, we'll call into
> schedule() anyway. I've introduced a new spin_unlock variant:
> spin_unlock_no_resched(), which uses preempt_enable_no_resched().

Ah yes, very good. I was too busy noticing the optimization I _did_ put
in: not calling rq_unlock here as we can just leave interrupts disabled
on return...

Very good.

> otherwise it's looking good. The attached patch is my current tree which
> includes the rq-lock/unlock optimization plus the previous patches
> (race-fix and sync-wakeup), against 2.5.21-vanilla.

Excellent.

Robert Love

2002-06-11 17:37:53

by Ingo Molnar

Subject: [patch] current scheduler bits #2, 2.5.21


On 10 Jun 2002, Robert Love wrote:

> > - The spin_unlock_irqrestore path does not have to be split up like the
> > spin_lock path: spin_unlock() + local_irq_restore() ==
> > spin_unlock_irqrestore() ... this is true both in task_rq_unlock() and
> > rq_unlock() code.
>
> I know. The reason I split them up is to maintain consistency through
> the way we lock vs unlock and enable vs disable interrupts. Partly for
> style, partly in case we ever decide to hook the different calls in a
> different manner.

well, i'm simply using a shorter code form, it's equivalent.

anyway, my current tree has a new type of optimization which again changes
the way task_rq_lock() and task_rq_unlock() look: in this specific case we
can rely on __cli() disabling preemption, so we do not need to elevate the
preemption count. (Some special care has to be taken to not cause a
preemption with the raw rq spinlock held, which the patch takes care of.)

and i found and fixed a preemption latency source in wait_task_inactive().
That function can create *very* bad latencies if the target task happens
to not go away from its CPU for many milliseconds (for whatever reason).
So we should and can check the resched bit.

the full patch is attached, against vanilla 2.5.21.

Ingo

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.453 ->
# fs/pipe.c 1.12 -> 1.13
# kernel/ksyms.c 1.96 -> 1.97
# include/linux/sched.h 1.65 -> 1.66
# kernel/sched.c 1.83 -> 1.89
# include/linux/spinlock.h 1.11 -> 1.13
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/06/10 [email protected] 1.447.16.1
# - get rid of rq->frozen, fix context switch races.
# --------------------------------------------
# 02/06/11 [email protected] 1.447.16.2
# - put the sync wakeup feature back in, based on Mike Kravetz's patch.
# --------------------------------------------
# 02/06/11 [email protected] 1.447.16.3
# - rq-lock optimization in the preemption case, from Robert Love, plus some more cleanups.
# --------------------------------------------
# 02/06/11 [email protected] 1.455
# - set_cpus_allowed() optimization from Mike Kravetz: we can
# set p->thread_info->cpu directly if the task is not running
# and is not on any runqueue.
# --------------------------------------------
# 02/06/11 [email protected] 1.456
# - wait_task_inactive() preemption-latency optimization: we should
# enable/disable preemption to not spend too much time with
# preemption disabled. wait_task_inactive() can take quite some
# time occasionally ...
# --------------------------------------------
# 02/06/11 [email protected] 1.457
# - squeeze a few more cycles out of the wakeup hotpath.
# --------------------------------------------
#
diff -Nru a/fs/pipe.c b/fs/pipe.c
--- a/fs/pipe.c Tue Jun 11 19:27:46 2002
+++ b/fs/pipe.c Tue Jun 11 19:27:46 2002
@@ -119,7 +119,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
if (!PIPE_EMPTY(*inode))
BUG();
@@ -219,7 +219,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Jun 11 19:27:46 2002
+++ b/include/linux/sched.h Tue Jun 11 19:27:46 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -Nru a/include/linux/spinlock.h b/include/linux/spinlock.h
--- a/include/linux/spinlock.h Tue Jun 11 19:27:46 2002
+++ b/include/linux/spinlock.h Tue Jun 11 19:27:46 2002
@@ -26,6 +26,7 @@
#define write_lock_bh(lock) do { local_bh_disable(); write_lock(lock); } while (0)

#define spin_unlock_irqrestore(lock, flags) do { spin_unlock(lock); local_irq_restore(flags); } while (0)
+#define _raw_spin_unlock_irqrestore(lock, flags) do { _raw_spin_unlock(lock); local_irq_restore(flags); } while (0)
#define spin_unlock_irq(lock) do { spin_unlock(lock); local_irq_enable(); } while (0)
#define spin_unlock_bh(lock) do { spin_unlock(lock); local_bh_enable(); } while (0)

@@ -143,6 +144,12 @@
preempt_schedule(); \
} while (0)

+#define preempt_check_resched() \
+do { \
+ if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ preempt_schedule(); \
+} while (0)
+
#define spin_lock(lock) \
do { \
preempt_disable(); \
@@ -157,6 +164,12 @@
preempt_enable(); \
} while (0)

+#define spin_unlock_no_resched(lock) \
+do { \
+ _raw_spin_unlock(lock); \
+ preempt_enable_no_resched(); \
+} while (0)
+
#define read_lock(lock) ({preempt_disable(); _raw_read_lock(lock);})
#define read_unlock(lock) ({_raw_read_unlock(lock); preempt_enable();})
#define write_lock(lock) ({preempt_disable(); _raw_write_lock(lock);})
@@ -166,20 +179,22 @@

#else

-#define preempt_get_count() (0)
-#define preempt_disable() do { } while (0)
+#define preempt_get_count() (0)
+#define preempt_disable() do { } while (0)
#define preempt_enable_no_resched() do {} while(0)
-#define preempt_enable() do { } while (0)
+#define preempt_enable() do { } while (0)
+#define preempt_check_resched() do { } while (0)

-#define spin_lock(lock) _raw_spin_lock(lock)
-#define spin_trylock(lock) _raw_spin_trylock(lock)
-#define spin_unlock(lock) _raw_spin_unlock(lock)
-
-#define read_lock(lock) _raw_read_lock(lock)
-#define read_unlock(lock) _raw_read_unlock(lock)
-#define write_lock(lock) _raw_write_lock(lock)
-#define write_unlock(lock) _raw_write_unlock(lock)
-#define write_trylock(lock) _raw_write_trylock(lock)
+#define spin_lock(lock) _raw_spin_lock(lock)
+#define spin_trylock(lock) _raw_spin_trylock(lock)
+#define spin_unlock(lock) _raw_spin_unlock(lock)
+#define spin_unlock_no_resched(lock) _raw_spin_unlock(lock)
+
+#define read_lock(lock) _raw_read_lock(lock)
+#define read_unlock(lock) _raw_read_unlock(lock)
+#define write_lock(lock) _raw_write_lock(lock)
+#define write_unlock(lock) _raw_write_unlock(lock)
+#define write_trylock(lock) _raw_write_trylock(lock)
#endif

/* "lock on reference count zero" */
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Tue Jun 11 19:27:46 2002
+++ b/kernel/ksyms.c Tue Jun 11 19:27:46 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Jun 11 19:27:46 2002
+++ b/kernel/sched.c Tue Jun 11 19:27:46 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -153,17 +152,27 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ *
+ * WARNING: to squeeze out a few more cycles we do not disable preemption
+ * explicitly (or implicitly), we just keep interrupts disabled. This means
+ * that within task_rq_lock/unlock sections you must be careful
+ * about locking/unlocking spinlocks, since they could cause an unexpected
+ * preemption.
+ */
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;

repeat_lock_task:
- preempt_disable();
+ local_irq_save(*flags);
rq = task_rq(p);
- spin_lock_irqsave(&rq->lock, *flags);
+ _raw_spin_lock(&rq->lock);
if (unlikely(rq != task_rq(p))) {
- spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+ _raw_spin_unlock_irqrestore(&rq->lock, *flags);
goto repeat_lock_task;
}
return rq;
@@ -171,8 +180,24 @@

static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
- spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+ _raw_spin_unlock_irqrestore(&rq->lock, *flags);
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+ local_irq_disable();
+ rq = this_rq();
+ spin_lock(&rq->lock);
+ return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+ spin_unlock(&rq->lock);
+ local_irq_enable();
}

/*
@@ -263,8 +288,15 @@
nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG);

if (!need_resched && !nrpolling && (p->thread_info->cpu != smp_processor_id()))
+ /*
+ * NOTE: smp_send_reschedule() can be called from
+ * spinlocked sections which do not have an elevated
+ * preemption count. So the code either has to avoid
+ * spinlocks, or has to put preempt_disable() and
+ * preempt_enable_no_resched() around the code.
+ */
smp_send_reschedule(p->thread_info->cpu);
- preempt_enable();
+ preempt_enable_no_resched();
#else
set_tsk_need_resched(p);
#endif
@@ -284,9 +316,15 @@
repeat:
preempt_disable();
rq = task_rq(p);
- while (unlikely(rq->curr == p)) {
+ if (unlikely(rq->curr == p)) {
cpu_relax();
- barrier();
+ /*
+ * enable/disable preemption just to make this
+ * a preemption point - we are busy-waiting
+ * anyway.
+ */
+ preempt_enable();
+ goto repeat;
}
rq = task_rq_lock(p, &flags);
if (unlikely(rq->curr == p)) {
@@ -309,8 +347,10 @@
*/
void kick_if_running(task_t * p)
{
- if (p == task_rq(p)->curr)
+ if (p == task_rq(p)->curr) {
resched_task(p);
+ preempt_check_resched();
+ }
}
#endif

@@ -322,40 +362,50 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
unsigned long flags;
int success = 0;
long old_state;
runqueue_t *rq;

+repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- p->state = TASK_RUNNING;
if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
+ p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
{
runqueue_t *rq;

- preempt_disable();
- rq = this_rq();
- spin_lock_irq(&rq->lock);
+ rq = rq_lock(rq);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -371,8 +421,7 @@
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);

- spin_unlock_irq(&rq->lock);
- preempt_enable();
+ rq_unlock(rq);
}

/*
@@ -403,7 +452,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +877,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -840,10 +886,8 @@
*/
mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
@@ -880,7 +924,7 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
struct list_head *tmp;
unsigned int state;
@@ -891,7 +935,7 @@
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -905,17 +949,36 @@
return;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1342,8 +1405,7 @@
runqueue_t *rq;
prio_array_t *array;

- preempt_disable();
- rq = this_rq();
+ rq = rq_lock(rq);

/*
* Decrease the yielding task's priority by one, to avoid
@@ -1353,7 +1415,6 @@
* If priority is already MAX_PRIO-1 then we still
* roundrobin the task within the runlist.
*/
- spin_lock_irq(&rq->lock);
array = current->array;
/*
* If the task has reached maximum priority (or is a RT task)
@@ -1370,8 +1431,7 @@
list_add_tail(&current->run_list, array->queue + current->prio);
__set_bit(current->prio, array->bitmap);
}
- spin_unlock(&rq->lock);
- preempt_enable_no_resched();
+ spin_unlock_no_resched(&rq->lock);

schedule();

@@ -1599,7 +1659,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {
@@ -1687,7 +1746,15 @@
task_rq_unlock(rq, &flags);
goto out;
}
-
+ /*
+ * If the task is not on a runqueue (and not running), then
+ * it is sufficient to simply update the task's cpu field.
+ */
+ if (!p->array && (p != rq->curr)) {
+ p->thread_info->cpu = __ffs(p->cpus_allowed);
+ task_rq_unlock(rq, &flags);
+ goto out;
+ }
init_MUTEX_LOCKED(&req.sem);
req.task = p;
list_add(&req.list, &rq->migration_queue);

2002-06-11 18:25:18

by Robert Love

[permalink] [raw]
Subject: Re: [patch] current scheduler bits #2, 2.5.21

On Tue, 2002-06-11 at 10:35, Ingo Molnar wrote:

> anyway, my current tree has a new type of optimization which again changes
> the way task_rq_lock() and task_rq_unlock() looks like: in this specific
> case we can rely on __cli() disabling preemption, we do not need to
> elevate the preemption count in this case. (Some special care has to be
> taken to not cause a preemption with the raw rq spinlock held, which the
> patch does.)

Ow, I thought about doing this but decided against it. If we merge
this, we have to be _very_ careful. Any code that sets need_resched
explicitly will cause a preemption at the next preempt_enable() (which
happens implicitly at any unlock, etc.). Is this worth it to save an inc/dec?

Hm,

static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
_raw_spin_unlock_irqrestore(&rq->lock, *flags);
}

Don't we want to explicitly check for a preemption here? Using your new
preempt_check_resched() would make sense...
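To make the hazard concrete, here is a minimal sketch (an editorial illustration,
not code from any of the patches; example_path() is a made-up caller) of why nested
locking inside the optimized task_rq_lock()/task_rq_unlock() section is dangerous,
and why an explicit check on unlock matters:

/*
 * Sketch only: with the optimization, task_rq_lock() disables interrupts
 * but never raises the preempt count.
 */
void example_path(task_t *p, spinlock_t *other_lock)
{
	unsigned long flags;
	runqueue_t *rq = task_rq_lock(p, &flags); /* irqs off, preempt count 0 */

	spin_lock(other_lock);   /* preempt_disable() + raw lock */
	/* ... */
	spin_unlock(other_lock); /* preempt_enable(): the count drops back to 0,
				  * so a pending need_resched triggers
				  * preempt_schedule() right here - with
				  * rq->lock still held and irqs disabled. */

	task_rq_unlock(rq, &flags); /* without a preempt_check_resched() here,
				     * a need_resched set inside the section
				     * goes unnoticed until some later
				     * preemption point. */
}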

> and i found and fixed a preemption latency source in wait_task_inactive().
> That function can create *very* bad latencies if the target task happens
> to not go away from its CPU for many milliseconds (for whatever reason).
> So we should and can check the resched bit.

Hm, this may be the cause of the latency issues some have raised.
Dieter Nutzel, for example, has been saying that sometime early in O(1)'s
life it killed latency with the preemptible kernel... this may be it.

I'll check it out. Good job.

Robert Love

2002-06-11 18:36:00

by Ingo Molnar

[permalink] [raw]
Subject: [patch] current scheduler bits #3, 2.5.21


On 11 Jun 2002, Robert Love wrote:

> static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
> {
> _raw_spin_unlock_irqrestore(&rq->lock, *flags);
> }
>
> Don't we want to explicitly check for a preemption here? Using your new
> preempt_check_resched() would make sense...

indeed, good spotting. Fixed this, new patch attached.

Ingo

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.453 ->
# fs/pipe.c 1.12 -> 1.13
# kernel/ksyms.c 1.96 -> 1.97
# include/linux/sched.h 1.65 -> 1.66
# kernel/sched.c 1.83 -> 1.90
# include/linux/spinlock.h 1.11 -> 1.13
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/06/10 [email protected] 1.447.16.1
# - get rid of rq->frozen, fix context switch races.
# --------------------------------------------
# 02/06/11 [email protected] 1.447.16.2
# - put the sync wakeup feature back in, based on Mike Kravetz's patch.
# --------------------------------------------
# 02/06/11 [email protected] 1.447.16.3
# - rq-lock optimization in the preemption case, from Robert Love, plus some more cleanups.
# --------------------------------------------
# 02/06/11 [email protected] 1.454
# Merge k:t into elte.hu:/home/mingo/BK/mine/linux-2.5-latest
# --------------------------------------------
# 02/06/11 [email protected] 1.455
# - set_cpus_allowed() optimization from Mike Kravetz: we can
# set p->thread_info->cpu directly if the task is not running
# and is not on any runqueue.
# --------------------------------------------
# 02/06/11 [email protected] 1.456
# - wait_task_inactive() preemption-latency optimization: we should
# enable/disable preemption to not spend too much time with
# preemption disabled. wait_task_inactive() can take quite some
# time occasionally ...
# --------------------------------------------
# 02/06/11 [email protected] 1.457
# - squeeze a few more cycles out of the wakeup hotpath.
# --------------------------------------------
# 02/06/11 [email protected] 1.458
# - do a manual preemption-check in task_rq_unlock().
# --------------------------------------------
#
diff -Nru a/fs/pipe.c b/fs/pipe.c
--- a/fs/pipe.c Tue Jun 11 20:32:27 2002
+++ b/fs/pipe.c Tue Jun 11 20:32:27 2002
@@ -119,7 +119,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
if (!PIPE_EMPTY(*inode))
BUG();
@@ -219,7 +219,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Jun 11 20:32:27 2002
+++ b/include/linux/sched.h Tue Jun 11 20:32:27 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -Nru a/include/linux/spinlock.h b/include/linux/spinlock.h
--- a/include/linux/spinlock.h Tue Jun 11 20:32:27 2002
+++ b/include/linux/spinlock.h Tue Jun 11 20:32:27 2002
@@ -26,6 +26,7 @@
#define write_lock_bh(lock) do { local_bh_disable(); write_lock(lock); } while (0)

#define spin_unlock_irqrestore(lock, flags) do { spin_unlock(lock); local_irq_restore(flags); } while (0)
+#define _raw_spin_unlock_irqrestore(lock, flags) do { _raw_spin_unlock(lock); local_irq_restore(flags); } while (0)
#define spin_unlock_irq(lock) do { spin_unlock(lock); local_irq_enable(); } while (0)
#define spin_unlock_bh(lock) do { spin_unlock(lock); local_bh_enable(); } while (0)

@@ -143,6 +144,12 @@
preempt_schedule(); \
} while (0)

+#define preempt_check_resched() \
+do { \
+ if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ preempt_schedule(); \
+} while (0)
+
#define spin_lock(lock) \
do { \
preempt_disable(); \
@@ -157,6 +164,12 @@
preempt_enable(); \
} while (0)

+#define spin_unlock_no_resched(lock) \
+do { \
+ _raw_spin_unlock(lock); \
+ preempt_enable_no_resched(); \
+} while (0)
+
#define read_lock(lock) ({preempt_disable(); _raw_read_lock(lock);})
#define read_unlock(lock) ({_raw_read_unlock(lock); preempt_enable();})
#define write_lock(lock) ({preempt_disable(); _raw_write_lock(lock);})
@@ -166,20 +179,22 @@

#else

-#define preempt_get_count() (0)
-#define preempt_disable() do { } while (0)
+#define preempt_get_count() (0)
+#define preempt_disable() do { } while (0)
#define preempt_enable_no_resched() do {} while(0)
-#define preempt_enable() do { } while (0)
+#define preempt_enable() do { } while (0)
+#define preempt_check_resched() do { } while (0)

-#define spin_lock(lock) _raw_spin_lock(lock)
-#define spin_trylock(lock) _raw_spin_trylock(lock)
-#define spin_unlock(lock) _raw_spin_unlock(lock)
-
-#define read_lock(lock) _raw_read_lock(lock)
-#define read_unlock(lock) _raw_read_unlock(lock)
-#define write_lock(lock) _raw_write_lock(lock)
-#define write_unlock(lock) _raw_write_unlock(lock)
-#define write_trylock(lock) _raw_write_trylock(lock)
+#define spin_lock(lock) _raw_spin_lock(lock)
+#define spin_trylock(lock) _raw_spin_trylock(lock)
+#define spin_unlock(lock) _raw_spin_unlock(lock)
+#define spin_unlock_no_resched(lock) _raw_spin_unlock(lock)
+
+#define read_lock(lock) _raw_read_lock(lock)
+#define read_unlock(lock) _raw_read_unlock(lock)
+#define write_lock(lock) _raw_write_lock(lock)
+#define write_unlock(lock) _raw_write_unlock(lock)
+#define write_trylock(lock) _raw_write_trylock(lock)
#endif

/* "lock on reference count zero" */
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Tue Jun 11 20:32:27 2002
+++ b/kernel/ksyms.c Tue Jun 11 20:32:27 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Jun 11 20:32:27 2002
+++ b/kernel/sched.c Tue Jun 11 20:32:27 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -153,17 +152,27 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ *
+ * WARNING: to squeeze out a few more cycles we do not disable preemption
+ * explicitly (or implicitly), we just keep interrupts disabled. This means
+ * that within task_rq_lock/unlock sections you must be careful
+ * about locking/unlocking spinlocks, since they could cause an unexpected
+ * preemption.
+ */
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;

repeat_lock_task:
- preempt_disable();
+ local_irq_save(*flags);
rq = task_rq(p);
- spin_lock_irqsave(&rq->lock, *flags);
+ _raw_spin_lock(&rq->lock);
if (unlikely(rq != task_rq(p))) {
- spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+ _raw_spin_unlock_irqrestore(&rq->lock, *flags);
goto repeat_lock_task;
}
return rq;
@@ -171,8 +180,25 @@

static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
- spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+ _raw_spin_unlock_irqrestore(&rq->lock, *flags);
+ preempt_check_resched();
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+ local_irq_disable();
+ rq = this_rq();
+ spin_lock(&rq->lock);
+ return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+ spin_unlock(&rq->lock);
+ local_irq_enable();
}

/*
@@ -263,8 +289,15 @@
nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG);

if (!need_resched && !nrpolling && (p->thread_info->cpu != smp_processor_id()))
+ /*
+ * NOTE: smp_send_reschedule() can be called from
+ * spinlocked sections which do not have an elevated
+ * preemption count. So the code either has to avoid
+ * spinlocks, or has to put preempt_disable() and
+ * preempt_enable_no_resched() around the code.
+ */
smp_send_reschedule(p->thread_info->cpu);
- preempt_enable();
+ preempt_enable_no_resched();
#else
set_tsk_need_resched(p);
#endif
@@ -284,9 +317,15 @@
repeat:
preempt_disable();
rq = task_rq(p);
- while (unlikely(rq->curr == p)) {
+ if (unlikely(rq->curr == p)) {
cpu_relax();
- barrier();
+ /*
+ * enable/disable preemption just to make this
+ * a preemption point - we are busy-waiting
+ * anyway.
+ */
+ preempt_enable();
+ goto repeat;
}
rq = task_rq_lock(p, &flags);
if (unlikely(rq->curr == p)) {
@@ -309,8 +348,10 @@
*/
void kick_if_running(task_t * p)
{
- if (p == task_rq(p)->curr)
+ if (p == task_rq(p)->curr) {
resched_task(p);
+ preempt_check_resched();
+ }
}
#endif

@@ -322,40 +363,50 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
unsigned long flags;
int success = 0;
long old_state;
runqueue_t *rq;

+repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- p->state = TASK_RUNNING;
if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
+ p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
{
runqueue_t *rq;

- preempt_disable();
- rq = this_rq();
- spin_lock_irq(&rq->lock);
+ rq = rq_lock(rq);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -371,8 +422,7 @@
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);

- spin_unlock_irq(&rq->lock);
- preempt_enable();
+ rq_unlock(rq);
}

/*
@@ -403,7 +453,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +878,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -840,10 +887,8 @@
*/
mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
@@ -880,7 +925,7 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
struct list_head *tmp;
unsigned int state;
@@ -891,7 +936,7 @@
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -905,17 +950,36 @@
return;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1342,8 +1406,7 @@
runqueue_t *rq;
prio_array_t *array;

- preempt_disable();
- rq = this_rq();
+ rq = rq_lock(rq);

/*
* Decrease the yielding task's priority by one, to avoid
@@ -1353,7 +1416,6 @@
* If priority is already MAX_PRIO-1 then we still
* roundrobin the task within the runlist.
*/
- spin_lock_irq(&rq->lock);
array = current->array;
/*
* If the task has reached maximum priority (or is a RT task)
@@ -1370,8 +1432,7 @@
list_add_tail(&current->run_list, array->queue + current->prio);
__set_bit(current->prio, array->bitmap);
}
- spin_unlock(&rq->lock);
- preempt_enable_no_resched();
+ spin_unlock_no_resched(&rq->lock);

schedule();

@@ -1599,7 +1660,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {
@@ -1687,7 +1747,15 @@
task_rq_unlock(rq, &flags);
goto out;
}
-
+ /*
+ * If the task is not on a runqueue (and not running), then
+ * it is sufficient to simply update the task's cpu field.
+ */
+ if (!p->array && (p != rq->curr)) {
+ p->thread_info->cpu = __ffs(p->cpus_allowed);
+ task_rq_unlock(rq, &flags);
+ goto out;
+ }
init_MUTEX_LOCKED(&req.sem);
req.task = p;
list_add(&req.list, &rq->migration_queue);

2002-06-12 00:00:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Scheduler Bug (set_cpus_allowed)

Hello,

On Sat, Jun 08, 2002 at 01:20:04AM +0200, J.A. Magallon wrote:
>
> On 2002.06.07 Robert Love wrote:
> >
> >Anyhow, with this issue, I guess we need to fix it... I'll send a patch
> >to Linus.
> >
>
> Plz, could you also post it in the list ? -aa will need also...

So far I have merged Ingo's fix into my tree; I will review the other
parts soon (at worst next week).

diff -urNp frozen-ref/include/linux/sched_runqueue.h frozen/include/linux/sched_runqueue.h
--- frozen-ref/include/linux/sched_runqueue.h Tue Jun 11 03:03:35 2002
+++ frozen/include/linux/sched_runqueue.h Tue Jun 11 03:00:46 2002
@@ -57,7 +57,6 @@ struct prio_array {
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
long quiescent; /* RCU */
task_t *curr, *idle;
diff -urNp frozen-ref/kernel/sched.c frozen/kernel/sched.c
--- frozen-ref/kernel/sched.c Tue Jun 11 03:03:35 2002
+++ frozen/kernel/sched.c Tue Jun 11 03:01:29 2002
@@ -352,7 +352,7 @@ void sched_exit(task_t * p)
#if CONFIG_SMP
asmlinkage void schedule_tail(task_t *prev)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -762,9 +762,6 @@ switch_tasks:
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -774,10 +771,8 @@ switch_tasks:
*/
smp_mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
if (need_resched())
@@ -1539,7 +1534,6 @@ void __init sched_init(void)
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {

Andrea

2002-06-12 11:58:37

by David Miller

[permalink] [raw]
Subject: Re: Scheduler Bug (set_cpus_allowed)

From: Ingo Molnar <[email protected]>
Date: Mon, 10 Jun 2002 22:02:28 +0200 (CEST)

David, would it be possible to somehow not recurse back into the scheduler
code (like wakeup) from within the port-specific switch_to() code?

It's a locking problem more than anything else. It's not that
we call back into it, it's that we hold locks in switch_mm.

To recap, from the changeset 1.369.49.1:

Fix scheduler deadlock on some platforms.

Some platforms need to grab mm->page_table_lock during switch_mm().
On the other hand code like swap_out() in mm/vmscan.c needs to hold
mm->page_table_lock during wakeups which needs to grab the runqueue
lock. This creates a conflict and the resolution chosen here is to
not hold the runqueue lock during context_switch().

The implementation is specifically a "frozen" state implemented as a
spinlock, which is held around the context_switch() call. This allows
the runqueue lock to be dropped during this time yet prevent another cpu
from running the "not switched away from yet" task.

So if you're going to delete the frozen code, please replace it with
something that handles the above.
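For reference, here is a schematic (editorial sketch, assembled from the changeset
text above and from the '-' lines in the diffs) of the lock-ordering conflict and
of the rq->frozen handoff that resolved it:

/*
 * The conflict:
 *
 *   CPU A: schedule()                    CPU B: swap_out() and friends
 *   -----------------                    -----------------------------
 *   spin_lock(&rq->lock);                spin_lock(&mm->page_table_lock);
 *   context_switch() -> switch_mm()      try_to_wake_up(p)
 *     wants mm->page_table_lock            wants rq->lock
 *
 * i.e. a classic AB-BA deadlock if rq->lock is held across switch_mm().
 *
 * The frozen scheme dropped rq->lock before the switch while still pinning
 * the outgoing task to this runqueue:
 */
spin_lock(&rq->frozen);        /* pin the "not switched away from yet" task */
spin_unlock(&rq->lock);        /* rq->lock is not held across switch_mm()   */
context_switch(prev, next);    /* now free to take page_table_lock          */
rq = this_rq();                /* reload: the new task may have last run on
				* a different CPU                           */
spin_unlock_irq(&rq->frozen);  /* prev may now be scheduled elsewhere       */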

There is nothing weird about holding page_table_lock during
switch_mm, I can imagine many other ports doing it especially those
with TLB pids which want to optimize SMP tlb/cache flushes.

I think changing vmscan.c to not call wakeup is the wrong way to
go about this. I thought about doing that originally, but it looks
to be 100 times more difficult to implement (and verify) than the
scheduler side fix.

2002-06-12 16:59:52

by Ingo Molnar

[permalink] [raw]
Subject: [patch] switch_mm()'s desire to run without the rq lock


On Wed, 12 Jun 2002, David S. Miller wrote:

> There is nothing weird about holding page_table_lock during switch_mm, I
> can imagine many other ports doing it especially those with TLB pids
> which want to optimize SMP tlb/cache flushes.

as far as i can see only Sparc64 does it. And switch_mm() (along with
switch_to()) is a very scheduler-internal thing; we don't really (and must
not) do anything weird in it.

> I think changing vmscan.c to not call wakeup is the wrong way to go
> about this. I thought about doing that originally, but it looks to be
> 100 times more difficult to implement (and verify) than the scheduler
> side fix.

okay, understood.

here is a solution that allows us to eat and have the pudding at once.
(patch attached, against Linus' latest BK tree):

- i've extended the scheduler context-switch mechanism with the following
per-arch defines:

prepare_arch_schedule(prev_task);
finish_arch_schedule(prev_task);
prepare_arch_switch(rq);
finish_arch_switch(rq);

- plus switch_to() takes 3 parameters again:

switch_to(prev, next, last);

- schedule_tail() has the 'prev' task parameter again, it must be passed
over in switch_to() and passed in to the fork() startup path.

architectures that need to unlock the runqueue before doing the switch can
do the following:

#define prepare_arch_schedule(prev) task_lock(prev)
#define finish_arch_schedule(prev) task_unlock(prev)
#define prepare_arch_switch(rq) spin_unlock(&(rq)->lock)
#define finish_arch_switch(rq) __sti()

this way the task-lock makes sure that a task is not scheduled on some
other CPU before the switch-out finishes, but the runqueue lock is
dropped. (Local interrupts are kept disabled in this variant, just to
exclude things like TLB flushes - if that matters.)

architectures that can hold the runqueue lock during context-switch can do
the following simplification:

#define prepare_arch_schedule(prev) do { } while(0)
#define finish_arch_schedule(prev) do { } while(0)
#define prepare_arch_switch(rq) do { } while(0)
#define finish_arch_switch(rq) spin_unlock_irq(&(rq)->lock)

further optimizations possible in the 'simple' variant:

- an architecture does not have to handle the 'last' parameter in
switch_to() if the 'prev' parameter is unused in finish_arch_schedule().
This way the inlined return value of context_switch() too gets optimized
away at compile-time.

- an architecture does not have to pass the 'prev' pointer to
schedule_tail(), if the 'prev' parameter is unused in
finish_arch_schedule().

the x86 architecture makes use of these optimizations.

Via this solution we have a reasonably flexible context-switch setup which
falls back to the current (faster) code on x86, but on other platforms the
runqueue lock can be dropped before doing the context-switch as well.

Ingo

NOTE: i have coded and tested the 'complex' variant on x86 as well to make
sure it works for you on Sparc64 - but since x86's switch_mm() is
not too subtle it can use the simpler variant. [ The following
things had to be done to make x86 arch use the complex variant: the
4 complex macros have to be used in system.h, entry.S has to
'pushl %ebx' and 'addl $4, %esp' around the call to schedule_tail(),
and switch_to() had to be reverted to the 3-parameter variant
present in the 2.4 kernels. ]

NOTE2: prepare_to_switch() functionality has been moved into the
prepare_arch_switch() macro.

NOTE3: please use macros for prepare|finish_arch_switch() so that we can
keep the scheduler data structures internal to sched.c.

--- linux/arch/i386/kernel/entry.S.orig Wed Jun 12 16:53:02 2002
+++ linux/arch/i386/kernel/entry.S Wed Jun 12 17:47:18 2002
@@ -193,6 +193,7 @@

ENTRY(ret_from_fork)
#if CONFIG_SMP || CONFIG_PREEMPT
+ # NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
#endif
GET_THREAD_INFO(%ebx)
--- linux/include/asm-i386/system.h.orig Wed Jun 12 16:03:37 2002
+++ linux/include/asm-i386/system.h Wed Jun 12 18:36:34 2002
@@ -11,9 +11,12 @@
struct task_struct; /* one of the stranger aspects of C forward declarations.. */
extern void FASTCALL(__switch_to(struct task_struct *prev, struct task_struct *next));

-#define prepare_to_switch() do { } while(0)
+#define prepare_arch_schedule(prev) do { } while(0)
+#define finish_arch_schedule(prev) do { } while(0)
+#define prepare_arch_switch(rq) do { } while(0)
+#define finish_arch_switch(rq) spin_unlock_irq(&(rq)->lock)

-#define switch_to(prev,next) do { \
+#define switch_to(prev,next,last) do { \
asm volatile("pushl %%esi\n\t" \
"pushl %%edi\n\t" \
"pushl %%ebp\n\t" \
--- linux/kernel/sched.c.orig Wed Jun 12 15:47:30 2002
+++ linux/kernel/sched.c Wed Jun 12 18:15:06 2002
@@ -451,19 +451,18 @@
}

#if CONFIG_SMP || CONFIG_PREEMPT
-asmlinkage void schedule_tail(void)
+asmlinkage void schedule_tail(task_t *prev)
{
- spin_unlock_irq(&this_rq()->lock);
+ finish_arch_switch(this_rq());
+ finish_arch_schedule(prev);
}
#endif

-static inline void context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;

- prepare_to_switch();
-
if (unlikely(!mm)) {
next->active_mm = oldmm;
atomic_inc(&oldmm->mm_count);
@@ -477,7 +476,9 @@
}

/* Here we just switch the register state and the stack. */
- switch_to(prev, next);
+ switch_to(prev, next, prev);
+
+ return prev;
}

unsigned long nr_running(void)
@@ -823,6 +824,7 @@
rq = this_rq();

release_kernel_lock(prev, smp_processor_id());
+ prepare_arch_schedule(prev);
prev->sleep_timestamp = jiffies;
spin_lock_irq(&rq->lock);

@@ -878,23 +880,20 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- context_switch(prev, next);
-
- /*
- * The runqueue pointer might be from another CPU
- * if the new task was last running on a different
- * CPU - thus re-load it.
- */
- mb();
+
+ prepare_arch_switch(rq);
+ prev = context_switch(prev, next);
+ barrier();
rq = this_rq();
- }
- spin_unlock_irq(&rq->lock);
+ finish_arch_switch(rq);
+ } else
+ spin_unlock_irq(&rq->lock);
+ finish_arch_schedule(prev);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
if (test_thread_flag(TIF_NEED_RESCHED))
goto need_resched;
- return;
}

#ifdef CONFIG_PREEMPT


2002-06-12 21:59:23

by David Miller

[permalink] [raw]
Subject: Re: [patch] switch_mm()'s desire to run without the rq lock

From: Ingo Molnar <[email protected]>
Date: Wed, 12 Jun 2002 18:57:34 +0200 (CEST)

here is a solution that allows us to eat and have the pudding at once.
(patch attached, against Linus' latest BK tree):

Thanks for doing this, it looks fine to me. Also, thanks for being so
thorough in your testing. If only everyone would do this :-)

2002-06-12 22:36:04

by David Miller

[permalink] [raw]
Subject: Re: [patch] switch_mm()'s desire to run without the rq lock


Yikes!!! I just noticed a problem.

You can't just delete prepare_to_switch(), that is where we do
the register window flush on Sparc, and it has to be at that
exact location.

Can you redo your diffs, preserving prepare_to_switch()?
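One possible way to keep that behaviour under the new hooks (an editorial sketch,
not a tested patch; it assumes flushw_all() is the window-flush primitive the
existing sparc64 prepare_to_switch() performs, per NOTE2 in Ingo's mail above)
would be to fold the flush into prepare_arch_switch():

/*
 * Editorial sketch for sparc64.  Whether this spot - just before the
 * runqueue lock is dropped, immediately ahead of context_switch() -
 * satisfies the "exact location" requirement is for the Sparc
 * maintainers to judge.
 */
#define prepare_arch_schedule(prev)	task_lock(prev)
#define finish_arch_schedule(prev)	task_unlock(prev)
#define prepare_arch_switch(rq)					\
	do {							\
		flushw_all();	/* old prepare_to_switch() */	\
		spin_unlock(&(rq)->lock);			\
	} while (0)
#define finish_arch_switch(rq)		__sti()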

2002-06-13 21:26:49

by Robert Love

[permalink] [raw]
Subject: [PATCH] Re: current scheduler bits #3, 2.5.21

On Tue, 2002-06-11 at 11:33, Ingo Molnar wrote:

> + * WARNING: to squeeze out a few more cycles we do not disable preemption
> + * explicitly (or implicitly), we just keep interrupts disabled. This means
> + * that within task_rq_lock/unlock sections you must be careful
> + * about locking/unlocking spinlocks, since they could cause an unexpected
> + * preemption.

This is not working for me. I am getting intermittent hard locks -
seems we are taking a hard fall with interrupts off. Smelt like the
preemption optimization, so I removed it and all is well...

...which is fine, because I do not entirely trust this and it is really
not worth it to save just an inc and a dec.

Attached is your #3 patch with that one change reverted. The rest of
the changes are intact. It is running here very nicely...

Unless you have a solution - and really think this is worth it - I would
like to pull this change out.

Robert Love

diff -urN linux-2.5.21/fs/pipe.c linux/fs/pipe.c
--- linux-2.5.21/fs/pipe.c Sat Jun 8 22:26:29 2002
+++ linux/fs/pipe.c Thu Jun 13 14:10:48 2002
@@ -119,7 +119,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
if (!PIPE_EMPTY(*inode))
BUG();
@@ -219,7 +219,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
diff -urN linux-2.5.21/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.21/include/linux/sched.h Sat Jun 8 22:27:21 2002
+++ linux/include/linux/sched.h Thu Jun 13 14:10:48 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -urN linux-2.5.21/include/linux/spinlock.h linux/include/linux/spinlock.h
--- linux-2.5.21/include/linux/spinlock.h Sat Jun 8 22:30:01 2002
+++ linux/include/linux/spinlock.h Thu Jun 13 14:11:43 2002
@@ -197,6 +197,12 @@
preempt_enable(); \
} while (0)

+#define spin_unlock_no_resched(lock) \
+do { \
+ _raw_spin_unlock(lock); \
+ preempt_enable_no_resched(); \
+} while (0)
+
#define read_lock(lock) ({preempt_disable(); _raw_read_lock(lock);})
#define read_unlock(lock) ({_raw_read_unlock(lock); preempt_enable();})
#define write_lock(lock) ({preempt_disable(); _raw_write_lock(lock);})
@@ -206,20 +212,21 @@

#else

-#define preempt_get_count() (0)
-#define preempt_disable() do { } while (0)
-#define preempt_enable_no_resched() do {} while(0)
-#define preempt_enable() do { } while (0)
-
-#define spin_lock(lock) _raw_spin_lock(lock)
-#define spin_trylock(lock) _raw_spin_trylock(lock)
-#define spin_unlock(lock) _raw_spin_unlock(lock)
-
-#define read_lock(lock) _raw_read_lock(lock)
-#define read_unlock(lock) _raw_read_unlock(lock)
-#define write_lock(lock) _raw_write_lock(lock)
-#define write_unlock(lock) _raw_write_unlock(lock)
-#define write_trylock(lock) _raw_write_trylock(lock)
+#define preempt_get_count() (0)
+#define preempt_disable() do {} while (0)
+#define preempt_enable_no_resched() do {} while (0)
+#define preempt_enable() do {} while (0)
+
+#define spin_lock(lock) _raw_spin_lock(lock)
+#define spin_trylock(lock) _raw_spin_trylock(lock)
+#define spin_unlock(lock) _raw_spin_unlock(lock)
+#define spin_unlock_no_resched(lock) _raw_spin_unlock(lock)
+
+#define read_lock(lock) _raw_read_lock(lock)
+#define read_unlock(lock) _raw_read_unlock(lock)
+#define write_lock(lock) _raw_write_lock(lock)
+#define write_unlock(lock) _raw_write_unlock(lock)
+#define write_trylock(lock) _raw_write_trylock(lock)
#endif

/* "lock on reference count zero" */
diff -urN linux-2.5.21/kernel/ksyms.c linux/kernel/ksyms.c
--- linux-2.5.21/kernel/ksyms.c Sat Jun 8 22:26:33 2002
+++ linux/kernel/ksyms.c Thu Jun 13 14:10:48 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -urN linux-2.5.21/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.21/kernel/sched.c Sat Jun 8 22:28:13 2002
+++ linux/kernel/sched.c Thu Jun 13 14:10:48 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -153,17 +152,21 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ */
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;

repeat_lock_task:
- preempt_disable();
+ local_irq_save(*flags);
rq = task_rq(p);
- spin_lock_irqsave(&rq->lock, *flags);
+ spin_lock(&rq->lock);
if (unlikely(rq != task_rq(p))) {
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
goto repeat_lock_task;
}
return rq;
@@ -172,7 +175,23 @@
static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+ local_irq_disable();
+ rq = this_rq();
+ spin_lock(&rq->lock);
+ return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+ spin_unlock(&rq->lock);
+ local_irq_enable();
}

/*
@@ -284,9 +303,15 @@
repeat:
preempt_disable();
rq = task_rq(p);
- while (unlikely(rq->curr == p)) {
+ if (unlikely(rq->curr == p)) {
cpu_relax();
- barrier();
+ /*
+ * enable/disable preemption just to make this
+ * a preemption point - we are busy-waiting
+ * anyway.
+ */
+ preempt_enable();
+ goto repeat;
}
rq = task_rq_lock(p, &flags);
if (unlikely(rq->curr == p)) {
@@ -322,40 +347,50 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
unsigned long flags;
int success = 0;
long old_state;
runqueue_t *rq;

+repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- p->state = TASK_RUNNING;
if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
+ p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
{
runqueue_t *rq;

- preempt_disable();
- rq = this_rq();
- spin_lock_irq(&rq->lock);
+ rq = rq_lock(rq);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -371,8 +406,7 @@
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);

- spin_unlock_irq(&rq->lock);
- preempt_enable();
+ rq_unlock(rq);
}

/*
@@ -403,7 +437,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +862,6 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
context_switch(prev, next);

/*
@@ -840,10 +871,8 @@
*/
mb();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
}
+ spin_unlock_irq(&rq->lock);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
@@ -880,7 +909,7 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
struct list_head *tmp;
unsigned int state;
@@ -891,7 +920,7 @@
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -905,17 +934,36 @@
return;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1342,8 +1390,7 @@
runqueue_t *rq;
prio_array_t *array;

- preempt_disable();
- rq = this_rq();
+ rq = rq_lock(rq);

/*
* Decrease the yielding task's priority by one, to avoid
@@ -1353,7 +1400,6 @@
* If priority is already MAX_PRIO-1 then we still
* roundrobin the task within the runlist.
*/
- spin_lock_irq(&rq->lock);
array = current->array;
/*
* If the task has reached maximum priority (or is a RT task)
@@ -1370,8 +1416,7 @@
list_add_tail(&current->run_list, array->queue + current->prio);
__set_bit(current->prio, array->bitmap);
}
- spin_unlock(&rq->lock);
- preempt_enable_no_resched();
+ spin_unlock_no_resched(&rq->lock);

schedule();

@@ -1599,7 +1644,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {
@@ -1687,7 +1731,15 @@
task_rq_unlock(rq, &flags);
goto out;
}
-
+ /*
+ * If the task is not on a runqueue (and not running), then
+ * it is sufficient to simply update the task's cpu field.
+ */
+ if (!p->array && (p != rq->curr)) {
+ p->thread_info->cpu = __ffs(p->cpus_allowed);
+ task_rq_unlock(rq, &flags);
+ goto out;
+ }
init_MUTEX_LOCKED(&req.sem);
req.task = p;
list_add(&req.list, &rq->migration_queue);




2002-06-13 22:08:58

by Ingo Molnar

[permalink] [raw]
Subject: [patch] current scheduler bits #4, 2.5.21


On 13 Jun 2002, Robert Love wrote:

> This is not working for me. I am getting intermittent hard locks -
> seems we are taking a hard fall with interrupts off. Smelt like the
> preemption optimization, so I removed it and all is well...

okay, i agree that it's dangerous.

> ...which is fine, because I do not entirely trust this and it is really
> not worth it to save just an inc and a dec.

oh, we've been through bigger pain before, just to save single
instructions :)

but in this case it makes those codepaths really volatile - and
spin_lock/spin_unlock can cause an unexpected preemption. Especially on
architectures that implement smp_send_reschedule() via the SMP call
function, which in turn uses a spinlock, this approach poses real
immediate problems.
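For illustration, a schematic of the failure mode described here (editorial
sketch; call_lock and the internals are made-up names, not any particular
port's code):

/*
 * On a port that implements smp_send_reschedule() on top of the generic
 * cross-CPU call mechanism, the IPI path itself takes a spinlock:
 */
static spinlock_t call_lock = SPIN_LOCK_UNLOCKED;

void smp_send_reschedule(int cpu)
{
	spin_lock(&call_lock);	 /* preempt_disable() + raw lock */
	/* ... queue and send the reschedule IPI ... */
	spin_unlock(&call_lock); /* preempt_enable(): if the caller sits in a
				  * task_rq_lock() section that only did
				  * __cli(), the preempt count is back to zero
				  * here, so a pending need_resched invokes
				  * preempt_schedule() with the raw rq->lock
				  * still held and interrupts off. */
}

That is exactly the situation the NOTE added to resched_task() warned about,
and why such calls would have needed preempt_disable()/preempt_enable_no_resched()
wrappers for as long as the optimization stayed in.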

I've attached my current tree against 2.5.21, it has this optimization
removed. [i kept the spinlock.h enhancements; they might come in handy in
less complex scenarios.]

Ingo

diff -rNu linux-2.5.21-final/arch/i386/kernel/entry.S linux/arch/i386/kernel/entry.S
--- linux-2.5.21-final/arch/i386/kernel/entry.S Sun Jun 9 07:28:19 2002
+++ linux/arch/i386/kernel/entry.S Fri Jun 14 00:00:16 2002
@@ -193,6 +193,7 @@

ENTRY(ret_from_fork)
#if CONFIG_SMP || CONFIG_PREEMPT
+ # NOTE: this function takes a parameter but it's unused on x86.
call schedule_tail
#endif
GET_THREAD_INFO(%ebx)
diff -rNu linux-2.5.21-final/fs/pipe.c linux/fs/pipe.c
--- linux-2.5.21-final/fs/pipe.c Sun Jun 9 07:26:29 2002
+++ linux/fs/pipe.c Thu Jun 13 23:59:56 2002
@@ -119,7 +119,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
if (!PIPE_EMPTY(*inode))
BUG();
@@ -219,7 +219,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
diff -rNu linux-2.5.21-final/include/asm-i386/system.h linux/include/asm-i386/system.h
--- linux-2.5.21-final/include/asm-i386/system.h Sun Jun 9 07:26:29 2002
+++ linux/include/asm-i386/system.h Fri Jun 14 00:00:16 2002
@@ -11,9 +11,12 @@
struct task_struct; /* one of the stranger aspects of C forward declarations.. */
extern void FASTCALL(__switch_to(struct task_struct *prev, struct task_struct *next));

-#define prepare_to_switch() do { } while(0)
+#define prepare_arch_schedule(prev) do { } while(0)
+#define finish_arch_schedule(prev) do { } while(0)
+#define prepare_arch_switch(rq) do { } while(0)
+#define finish_arch_switch(rq) spin_unlock_irq(&(rq)->lock)

-#define switch_to(prev,next) do { \
+#define switch_to(prev,next,last) do { \
asm volatile("pushl %%esi\n\t" \
"pushl %%edi\n\t" \
"pushl %%ebp\n\t" \
diff -rNu linux-2.5.21-final/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.21-final/include/linux/sched.h Sun Jun 9 07:27:21 2002
+++ linux/include/linux/sched.h Thu Jun 13 23:59:56 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -rNu linux-2.5.21-final/include/linux/spinlock.h linux/include/linux/spinlock.h
--- linux-2.5.21-final/include/linux/spinlock.h Sun Jun 9 07:30:01 2002
+++ linux/include/linux/spinlock.h Fri Jun 14 00:00:09 2002
@@ -26,6 +26,7 @@
#define write_lock_bh(lock) do { local_bh_disable(); write_lock(lock); } while (0)

#define spin_unlock_irqrestore(lock, flags) do { spin_unlock(lock); local_irq_restore(flags); } while (0)
+#define _raw_spin_unlock_irqrestore(lock, flags) do { _raw_spin_unlock(lock); local_irq_restore(flags); } while (0)
#define spin_unlock_irq(lock) do { spin_unlock(lock); local_irq_enable(); } while (0)
#define spin_unlock_bh(lock) do { spin_unlock(lock); local_bh_enable(); } while (0)

@@ -183,6 +184,12 @@
preempt_schedule(); \
} while (0)

+#define preempt_check_resched() \
+do { \
+ if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
+ preempt_schedule(); \
+} while (0)
+
#define spin_lock(lock) \
do { \
preempt_disable(); \
@@ -197,6 +204,12 @@
preempt_enable(); \
} while (0)

+#define spin_unlock_no_resched(lock) \
+do { \
+ _raw_spin_unlock(lock); \
+ preempt_enable_no_resched(); \
+} while (0)
+
#define read_lock(lock) ({preempt_disable(); _raw_read_lock(lock);})
#define read_unlock(lock) ({_raw_read_unlock(lock); preempt_enable();})
#define write_lock(lock) ({preempt_disable(); _raw_write_lock(lock);})
@@ -206,20 +219,22 @@

#else

-#define preempt_get_count() (0)
-#define preempt_disable() do { } while (0)
+#define preempt_get_count() (0)
+#define preempt_disable() do { } while (0)
#define preempt_enable_no_resched() do {} while(0)
-#define preempt_enable() do { } while (0)
+#define preempt_enable() do { } while (0)
+#define preempt_check_resched() do { } while (0)

-#define spin_lock(lock) _raw_spin_lock(lock)
-#define spin_trylock(lock) _raw_spin_trylock(lock)
-#define spin_unlock(lock) _raw_spin_unlock(lock)
-
-#define read_lock(lock) _raw_read_lock(lock)
-#define read_unlock(lock) _raw_read_unlock(lock)
-#define write_lock(lock) _raw_write_lock(lock)
-#define write_unlock(lock) _raw_write_unlock(lock)
-#define write_trylock(lock) _raw_write_trylock(lock)
+#define spin_lock(lock) _raw_spin_lock(lock)
+#define spin_trylock(lock) _raw_spin_trylock(lock)
+#define spin_unlock(lock) _raw_spin_unlock(lock)
+#define spin_unlock_no_resched(lock) _raw_spin_unlock(lock)
+
+#define read_lock(lock) _raw_read_lock(lock)
+#define read_unlock(lock) _raw_read_unlock(lock)
+#define write_lock(lock) _raw_write_lock(lock)
+#define write_unlock(lock) _raw_write_unlock(lock)
+#define write_trylock(lock) _raw_write_trylock(lock)
#endif

/* "lock on reference count zero" */
diff -rNu linux-2.5.21-final/kernel/ksyms.c linux/kernel/ksyms.c
--- linux-2.5.21-final/kernel/ksyms.c Sun Jun 9 07:26:33 2002
+++ linux/kernel/ksyms.c Thu Jun 13 23:59:56 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -rNu linux-2.5.21-final/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.21-final/kernel/sched.c Sun Jun 9 07:28:13 2002
+++ linux/kernel/sched.c Fri Jun 14 00:00:21 2002
@@ -135,7 +135,6 @@
*/
struct runqueue {
spinlock_t lock;
- spinlock_t frozen;
unsigned long nr_running, nr_switches, expired_timestamp;
signed long nr_uninterruptible;
task_t *curr, *idle;
@@ -153,17 +152,21 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+/*
+ * task_rq_lock - lock the runqueue a given task resides on and disable
+ * interrupts. Note the ordering: we can safely lookup the task_rq without
+ * explicitly disabling preemption.
+ */
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;

repeat_lock_task:
- preempt_disable();
+ local_irq_save(*flags);
rq = task_rq(p);
- spin_lock_irqsave(&rq->lock, *flags);
+ spin_lock(&rq->lock);
if (unlikely(rq != task_rq(p))) {
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
goto repeat_lock_task;
}
return rq;
@@ -172,7 +175,23 @@
static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
spin_unlock_irqrestore(&rq->lock, *flags);
- preempt_enable();
+}
+
+/*
+ * rq_lock - lock a given runqueue and disable interrupts.
+ */
+static inline runqueue_t *rq_lock(runqueue_t *rq)
+{
+ local_irq_disable();
+ rq = this_rq();
+ spin_lock(&rq->lock);
+ return rq;
+}
+
+static inline void rq_unlock(runqueue_t *rq)
+{
+ spin_unlock(&rq->lock);
+ local_irq_enable();
}

/*
@@ -284,9 +303,15 @@
repeat:
preempt_disable();
rq = task_rq(p);
- while (unlikely(rq->curr == p)) {
+ if (unlikely(rq->curr == p)) {
cpu_relax();
- barrier();
+ /*
+ * enable/disable preemption just to make this
+ * a preemption point - we are busy-waiting
+ * anyway.
+ */
+ preempt_enable();
+ goto repeat;
}
rq = task_rq_lock(p, &flags);
if (unlikely(rq->curr == p)) {
@@ -322,40 +347,50 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
unsigned long flags;
int success = 0;
long old_state;
runqueue_t *rq;

+repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
- p->state = TASK_RUNNING;
if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
+ p->state = TASK_RUNNING;
task_rq_unlock(rq, &flags);
+
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
{
runqueue_t *rq;

- preempt_disable();
- rq = this_rq();
- spin_lock_irq(&rq->lock);
+ rq = rq_lock(rq);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -371,8 +406,7 @@
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);

- spin_unlock_irq(&rq->lock);
- preempt_enable();
+ rq_unlock(rq);
}

/*
@@ -401,19 +435,18 @@
}

#if CONFIG_SMP || CONFIG_PREEMPT
-asmlinkage void schedule_tail(void)
+asmlinkage void schedule_tail(task_t *prev)
{
- spin_unlock_irq(&this_rq()->frozen);
+ finish_arch_switch(this_rq());
+ finish_arch_schedule(prev);
}
#endif

-static inline void context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;

- prepare_to_switch();
-
if (unlikely(!mm)) {
next->active_mm = oldmm;
atomic_inc(&oldmm->mm_count);
@@ -427,7 +460,9 @@
}

/* Here we just switch the register state and the stack. */
- switch_to(prev, next);
+ switch_to(prev, next, prev);
+
+ return prev;
}

unsigned long nr_running(void)
@@ -773,6 +808,7 @@
rq = this_rq();

release_kernel_lock(prev, smp_processor_id());
+ prepare_arch_schedule(prev);
prev->sleep_timestamp = jiffies;
spin_lock_irq(&rq->lock);

@@ -828,28 +864,20 @@
if (likely(prev != next)) {
rq->nr_switches++;
rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
- context_switch(prev, next);
-
- /*
- * The runqueue pointer might be from another CPU
- * if the new task was last running on a different
- * CPU - thus re-load it.
- */
- mb();
+
+ prepare_arch_switch(rq);
+ prev = context_switch(prev, next);
+ barrier();
rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
+ finish_arch_switch(rq);
+ } else
spin_unlock_irq(&rq->lock);
- }
+ finish_arch_schedule(prev);

reacquire_kernel_lock(current);
preempt_enable_no_resched();
if (test_thread_flag(TIF_NEED_RESCHED))
goto need_resched;
- return;
}

#ifdef CONFIG_PREEMPT
@@ -880,7 +908,7 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
struct list_head *tmp;
unsigned int state;
@@ -891,7 +919,7 @@
curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
break;
}
@@ -905,17 +933,36 @@
return;

spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
unsigned long flags;

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1342,8 +1389,7 @@
runqueue_t *rq;
prio_array_t *array;

- preempt_disable();
- rq = this_rq();
+ rq = rq_lock(rq);

/*
* Decrease the yielding task's priority by one, to avoid
@@ -1353,7 +1399,6 @@
* If priority is already MAX_PRIO-1 then we still
* roundrobin the task within the runlist.
*/
- spin_lock_irq(&rq->lock);
array = current->array;
/*
* If the task has reached maximum priority (or is a RT task)
@@ -1370,8 +1415,7 @@
list_add_tail(&current->run_list, array->queue + current->prio);
__set_bit(current->prio, array->bitmap);
}
- spin_unlock(&rq->lock);
- preempt_enable_no_resched();
+ spin_unlock_no_resched(&rq->lock);

schedule();

@@ -1599,7 +1643,6 @@
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {
@@ -1687,7 +1730,15 @@
task_rq_unlock(rq, &flags);
goto out;
}
-
+ /*
+ * If the task is not on a runqueue (and not running), then
+ * it is sufficient to simply update the task's cpu field.
+ */
+ if (!p->array && (p != rq->curr)) {
+ p->thread_info->cpu = __ffs(p->cpus_allowed);
+ task_rq_unlock(rq, &flags);
+ goto out;
+ }
init_MUTEX_LOCKED(&req.sem);
req.task = p;
list_add(&req.list, &rq->migration_queue);