2002-02-20 17:57:55

by Erich Focht

Subject: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

Hi,

this is another attempt to overcome a limitation of the
O(1) scheduler: set_cpus_allowed() can only be called for current
processes.

There are several patches around which need to be able to set the
cpus_allowed mask of an arbitrary process. They are mostly meant as
interfaces to external (user-level) load-balancers which restrict
processes to run only on particular CPUs.

The appended patch contains the following changes:
- set_cpus_allowed treats tasks to be moved depending on their state:
- for tasks currently running on a foreign CPU:
an IPI is sent to their CPU,
the tasks are forced to unschedule,
an IPI is sent to the target CPU where the interrupt handler waits
for the task to unschedule and activates the task on its new runqueue.
- for tasks not running currently:
if the task is enqueued: deactivate from old RQ, activate on new RQ,
if the task is not enqueued: just change its cpu.
- if trying to move "current": same treatment as before.
- multiple simultaneous calls to set_cpus_allowed() are possible, the one
lock has been extended to an array of size NR_CPUS.
- the cpu of a forked task is chosen in do_fork() and not in
wake_up_forked_process(). This avoids problems with children forked
shortly after the cpus_allowed mask has been set (and the parent task
wasn't moved yet).
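
In compressed form, the new set_cpus_allowed() flow is roughly the
following (a sketch only; the real code in the diff below adds the
locking, the retry loop and the early return when the mask still
contains the current CPU):

    int tgt_cpu = __ffs(p->cpus_allowed);

    if (p == current) {
            /* old behaviour: IPI the target CPU, then schedule away */
            smp_migrate_task(tgt_cpu, p);
            schedule();
    } else if (p == task_rq(p)->curr) {
            /* running on a foreign CPU: IPI that CPU; its handler
             * forwards the task to the target CPU's handler */
            smp_migrate_task(p->thread_info->cpu, p);
    } else {
            /* not running: dequeue from the old RQ and enqueue on
             * the new one, or just change the cpu field */
            sched_move_task(p, p->thread_info->cpu, tgt_cpu);
    }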

The patch is for 2.5.4-K3. I'm actually developing on IA-64 and tested it
on Itanium systems based on 2.4.17 kernels where it survived my
tests. I hope this works for i386 and is helpful to someone.

Best regards,

Erich Focht

diff -ur linux-2.5.4/arch/i386/kernel/smp.c linux-2.5.4-sca/arch/i386/kernel/smp.c
--- linux-2.5.4/arch/i386/kernel/smp.c Mon Feb 11 02:50:09 2002
+++ linux-2.5.4-sca/arch/i386/kernel/smp.c Wed Feb 20 18:48:01 2002
@@ -485,8 +485,9 @@
do_flush_tlb_all_local();
}

-static spinlock_t migration_lock = SPIN_LOCK_UNLOCKED;
-static task_t *new_task;
+static spinlock_t migration_lock[NR_CPUS] = { [0 ... NR_CPUS-1] = SPIN_LOCK_UNLOCKED};
+static task_t *new_task[NR_CPUS] = { [0 ... NR_CPUS-1] = NULL};
+static long migration_state[NR_CPUS];

/*
* This function sends a 'task migration' IPI to another CPU.
@@ -497,23 +498,65 @@
/*
* The target CPU will unlock the migration spinlock:
*/
- _raw_spin_lock(&migration_lock);
- new_task = p;
+ _raw_spin_lock(&migration_lock[cpu]);
+ new_task[cpu] = p;
+ if (p->cpus_allowed & (1UL<<cpu)) {
+ /*
+ * if sending IPI to target CPU: remember state
+ * (only called if migrating current)
+ */
+ migration_state[cpu] = p->state;
+ p->state = TASK_UNINTERRUPTIBLE;
+ }
send_IPI_mask(1 << cpu, TASK_MIGRATION_VECTOR);
}

/*
* Task migration callback.
+ * If the task is currently running, its CPU gets the interrupt. The task is
+ * forced to unschedule and an interrupt is sent to the target CPU. On the
+ * target CPU we wait for the task to unschedule, restore its old state and
+ * activate it.
*/
asmlinkage void smp_task_migration_interrupt(void)
{
- task_t *p;
+ int this_cpu = smp_processor_id();
+ task_t *p = new_task[this_cpu];
+ long state = migration_state[this_cpu];

ack_APIC_irq();
- p = new_task;
- _raw_spin_unlock(&migration_lock);
- sched_task_migrated(p);
+
+ _raw_spin_unlock(&migration_lock[this_cpu]);
+
+ if (p->cpus_allowed & (1UL<<this_cpu)) {
+ /*
+ * on target CPU
+ */
+ sched_task_migrated(p, state);
+ } else {
+ /*
+ * on source CPU,
+ */
+ int tgt_cpu = __ffs(p->cpus_allowed);
+
+ _raw_spin_lock(&migration_lock[tgt_cpu]);
+ if (p == current) {
+ migration_state[tgt_cpu] = p->state;
+ new_task[tgt_cpu] = p;
+ p->state = TASK_UNINTERRUPTIBLE;
+ current->need_resched = 1;
+ send_IPI_mask(1 << tgt_cpu, TASK_MIGRATION_VECTOR);
+ } else { /* task meanwhile off CPU */
+ _raw_spin_unlock(&migration_lock[tgt_cpu]);
+ /*
+ * following function fails only if p==current or p has
+ * migrated to another CPU (which must be a valid one).
+ */
+ sched_move_task(p, p->thread_info->cpu, tgt_cpu);
+ }
+ }
}
+
/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing
diff -ur linux-2.5.4/include/linux/sched.h linux-2.5.4-sca/include/linux/sched.h
--- linux-2.5.4/include/linux/sched.h Wed Feb 20 18:09:02 2002
+++ linux-2.5.4-sca/include/linux/sched.h Wed Feb 20 18:49:12 2002
@@ -151,7 +151,8 @@
extern void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
-extern void sched_task_migrated(struct task_struct *p);
+extern void sched_task_migrated(struct task_struct *p, long state);
+extern int sched_move_task(task_t *p, int src_cpu, int tgt_cpu);
extern void smp_migrate_task(int cpu, task_t *task);
extern unsigned long cache_decay_ticks;

diff -ur linux-2.5.4/kernel/fork.c linux-2.5.4-sca/kernel/fork.c
--- linux-2.5.4/kernel/fork.c Wed Feb 20 18:09:02 2002
+++ linux-2.5.4-sca/kernel/fork.c Wed Feb 20 18:37:22 2002
@@ -693,6 +693,10 @@
{
int i;

+ if (likely(p->cpus_allowed & (1UL<<smp_processor_id())))
+ p->thread_info->cpu = smp_processor_id();
+ else
+ p->thread_info->cpu = __ffs(p->cpus_allowed);
/* ?? should we just memset this ?? */
for(i = 0; i < smp_num_cpus; i++)
p->per_cpu_utime[cpu_logical_map(i)] =
diff -ur linux-2.5.4/kernel/sched.c linux-2.5.4-sca/kernel/sched.c
--- linux-2.5.4/kernel/sched.c Wed Feb 20 18:09:02 2002
+++ linux-2.5.4-sca/kernel/sched.c Wed Feb 20 18:54:57 2002
@@ -305,11 +305,21 @@
*
* This function must be called with interrupts disabled.
*/
-void sched_task_migrated(task_t *new_task)
+void sched_task_migrated(task_t *p, long state)
{
- wait_task_inactive(new_task);
- new_task->thread_info->cpu = smp_processor_id();
- wake_up_process(new_task);
+ runqueue_t *rq;
+ unsigned long flags;
+
+ wait_task_inactive(p);
+ p->thread_info->cpu = smp_processor_id();
+ rq = lock_task_rq(p, &flags);
+ p->state = state;
+ if (!p->array) {
+ activate_task(p, rq);
+ if ((rq->curr == rq->idle) || (p->prio < rq->curr->prio))
+ resched_task(rq->curr);
+ }
+ unlock_task_rq(rq, &flags);
}

/*
@@ -364,7 +374,7 @@
runqueue_t *rq;

preempt_disable();
- rq = this_rq();
+ rq = task_rq(p);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -378,7 +388,7 @@
p->prio = effective_prio(p);
}
spin_lock_irq(&rq->lock);
- p->thread_info->cpu = smp_processor_id();
+ //p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);
spin_unlock_irq(&rq->lock);
preempt_enable();
@@ -1017,8 +1027,6 @@
new_mask &= cpu_online_map;
if (!new_mask)
BUG();
- if (p != current)
- BUG();

p->cpus_allowed = new_mask;
/*
@@ -1028,10 +1036,28 @@
if (new_mask & (1UL << smp_processor_id()))
return;
#if CONFIG_SMP
- current->state = TASK_UNINTERRUPTIBLE;
- smp_migrate_task(__ffs(new_mask), current);
+ {
+ int tgt_cpu = __ffs(p->cpus_allowed);

- schedule();
+ retry:
+ if (p == current) {
+ smp_migrate_task(tgt_cpu, p);
+ schedule();
+ } else {
+ if (p != task_rq(p)->curr) {
+ /*
+ * Task is not running currently
+ */
+ if (!sched_move_task(p, p->thread_info->cpu, tgt_cpu))
+ goto retry;
+ } else {
+ /*
+ * task is currently running on another CPU
+ */
+ smp_migrate_task(p->thread_info->cpu, p);
+ }
+ }
+ }
#endif
}

@@ -1488,6 +1514,36 @@
spin_unlock(&rq2->lock);
}

+/*
+ * Move a task from src_cpu to tgt_cpu.
+ * Return 0 if task is not on src_cpu or task is currently running.
+ * Return 1 if task was not running currently and moved successfully.
+ */
+int sched_move_task(task_t *p, int src_cpu, int tgt_cpu)
+{
+ int res = 0;
+ unsigned long flags;
+ runqueue_t *src = cpu_rq(src_cpu);
+ runqueue_t *tgt = cpu_rq(tgt_cpu);
+
+ local_irq_save(flags);
+ double_rq_lock(src, tgt);
+ if (task_rq(p) != src || p == src->curr)
+ goto out;
+ if (p->thread_info->cpu != tgt_cpu) {
+ p->thread_info->cpu = tgt_cpu;
+ if (p->array) {
+ deactivate_task(p, src);
+ activate_task(p, tgt);
+ }
+ }
+ res = 1;
+ out:
+ double_rq_unlock(src, tgt);
+ local_irq_restore(flags);
+ return res;
+}
+
void __init init_idle(task_t *idle, int cpu)
{
runqueue_t *idle_rq = cpu_rq(cpu), *rq = idle->array->rq;




2002-02-20 19:44:50

by Robert Love

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Wed, 2002-02-20 at 12:57, Erich Focht wrote:

> The patch is for 2.5.4-K3. I'm actually developing on IA-64 and tested it
> on Itanium systems based on 2.4.17 kernels where it survived my
> tests. I hope this works for i386 and is helpful to someone.

I was working on the same thing myself. I don't have a working
solution, so you beat me, and thus good job. I think we need this, for
various reasons, especially to implement a method of setting task
affinity that we can export to userspace.

I am a little surprised by how much code it took, though. Do we need
the function to act asynchronously? In other words, is it a requirement
that the task reschedule immediately, or only that when it next
reschedules it obeys its affinity?

Also, what is the reason for allowing multiple calls to
set_cpus_allowed? How often would that even occur?

Robert Love

2002-02-20 20:39:26

by Paul Jackson

Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On 20 Feb 2002, Robert Love, responding to Erich Focht, wrote:
>
> I was working on the same thing myself.

good news.

> ...
>
> I am a little surprised by how much code it took, though.

Not having looked at Ingo's patch yet, I can't tell how much
of Erich's patch is just Ingo's, and how much is the additional
change needed for non-current set_cpus_allowed().

I do see things like minor spacing changes in this patch
which might be bulking it up, but don't know the history
of such changes (comment prefix changed from 3 chars " * "
to 2 chars "* ", for example).

I also notice that the BUG() call in set_cpus_allowed() is
commented out:

- if (p != current)
- BUG();
+ //if (p != current)
+ // BUG();

Would it not be cleaner to remove the lines, not leave old
cruft laying around? There are a half dozen other such
commented out lines that this same notice might also apply to.


> Do we need the function to act asynchronously? In other words,
> is it a requirement that the task reschedule immediately, or
> only that when it next reschedules it obeys its affinity?

Excellent question, in my view.

I see three levels of synchronization possible here:

1) As Erich did, use IPI to get immediate application

2) Wakeup the target task, so that it will "soon" see the
cpus_allowed change, but don't bother with the IPI, or

3) Make no effort to expedite notification, and let the
target notice its changed cpus_allowed when (and if)
it ever happens to be scheduled active again.

Personally, I don't care, past hoping that the code is as simple
as possible (but no simpler ;). But I'm just implementing
an intermediate mechanism (CpuMemSets) that is intended to
provide a generic hook for a wide variety of load balancing
mechanisms. The implementors of those various load balancers
(NUMA extensions) likely have some minimum requirements.

I'd suspect that (2) would be sufficient, and much easier
and more portable to implement.


> Also, what is the reason for allowing multiple calls to
> set_cpus_allowed? How often would that even occur?

I haven't looked at this enough to understand your question.
Could you (Robert or Erich) explain more? I trust no one is
saying that it's ok to have code that is unsafe because it
will only crash rarely.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2002-02-21 00:08:33

by Erich Focht

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

Hi Paul & Robert,

> On 20 Feb 2002, Robert Love, responding to Erich Focht, wrote:
> >
> > I was working on the same thing myself.

Oh! Then I hope you can give it a try and test it with your application?


> I do see things like minor spacing changes in this patch
> which might be bulking it up, but don't know the history
> of such changes (comment prefix changed from 3 chars " * "
> to 2 chars "* ", for example).
[...]
>
> Would it not be cleaner to remove the lines, not leave old
> cruft laying around? There are a half dozen other such
> commented out lines that this same notice might also apply to.

You're right, my mistake... Not being a patch for the latest kernel
but for 2.5.4 (the kernel Ingo's latest patch comes with), it was intended
more as a basis for discussion with people interested in cpus_allowed and
with Ingo. I'll take more care of such details in the future.


> > Do we need the function to act asynchronously? In other words,
> > is it a requirement that the task reschedule immediately, or
> > only that when it next reschedules it obeys its affinity?
>
> Excellent question, in my view.
>
> I see three levels of synchronization possible here:
>
> 1) As Erich did, use IPI to get immediate application
>
> 2) Wakeup the target task, so that it will "soon" see the
> cpus_allowed change, but don't bother with the IPI, or
>
> 3) Make no effort to expedite notification, and let the
> target notice its changed cpus_allowed when (and if)
> it ever happens to be scheduled active again.

A running task has no reason to change the CPU! There is no code in
schedule() which makes it change runqueues; that only happens when
load_balance() steals the task, and it wouldn't do so if there is only one
task running on a CPU. The alternative to sending an IPI is to put
additional code into schedule() which checks the cpus_allowed mask. This
is not desirable, I guess.

Another problem is the right moment to change the cpu field of the
task. It is better not to change it while the task is enqueued, because
there are a few calls to lock_task_rq() which could then lock the wrong
RQ. Therefore we must wait for the task to be dequeued, which can be
achieved by changing its state and waiting for schedule() to throw it
out, then changing the CPU. But which state should we restore? Between our
setting of the state and the schedule() the task may have changed state
itself... and might be woken up before we did our cpu change. I tried
several such variants and mostly ended up in some deadlock :-( So I thought
the IPI to the source CPU is cleaner and avoids an unnecessary wait in
set_cpus_allowed(). The IPI to the target CPU is the same as in the
initial design of Ingo. It has to wait for the task to unschedule and
knows it will find it dequeued.

Still, IPIs are only sent if we want to move a currently running
task. Other tasks are simply dequeued and enqueued; the additional code
for this (sched_move_task()) is minimal, in my opinion.

> > Also, what is the reason for allowing multiple calls to
> > set_cpus_allowed? How often would that even occur?

Multiple calls can and will occur if you allow users to change the
cpu masks of their processes. And the feature was easy to get by just
replicating the spinlocks.

Regards,
Erich

2002-02-21 01:13:10

by Paul Jackson

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Thu, 21 Feb 2002, Erich Focht, responding to Paul Jackson, wrote:
>
> Hi Paul & Robert,
>
> > On 20 Feb 2002, Robert Love, responding to Erich Focht, wrote:
> > >
> > > ...
> > >
> > > Do we need the function to act asynchronously? In other words,
> > > is it a requirement that the task reschedule immediately, or
> > > only that when it next reschedules it obeys its affinity?
> >
> > ... [ we could ] ...
> >
> > 1) As Erich did, use IPI to get immediate application
> >
> > 2) Wakeup the target task, so that it will "soon" see the
> > cpus_allowed change, but don't bother with the IPI, or
> >
> > 3) Make no effort to expedite notification, and let the
> > target notice its changed cpus_allowed when (and if)
> > it ever happens to be scheduled active again.
>
> A running task has no reason to change the CPU!

well ... so ... it will check, every time slice or so,
whether it should give up the CPU, right? That is, the
code sched.c scheduler() does run every slice or so, right?


> ... The alternative to sending an IPI is to put additional
> code into schedule() which checks the cpus_allowed mask. This
> is not desirable, I guess.

Well ... not exactly "check the cpus_allowed mask" because that
mask is no longer something we can change from another process,
with Ingo's latest scheduler.

Rather check some other task field. Say we had a
"proposed_cpus_allowed" field of the task_struct, settable by
any process (perhaps including an associated lock, if need be).

Say this field was normally set to zero, but could be
set to some non-zero value by some other process wanting
to change the current process cpus_allowed. Then in the
scheduler, at the very bottom of schedule(), check to see if
current->proposed_cpus_allowed is non-zero. If it is, use
set_cpus_allowed() to set cpus_allowed to the proposed value.
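
A minimal sketch of that idea, assuming a hypothetical
proposed_cpus_allowed field (nothing below is from a posted patch):

    /* hypothetical check at the tail of schedule(), per the idea above */
    if (unlikely(current->proposed_cpus_allowed)) {
            unsigned long mask = current->proposed_cpus_allowed;

            current->proposed_cpus_allowed = 0;
            set_cpus_allowed(current, mask); /* fine: we are 'current' */
    }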

[I am painfully aware that I am not an expert in this code,
so please be gentle if I'm spouting gibberish ...]


> Another problem is the right moment to change the cpu field of the
> task. ... The IPI to the target CPU is the same as in the
> initial design of Ingo. It has to wait for the task to unschedule and
> knows it will find it dequeued.

How about not changing anything of the target task synchronously,
except for some new "proposed_cpus_allowed" field. Set that
field and get out of there. Let the target process run the
set_cpus_allowed() routine on itself, next time it passes through
the schedule() code. Leave it the case that the set_cpus_allowed()
routine can only be run on the current process.

Perhaps others need this cpus_allowed change (and the migration
of a task to a different allowed cpu) to happen "right away".
But I don't, and I don't (yet) know that anyone else does.


> Still, IPIs are only sent if we want to move a currently running
> task. Other tasks are simply dequeued and enqueued; the additional code
> for this (sched_move_task()) is minimal, in my opinion.

And the need for this (sched_move_task()) also seems minimal.


> > > Also, what is the reason for allowing multiple calls to
> > > set_cpus_allowed? How often would that even occur?
>
> Multiple calls can and will occur if you allow users to change the
> cpu masks of their processes. And the feature was easy to get by just
> replicating the spinlocks.

Isn't it customary in Linux kernel work to minimize the number
of locks -- avoid replication unless there is a genuine need
(a hot lock, say) to replicate locks? We don't routinely add
features because they are easy -- we add them because they are
needed.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2002-02-21 01:42:03

by Kimio Suganuma

Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

Hi all,

On Wed, 20 Feb 2002 17:12:21 -0800
Paul Jackson <[email protected]> wrote:

> > Another problem is the right moment to change the cpu field of the
> > task. ... The IPI to the target CPU is the same as in the
> > initial design of Ingo. It has to wait for the task to unschedule and
> > knows it will find it dequeued.
>
> How about not changing anything of the target task synchronously,
> except for some new "proposed_cpus_allowed" field. Set that
> field and get out of there. Let the target process run the
> set_cpus_allowed() routine on itself, next time it passes through
> the schedule() code. Leave it the case that the set_cpus_allowed()
> routine can only be run on the current process.
>
> Perhaps others need this cpus_allowed change (and the migration
> of a task to a different allowed cpu) to happen "right away".
> But I don't, and I don't (yet) know that anyone else does.

CPU hotplug needs to change cpus_allowed in definite time.
When a process is sleeping for 100000 seconds, how can we offline
the CPU the process belongs to?

Regards,
Kimi

--
Kimio Suganuma <[email protected]>

2002-02-21 02:05:05

by Paul Jackson

Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Wed, 20 Feb 2002, Kimio Suganuma wrote:
>
> CPU hotplug needs to change cpus_allowed in definite time.
> When a process is sleeping for 100000 seconds, how can we offline
> the CPU the process belongs to?

Good - I figured I'd hear from you on this - thanks.

Are you thinking "definite time" on the order of a second?
I presume you don't require millisecond response time, and that
minute response time would be too slow, right?

And just brainstorming ... if a process is sleeping for a long
time, and the last cpu it executed on is being taken offline,
what need is there to wake up the process? Let the process
stay asleep, and find it a new home when it wakes up for other
reasons.

In other words, perhaps the goal of having the smallest,
simplest, least intrusive, most clearly correct code is more
important here than waking up a process just to tell it that
it's last cpu went offline.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2002-02-21 02:30:53

by Kimio Suganuma

Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks


On Wed, 20 Feb 2002 18:04:49 -0800
Paul Jackson <[email protected]> wrote:

> On Wed, 20 Feb 2002, Kimio Suganuma wrote:
> >
> > CPU hotplug needs to change cpus_allowed in definite time.
> > When a process is sleeping for 100000 seconds, how can we offline
> > the CPU the process belongs to?
>
> Good - I figured I'd hear from you on this - thanks.
>
> Are you thinking "definite time" on the order of a second?
> I presume you don't require millisecond response time, and that
> minute response time would be too slow, right?

Exactly.

> And just brainstorming ... if a process is sleeping for a long
> time, and the last cpu it executed on is being taken offline,
> what need is there to wake up the process? Let the process
> stay asleep, and find it a new home when it wakes up for other
> reasons.

In such a case, the woken-up process's p->cpu must be changed
by another process or from interrupt context, not by the task itself.
So we cannot assume that p->cpu, or p->cpus_allowed, is always
changed by the task itself, right?

> In other words, perhaps the goal of having the smallest,
> simplest, least intrusive, most clearly correct code is more
> important here than waking up a process just to tell it that
> it's last cpu went offline.

Smallest, simplest and correct...
I wish I could figure out such code. :(

Regards,
Kimi

--
Kimio Suganuma <[email protected]>

2002-02-21 04:51:24

by Mike Kravetz

Subject: Re: [Lse-tech] [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Wed, Feb 20, 2002 at 06:57:05PM +0100, Erich Focht wrote:
> Hi,
>
> this is another attempt to overcome a limitation of the
> O(1) scheduler: set_cpus_allowed() can only be called for current
> processes.

Great! I'm glad someone is looking into this. Didn't look at
your patch too closely, but one obvious issue comes to mind.
How does the caller of set_cpus_allowed() lock down the specified
task so that set_cpus_allowed() can be sure it is still valid?
Obviously, this only becomes an issue when you open up the
routine to tasks other than 'current' as you have done. Do you
want to do task validation within set_cpus_allowed() with the
tasklist_lock held?

--
Mike

2002-02-21 04:57:06

by Reid Hekman

Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Wed, 2002-02-20 at 20:29, Kimio Suganuma wrote:
> Smallest, simplest and correct...

Actually, I think the appropriate responsive axiom is, "pick two."

> I wish I could figure out such code. :(
>

Don't we all.

Regards,
Reid

2002-02-21 14:34:51

by Ingo Molnar

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks


On Wed, 20 Feb 2002, Erich Focht wrote:

> The appended patch contains the following changes:
> - set_cpus_allowed treats tasks to be moved depending on their state:
> - for tasks currently running on a foreign CPU:
> an IPI is sent to their CPU,
> the tasks are forced to unschedule,
> an IPI is sent to the target CPU where the interrupt handler waits
> for the task to unschedule and activates the task on its new runqueue.
> - for tasks not running currently:
> if the task is enqueued: deactivate from old RQ, activate on new RQ,
> if the task is not enqueued: just change its cpu.
> - if trying to move "current": same treatment as before.
> - multiple simultaneous calls to set_cpus_allowed() are possible, the one
> lock has been extended to an array of size NR_CPUS.
> - the cpu of a forked task is chosen in do_fork() and not in
> wake_up_forked_process(). This avoids problems with children forked
> shortly after the cpus_allowed mask has been set (and the parent task
> wasn't moved yet).

the problem is, the migration interrupt concept is fundamentally unrobust.
It involves 'waiting' for a task to unschedule in IRQ context - this is
problematic. What happens if the involuntarily migrated task is looping in
a spin_lock(), which lock is held by a task on the target CPU, which task
is hit by the migration interrupt? Deadlock.

to solve this i had to get rid of the migration interrupt completely. (see
the attached patch, against 2.5.5) The concept is the following: there are
new per-CPU system threads (so-called migration threads) that handle a
per-runqueue 'migration queue'. set_cpus_allowed() registers the task in the
target CPU's migration queue, kicks the migrating task and wakes up the
target CPU's migration thread. The migrating task unschedules on its source
CPU, at which point the migration thread picks the task up and puts it into
its local runqueue.

this solution also makes it easier to support migration on non-x86
architectures: no arch-specific functionality is needed.

i've tested forced migration (by adding a syscall that allows users to
change the affinity of processes), and it works just fine under various
situations.
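
(the test syscall itself is not part of the patch below; a caller might
look roughly like this, where the syscall name is made up and
get_task_struct()/put_task_struct() are assumed to be available as
reference helpers:)

    asmlinkage long sys_set_task_affinity(pid_t pid, unsigned long new_mask)
    {
            task_t *p;

            read_lock(&tasklist_lock);
            p = find_task_by_pid(pid);
            if (p)
                    get_task_struct(p);     /* assumed: pin the task */
            read_unlock(&tasklist_lock);
            if (!p)
                    return -ESRCH;

            set_cpus_allowed(p, new_mask);  /* may block, so no locks held */
            put_task_struct(p);             /* assumed: drop the reference */
            return 0;
    }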

(the patch also includes a number of other pending scheduler cleanups and
small speedups, such as the removal of sync wakeups - cache-affine load
balancing solves the sync wakeup problem already.)

Ingo

--- linux/fs/pipe.c.orig Thu Feb 21 10:23:45 2002
+++ linux/fs/pipe.c Thu Feb 21 10:25:00 2002
@@ -116,7 +116,7 @@
* writers synchronously that there is more
* room.
*/
- wake_up_interruptible_sync(PIPE_WAIT(*inode));
+ wake_up_interruptible(PIPE_WAIT(*inode));
if (!PIPE_EMPTY(*inode))
BUG();
goto do_more_read;
@@ -214,7 +214,7 @@
* is going to give up this CPU, so it doesnt have
* to do idle reschedules.
*/
- wake_up_interruptible_sync(PIPE_WAIT(*inode));
+ wake_up_interruptible(PIPE_WAIT(*inode));
PIPE_WAITING_WRITERS(*inode)++;
pipe_wait(inode);
PIPE_WAITING_WRITERS(*inode)--;
--- linux/init/main.c.orig Thu Feb 21 11:45:22 2002
+++ linux/init/main.c Thu Feb 21 13:23:05 2002
@@ -413,7 +413,12 @@
*/
static void __init do_basic_setup(void)
{
-
+ /*
+ * Let the per-CPU migration threads start up:
+ */
+#if CONFIG_SMP
+ migration_init();
+#endif
/*
* Tell the world that we're going to be the grim
* reaper of innocent orphaned children.
--- linux/kernel/sched.c.orig Thu Feb 21 10:23:49 2002
+++ linux/kernel/sched.c Thu Feb 21 14:33:51 2002
@@ -16,12 +16,12 @@
#include <linux/nmi.h>
#include <linux/init.h>
#include <asm/uaccess.h>
+#include <linux/highmem.h>
#include <linux/smp_lock.h>
+#include <asm/mmu_context.h>
#include <linux/interrupt.h>
#include <linux/completion.h>
-#include <asm/mmu_context.h>
#include <linux/kernel_stat.h>
-#include <linux/highmem.h>

/*
* Priority of a process goes from 0 to 139. The 0-99
@@ -127,8 +127,6 @@

struct prio_array {
int nr_active;
- spinlock_t *lock;
- runqueue_t *rq;
unsigned long bitmap[BITMAP_SIZE];
list_t queue[MAX_PRIO];
};
@@ -146,6 +144,8 @@
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+ task_t *migration_thread;
+ list_t migration_queue;
} ____cacheline_aligned;

static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -156,23 +156,23 @@
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

-static inline runqueue_t *lock_task_rq(task_t *p, unsigned long *flags)
+static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
- struct runqueue *__rq;
+ struct runqueue *rq;

repeat_lock_task:
preempt_disable();
- __rq = task_rq(p);
- spin_lock_irqsave(&__rq->lock, *flags);
- if (unlikely(__rq != task_rq(p))) {
- spin_unlock_irqrestore(&__rq->lock, *flags);
+ rq = task_rq(p);
+ spin_lock_irqsave(&rq->lock, *flags);
+ if (unlikely(rq != task_rq(p))) {
+ spin_unlock_irqrestore(&rq->lock, *flags);
preempt_enable();
goto repeat_lock_task;
}
- return __rq;
+ return rq;
}

-static inline void unlock_task_rq(runqueue_t *rq, unsigned long *flags)
+static inline void task_rq_unlock(runqueue_t *rq, unsigned long *flags)
{
spin_unlock_irqrestore(&rq->lock, *flags);
preempt_enable();
@@ -184,7 +184,7 @@
static inline void dequeue_task(struct task_struct *p, prio_array_t *array)
{
array->nr_active--;
- list_del_init(&p->run_list);
+ list_del(&p->run_list);
if (list_empty(array->queue + p->prio))
__clear_bit(p->prio, array->bitmap);
}
@@ -289,31 +289,17 @@
cpu_relax();
barrier();
}
- rq = lock_task_rq(p, &flags);
+ rq = task_rq_lock(p, &flags);
if (unlikely(rq->curr == p)) {
- unlock_task_rq(rq, &flags);
+ task_rq_unlock(rq, &flags);
preempt_enable();
goto repeat;
}
- unlock_task_rq(rq, &flags);
+ task_rq_unlock(rq, &flags);
preempt_enable();
}

/*
- * The SMP message passing code calls this function whenever
- * the new task has arrived at the target CPU. We move the
- * new task into the local runqueue.
- *
- * This function must be called with interrupts disabled.
- */
-void sched_task_migrated(task_t *new_task)
-{
- wait_task_inactive(new_task);
- new_task->thread_info->cpu = smp_processor_id();
- wake_up_process(new_task);
-}
-
-/*
* Kick the remote CPU if the task is running currently,
* this code is used by the signal code to signal tasks
* which are in user-mode as quickly as possible.
@@ -337,27 +323,27 @@
* "current->state = TASK_RUNNING" to mark yourself runnable
* without the overhead of this.
*/
-static int try_to_wake_up(task_t * p, int synchronous)
+static int try_to_wake_up(task_t * p)
{
unsigned long flags;
int success = 0;
runqueue_t *rq;

- rq = lock_task_rq(p, &flags);
+ rq = task_rq_lock(p, &flags);
p->state = TASK_RUNNING;
if (!p->array) {
activate_task(p, rq);
- if ((rq->curr == rq->idle) || (p->prio < rq->curr->prio))
+ if (p->prio < rq->curr->prio)
resched_task(rq->curr);
success = 1;
}
- unlock_task_rq(rq, &flags);
+ task_rq_unlock(rq, &flags);
return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p, 0);
+ return try_to_wake_up(p);
}

void wake_up_forked_process(task_t * p)
@@ -366,6 +352,7 @@

preempt_disable();
rq = this_rq();
+ spin_lock_irq(&rq->lock);

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -378,10 +365,12 @@
p->sleep_avg = p->sleep_avg * CHILD_PENALTY / 100;
p->prio = effective_prio(p);
}
- spin_lock_irq(&rq->lock);
+ INIT_LIST_HEAD(&p->migration_list);
p->thread_info->cpu = smp_processor_id();
activate_task(p, rq);
+
spin_unlock_irq(&rq->lock);
+ init_MUTEX(&p->migration_sem);
preempt_enable();
}

@@ -861,44 +850,33 @@
* started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
* zero in this (rare) case, and we handle it by continuing to scan the queue.
*/
-static inline void __wake_up_common (wait_queue_head_t *q, unsigned int mode,
- int nr_exclusive, const int sync)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
{
struct list_head *tmp;
+ unsigned int state;
+ wait_queue_t *curr;
task_t *p;

- list_for_each(tmp,&q->task_list) {
- unsigned int state;
- wait_queue_t *curr = list_entry(tmp, wait_queue_t, task_list);
-
+ list_for_each(tmp, &q->task_list) {
+ curr = list_entry(tmp, wait_queue_t, task_list);
p = curr->task;
state = p->state;
- if ((state & mode) &&
- try_to_wake_up(p, sync) &&
- ((curr->flags & WQ_FLAG_EXCLUSIVE) &&
- !--nr_exclusive))
- break;
+ if ((state & mode) && try_to_wake_up(p) &&
+ ((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
+ break;
}
}

-void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr)
+void __wake_up(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
{
- if (q) {
- unsigned long flags;
- wq_read_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr, 0);
- wq_read_unlock_irqrestore(&q->lock, flags);
- }
-}
+ unsigned long flags;

-void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr)
-{
- if (q) {
- unsigned long flags;
- wq_read_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr, 1);
- wq_read_unlock_irqrestore(&q->lock, flags);
- }
+ if (unlikely(!q))
+ return;
+
+ wq_read_lock_irqsave(&q->lock, flags);
+ __wake_up_common(q, mode, nr_exclusive);
+ wq_read_unlock_irqrestore(&q->lock, flags);
}

void complete(struct completion *x)
@@ -907,7 +885,7 @@

spin_lock_irqsave(&x->wait.lock, flags);
x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -994,35 +972,66 @@
return timeout;
}

+void scheduling_functions_end_here(void) { }
+
+#if CONFIG_SMP
+
/*
- * Change the current task's CPU affinity. Migrate the process to a
- * proper CPU and schedule away if the current CPU is removed from
- * the allowed bitmask.
+ * Change a given task's CPU affinity. Migrate the process to a
+ * proper CPU and schedule it away if the current CPU is removed
+ * from the allowed bitmask.
*/
void set_cpus_allowed(task_t *p, unsigned long new_mask)
{
+ unsigned long flags;
+ runqueue_t *rq;
+ int dest_cpu;
+
+ down(&p->migration_sem);
+ if (!list_empty(&p->migration_list))
+ BUG();
+
new_mask &= cpu_online_map;
if (!new_mask)
BUG();
- if (p != current)
- BUG();

+ rq = task_rq_lock(p, &flags);
p->cpus_allowed = new_mask;
/*
- * Can the task run on the current CPU? If not then
+ * Can the task run on the task's current CPU? If not then
* migrate the process off to a proper CPU.
*/
- if (new_mask & (1UL << smp_processor_id()))
- return;
-#if CONFIG_SMP
- current->state = TASK_UNINTERRUPTIBLE;
- smp_migrate_task(__ffs(new_mask), current);
+ if (new_mask & (1UL << p->thread_info->cpu)) {
+ task_rq_unlock(rq, &flags);
+ goto out;
+ }
+ /*
+ * We mark the process as nonrunnable, and kick it to
+ * schedule away from its current CPU. We also add
+ * the task to the migration queue and wake up the
+ * target CPU's migration thread, so that it can pick
+ * up this task and insert it into the local runqueue.
+ */
+ p->state = TASK_UNINTERRUPTIBLE;
+ kick_if_running(p);
+ task_rq_unlock(rq, &flags);

- schedule();
-#endif
+ dest_cpu = __ffs(new_mask);
+ rq = cpu_rq(dest_cpu);
+
+ spin_lock_irq(&rq->lock);
+ list_add(&p->migration_list, &rq->migration_queue);
+ spin_unlock_irq(&rq->lock);
+ wake_up_process(rq->migration_thread);
+
+ while (!((1UL << p->thread_info->cpu) & p->cpus_allowed) &&
+ (p->state != TASK_ZOMBIE))
+ yield();
+out:
+ up(&p->migration_sem);
}

-void scheduling_functions_end_here(void) { }
+#endif

void set_user_nice(task_t *p, long nice)
{
@@ -1036,7 +1045,7 @@
* We have to be careful, if called from sys_setpriority(),
* the task might be in the middle of scheduling on another CPU.
*/
- rq = lock_task_rq(p, &flags);
+ rq = task_rq_lock(p, &flags);
if (rt_task(p)) {
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
@@ -1056,7 +1065,7 @@
resched_task(rq->curr);
}
out_unlock:
- unlock_task_rq(rq, &flags);
+ task_rq_unlock(rq, &flags);
}

#ifndef __alpha__
@@ -1154,7 +1163,7 @@
* To be able to change p->policy safely, the apropriate
* runqueue lock must be held.
*/
- rq = lock_task_rq(p, &flags);
+ rq = task_rq_lock(p, &flags);

if (policy < 0)
policy = p->policy;
@@ -1197,7 +1206,7 @@
activate_task(p, task_rq(p));

out_unlock:
- unlock_task_rq(rq, &flags);
+ task_rq_unlock(rq, &flags);
out_unlock_tasklist:
read_unlock_irq(&tasklist_lock);

@@ -1477,7 +1486,7 @@

void __init init_idle(task_t *idle, int cpu)
{
- runqueue_t *idle_rq = cpu_rq(cpu), *rq = idle->array->rq;
+ runqueue_t *idle_rq = cpu_rq(cpu), *rq = cpu_rq(idle->thread_info->cpu);
unsigned long flags;

__save_flags(flags);
@@ -1509,14 +1518,13 @@
runqueue_t *rq = cpu_rq(i);
prio_array_t *array;

- rq->active = rq->arrays + 0;
+ rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
+ INIT_LIST_HEAD(&rq->migration_queue);

for (j = 0; j < 2; j++) {
array = rq->arrays + j;
- array->rq = rq;
- array->lock = &rq->lock;
for (k = 0; k < MAX_PRIO; k++) {
INIT_LIST_HEAD(array->queue + k);
__clear_bit(k, array->bitmap);
@@ -1545,3 +1553,104 @@
atomic_inc(&init_mm.mm_count);
enter_lazy_tlb(&init_mm, current, smp_processor_id());
}
+
+#if CONFIG_SMP
+
+static volatile unsigned long migration_mask;
+
+static int migration_thread(void * unused)
+{
+ runqueue_t *rq;
+
+ daemonize();
+ sigfillset(&current->blocked);
+ set_user_nice(current, -20);
+
+ /*
+ * We have to migrate manually - there is no migration thread
+ * to do this for us yet :-)
+ *
+ * We use the following property of the Linux scheduler. At
+ * this point no other task is running, so by keeping all
+ * migration threads running, the load-balancer will distribute
+ * them between all CPUs equally. At that point every migration
+ * task binds itself to the current CPU.
+ */
+
+ /* wait for all migration threads to start up. */
+ while (!migration_mask)
+ yield();
+
+ for (;;) {
+ preempt_disable();
+ if (test_and_clear_bit(smp_processor_id(), &migration_mask))
+ current->cpus_allowed = 1 << smp_processor_id();
+ if (test_thread_flag(TIF_NEED_RESCHED))
+ schedule();
+ if (!migration_mask)
+ break;
+ preempt_enable();
+ }
+ rq = this_rq();
+ rq->migration_thread = current;
+ preempt_enable();
+
+ sprintf(current->comm, "migration_CPU%d", smp_processor_id());
+
+ for (;;) {
+ struct list_head *head;
+ unsigned long flags;
+ task_t *p = NULL;
+
+ spin_lock_irqsave(&rq->lock, flags);
+ head = &rq->migration_queue;
+ if (list_empty(head)) {
+ current->state = TASK_UNINTERRUPTIBLE;
+ spin_unlock_irqrestore(&rq->lock, flags);
+ schedule();
+ continue;
+ }
+ p = list_entry(head->next, task_t, migration_list);
+ list_del_init(head->next);
+ spin_unlock_irqrestore(&rq->lock, flags);
+
+ for (;;) {
+ runqueue_t *rq2 = task_rq_lock(p, &flags);
+
+ if (!p->array) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq2, &flags);
+ wake_up_process(p);
+ break;
+ }
+ if (p->state != TASK_UNINTERRUPTIBLE) {
+ p->state = TASK_UNINTERRUPTIBLE;
+ kick_if_running(p);
+ }
+ task_rq_unlock(rq2, &flags);
+ while ((p->state == TASK_UNINTERRUPTIBLE) && p->array) {
+ cpu_relax();
+ barrier();
+ }
+ }
+ }
+}
+
+void __init migration_init(void)
+{
+ int cpu;
+
+ for (cpu = 0; cpu < smp_num_cpus; cpu++)
+ if (kernel_thread(migration_thread, NULL,
+ CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
+ BUG();
+
+ migration_mask = (1 << smp_num_cpus) -1;
+
+ for (cpu = 0; cpu < smp_num_cpus; cpu++)
+ while (!cpu_rq(cpu)->migration_thread)
+ yield();
+ if (migration_mask)
+ BUG();
+}
+#endif
--- linux/kernel/ksyms.c.orig Thu Feb 21 10:24:02 2002
+++ linux/kernel/ksyms.c Thu Feb 21 14:54:32 2002
@@ -443,7 +443,6 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
-EXPORT_SYMBOL(__wake_up_sync);
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
@@ -458,6 +457,9 @@
EXPORT_SYMBOL(set_user_nice);
EXPORT_SYMBOL(task_nice);
EXPORT_SYMBOL_GPL(idle_cpu);
+#ifdef CONFIG_SMP
+EXPORT_SYMBOL_GPL(set_cpus_allowed);
+#endif
EXPORT_SYMBOL(jiffies);
EXPORT_SYMBOL(xtime);
EXPORT_SYMBOL(do_gettimeofday);
--- linux/include/linux/sched.h.orig Thu Feb 21 10:23:49 2002
+++ linux/include/linux/sched.h Thu Feb 21 14:51:25 2002
@@ -150,8 +150,7 @@
extern void update_one_process(struct task_struct *p, unsigned long user,
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
-extern void sched_task_migrated(struct task_struct *p);
-extern void smp_migrate_task(int cpu, task_t *task);
+extern void migration_init(void);
extern unsigned long cache_decay_ticks;


@@ -286,6 +285,10 @@

wait_queue_head_t wait_chldexit; /* for wait4() */
struct completion *vfork_done; /* for vfork() */
+
+ list_t migration_list;
+ struct semaphore migration_sem;
+
unsigned long rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
@@ -382,7 +385,12 @@
*/
#define _STK_LIM (8*1024*1024)

+#if CONFIG_SMP
extern void set_cpus_allowed(task_t *p, unsigned long new_mask);
+#else
+# define set_cpus_allowed(p, new_mask) do { } while (0)
+#endif
+
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
@@ -460,7 +468,6 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
-extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
signed long timeout));
@@ -474,13 +481,9 @@
#define wake_up(x) __wake_up((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1)
#define wake_up_nr(x, nr) __wake_up((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, nr)
#define wake_up_all(x) __wake_up((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 0)
-#define wake_up_sync(x) __wake_up_sync((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1)
-#define wake_up_sync_nr(x, nr) __wake_up_sync((x),TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
-#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
-#define wake_up_interruptible_sync_nr(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, nr)
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
--- linux/include/linux/init_task.h.orig Thu Feb 21 11:52:54 2002
+++ linux/include/linux/init_task.h Thu Feb 21 14:51:35 2002
@@ -52,6 +52,8 @@
mm: NULL, \
active_mm: &init_mm, \
run_list: LIST_HEAD_INIT(tsk.run_list), \
+ migration_list: LIST_HEAD_INIT(tsk.migration_list), \
+ migration_sem: __MUTEX_INITIALIZER(tsk.migration_sem), \
time_slice: HZ, \
next_task: &tsk, \
prev_task: &tsk, \
--- linux/include/asm-i386/hw_irq.h.orig Thu Feb 21 13:18:29 2002
+++ linux/include/asm-i386/hw_irq.h Thu Feb 21 14:51:25 2002
@@ -35,14 +35,13 @@
* into a single vector (CALL_FUNCTION_VECTOR) to save vector space.
* TLB, reschedule and local APIC vectors are performance-critical.
*
- * Vectors 0xf0-0xf9 are free (reserved for future Linux use).
+ * Vectors 0xf0-0xfa are free (reserved for future Linux use).
*/
#define SPURIOUS_APIC_VECTOR 0xff
#define ERROR_APIC_VECTOR 0xfe
#define INVALIDATE_TLB_VECTOR 0xfd
#define RESCHEDULE_VECTOR 0xfc
-#define TASK_MIGRATION_VECTOR 0xfb
-#define CALL_FUNCTION_VECTOR 0xfa
+#define CALL_FUNCTION_VECTOR 0xfb

/*
* Local APIC timer IRQ vector is on a different priority level,
--- linux/arch/i386/kernel/i8259.c.orig Thu Feb 21 13:18:08 2002
+++ linux/arch/i386/kernel/i8259.c Thu Feb 21 13:18:16 2002
@@ -79,7 +79,6 @@
* through the ICC by us (IPIs)
*/
#ifdef CONFIG_SMP
-BUILD_SMP_INTERRUPT(task_migration_interrupt,TASK_MIGRATION_VECTOR)
BUILD_SMP_INTERRUPT(reschedule_interrupt,RESCHEDULE_VECTOR)
BUILD_SMP_INTERRUPT(invalidate_interrupt,INVALIDATE_TLB_VECTOR)
BUILD_SMP_INTERRUPT(call_function_interrupt,CALL_FUNCTION_VECTOR)
@@ -473,9 +472,6 @@
* IPI, driven by wakeup.
*/
set_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
-
- /* IPI for task migration */
- set_intr_gate(TASK_MIGRATION_VECTOR, task_migration_interrupt);

/* IPI for invalidation */
set_intr_gate(INVALIDATE_TLB_VECTOR, invalidate_interrupt);
--- linux/arch/i386/kernel/smp.c.orig Thu Feb 21 13:17:50 2002
+++ linux/arch/i386/kernel/smp.c Thu Feb 21 13:18:01 2002
@@ -485,35 +485,6 @@
do_flush_tlb_all_local();
}

-static spinlock_t migration_lock = SPIN_LOCK_UNLOCKED;
-static task_t *new_task;
-
-/*
- * This function sends a 'task migration' IPI to another CPU.
- * Must be called from syscall contexts, with interrupts *enabled*.
- */
-void smp_migrate_task(int cpu, task_t *p)
-{
- /*
- * The target CPU will unlock the migration spinlock:
- */
- _raw_spin_lock(&migration_lock);
- new_task = p;
- send_IPI_mask(1 << cpu, TASK_MIGRATION_VECTOR);
-}
-
-/*
- * Task migration callback.
- */
-asmlinkage void smp_task_migration_interrupt(void)
-{
- task_t *p;
-
- ack_APIC_irq();
- p = new_task;
- _raw_spin_unlock(&migration_lock);
- sched_task_migrated(p);
-}
/*
* this function sends a 'reschedule' IPI to another CPU.
* it goes straight through and wastes no time serializing

2002-02-21 14:39:10

by Erich Focht

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

Hi Paul,

well I wished this kind of feedback came at my first attempt to have a
discussion on this subject. Would have saved me lots of reboots :-) The
reaction those days was:

Ingo> your patch does not solve the problem, the situation is more
Ingo> complex. What happens if the target task is not 'current' and is
Ingo> running on some other CPU? If we send the migration interrupt then
Ingo> nothing guarantees that the task will reschedule anytime soon, so
Ingo> the target CPU will keep spinning indefinitely. There are other
Ingo> problems too, like crossing calls to set_cpus_allowed(), etc. Right
Ingo> now set_cpus_allowed() can only be used for
Ingo> the current task, and must be used by kernel code that knows what it
Ingo> does.

and later:

Ingo> well, there is a way, by fixing the current mechanism. But since
Ingo> nothing uses it currently it won't get much testing. I only pointed
Ingo> out that the patch does not solve some of the races.

So I kept Ingo's design idea of sending IPIs. And I made it survive
crossing calls and avoid spinning around for a long time, especially
in interrupt.


On Wed, 20 Feb 2002, Paul Jackson wrote:

> > A running task has no reason to change the CPU!
>
> well ... so ... it will check, every time slice or so,
> whether it should give up the CPU, right? That is, the
> code sched.c scheduler() does run every slice or so, right?

A task gives up the CPU in schedule() and is immediately enqueued in the
corresponding expired priority array of the _same_ runqueue. So it stays
on the same CPU! When the expired array is selected as the active one the
task will run again on a CPU which no longer fits its cpus_allowed
mask. That's a problem for Kimi's CPU hotplug, I guess...
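
(For reference, the array switch in schedule() looks roughly like this,
quoted from memory rather than from the patches in this thread; it is why
an expired task reappears on the same runqueue, i.e. the same CPU:)

    array = rq->active;
    if (unlikely(!array->nr_active)) {
            /* active array empty: just swap active and expired */
            rq->active = rq->expired;
            rq->expired = array;
            array = rq->active;
    }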


> > ... The alternative to sending an IPI is to put additional
> > code into schedule() which checks the cpus_allowed mask. This
> > is not desirable, I guess.
>
> Well ... not exactly "check the cpus_allowed mask" because that
> mask is no longer something we can change from another process,
> with Ingo's latest scheduler.

The cpus_allowed mask still has its old meaning: load_balance() checks
whether it is allowed to migrate the task to its own CPU.
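
(Roughly, the condition load_balance() applies before stealing a task p
onto this_cpu has this shape; the real check also looks at cache decay
and at whether p is currently running:)

    if (!(p->cpus_allowed & (1UL << this_cpu)))
            continue;       /* not allowed here, leave p where it is */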

I didn't dare to extend the schedule() code for a seldom used
feature, especially after I saw how happy people were with lat_ctx
results getting lower with each new patch from Ingo.


> Rather check some other task field. Say we had a
> "proposed_cpus_allowed" field of the task_struct, settable by
> any process (perhaps including an associated lock, if need be).
>
> Say this field was normally set to zero, but could be
> set to some non-zero value by some other process wanting
> to change the current process cpus_allowed. Then in the
> scheduler, at the very bottom of schedule(), check to see if
> current->proposed_cpus_allowed is non-zero. If it is, use
> set_cpus_allowed() to set cpus_allowed to the proposed value.

You mean after switching context? That increases the cost of schedule(),
too. But you are right: then you are at least in the happy case of needing
to change the current task.


> How about not changing anything of the target task synchronously,
> except for some new "proposed_cpus_allowed" field. Set that
> field and get out of there. Let the target process run the
> set_cpus_allowed() routine on itself, next time it passes through
> the schedule() code. Leave it the case that the set_cpus_allowed()
> routine can only be run on the current process.

Yes, it is the simplest solution, I agree. If Ingo wants to make the
schedule() path longer.


> > Multiple calls can and will occur if you allow users to change the
> > cpu masks of their processes. And the feature was easy to get by just
> > replicating the spinlocks.
>
> Isn't it customary in Linux kernel work to minimize the number
> of locks -- avoid replication unless there is a genuine need
> (a hot lock, say) to replicate locks? We don't routinely add
> features because they are easy -- we add them because they are
> needed.

From Ingo's initial reaction I interpreted it as a desirable feature.

All I want is a working solution and I hope this discussion will show that
there is demand for it and set_cpus_allowed() will be fixed, one way or
another...

Best regards,

Erich Focht

---
NEC European Supercomputer Systems, European HPC Technology Center



2002-02-21 15:01:52

by Ingo Molnar

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks


On Thu, 21 Feb 2002, Erich Focht wrote:

> well I wished this kind of feedback came at my first attempt to have
> a discussion on this subject. Would have saved me lots of reboots :-)

i did not find out about the most fundamental problem until today, when i
tried to merge your patch. Much of Linux's development is trial & error, i
do that all the time as well.

> The reaction those days was:
>
> Ingo> your patch does not solve the problem, the situation is more
> Ingo> complex. What happens if the target task is not 'current' and is
> Ingo> running on some other CPU? If we send the migration interrupt then
> Ingo> nothing guarantees that the task will reschedule anytime soon, so
> Ingo> the target CPU will keep spinning indefinitely. There are other
> Ingo> problems too, like crossing calls to set_cpus_allowed(), etc. Right
> Ingo> now set_cpus_allowed() can only be used for
> Ingo> the current task, and must be used by kernel code that knows what it
> Ingo> does.
>
> and later:
>
> Ingo> well, there is a way, by fixing the current mechanism. But since
> Ingo> nothing uses it currently it won't get much testing. I only pointed
> Ingo> out that the patch does not solve some of the races.
>
> So I kept Ingo's design idea of sending IPIs. And I made it survive
> crossing calls and avoid spinning around for a long time, especially
> in interrupt.

the fundamental problem is with wait_task_inactive() called from IRQ
contexts - it can spin indefinitely. The above comments list some of the
other problems, and those can indeed be solved by extending the IPI
delivery mechanism, which you did.

migration threads are a variation of the current scheduler as well - the
schedule()/wakeup() hotpath did not have to be touched.

Ingo

2002-02-21 15:16:52

by Erich Focht

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

Ingo,

thanks for the patch, I am glad to have an "official" solution! Though
I'll wait for your 2.4.X patch to give it a try on Itanium SMP
systems. But this lazy migration scheme looks nice and safe.

Best regards,
Erich


On Thu, 21 Feb 2002, Ingo Molnar wrote:

>
> On Thu, 21 Feb 2002, Erich Focht wrote:
>
> > well I wished this kind of feedback came at my first attempt to have
> > a discussion on this subject. Would have saved me lots of reboots :-)
>
> i did not find out about the most fundamental problem until today, when i
> tried to merge your patch. Much of Linux's development is trial & error, i
> do that all the time as well.
>
> > The reaction those days was:
> >
> > Ingo> your patch does not solve the problem, the situation is more
> > Ingo> complex. What happens if the target task is not 'current' and is
> > Ingo> running on some other CPU? If we send the migration interrupt then
> > Ingo> nothing guarantees that the task will reschedule anytime soon, so
> > Ingo> the target CPU will keep spinning indefinitely. There are other
> > Ingo> problems too, like crossing calls to set_cpus_allowed(), etc. Right
> > Ingo> now set_cpus_allowed() can only be used for
> > Ingo> the current task, and must be used by kernel code that knows what it
> > Ingo> does.
> >
> > and later:
> >
> > Ingo> well, there is a way, by fixing the current mechanism. But since
> > Ingo> nothing uses it currently it won't get much testing. I only pointed
> > Ingo> out that the patch does not solve some of the races.
> >
> > So I kept Ingo's design idea of sending IPIs. And I made it survive
> > crossing calls and avoid spinning around for a long time, especially
> > in interrupt.
>
> the fundamental problem is with wait_task_inactive() called from IRQ
> contexts - it can spin indefinitely. The above comments list some of the
> other problems, and those can indeed be solved by extending the IPI
> delivery mechanism, which you did.
>
> migration threads are a variation of the current scheduler as well - the
> schedule()/wakeup() hotpath did not have to be touched.
>
> Ingo

2002-02-21 21:05:36

by Paul Jackson

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

Well, I can't claim to entirely understand Ingo's latest changes
for cpus_allowed and migration, but it seems that Erich managed
to provoke Ingo into coming up with something that likely meets
my modest needs and then some.

If this sticks, then good. Mission accomplished.

Thanks Ingo, Erich, Kimio, Robert, ...

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2002-02-21 21:25:59

by Paul Jackson

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

Ingo wrote:
> The concept is the following: there are new per-CPU
> system threads (so-called migration threads) that handle
> a per-runqueue 'migration queue'.

Thanks, Ingo.

Could you, or some other kind soul who understands this, explain
why the following alternative for migrating processes currently
running on some other cpu wouldn't have been better
(simpler and sufficient):

- add another variable to the task_struct, say "eviction_notice"
- when it comes time to tell a process to migrate, set this
eviction_notice (and then continue without waiting)
- rely on code that fires each time slice on the cpu currently
hosting the evicted process to notice the eviction notice and
serve it (migrate the process instead of giving it yet another
slice).
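
A rough, hypothetical sketch of that tick-time check, treating
eviction_notice as the proposed mask (in the spirit of the earlier
proposed_cpus_allowed idea; none of this is from a posted patch):

    /* hypothetical: run from the timer tick for the task p owning
     * this slice on CPU 'cpu' */
    if (p->eviction_notice && !(p->eviction_notice & (1UL << cpu))) {
            p->cpus_allowed = p->eviction_notice;
            p->eviction_notice = 0;
            /* don't grant another slice here; the actual move happens
             * at the next schedule(), not from IRQ context */
            p->need_resched = 1;
    }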

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2002-02-22 07:14:25

by Ingo Molnar

Subject: Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks


On Thu, 21 Feb 2002, Paul Jackson wrote:

> Could you, or some other kind soul who understands this, explain
> why the following alternative for migrating processes currently running
> on some other cpu wouldn't have been better (simpler and sufficient):
>
> - add another variable to the task_struct, say "eviction_notice"
> - when it comes time to tell a process to migrate, set this
> eviction_notice (and then continue without waiting)
> - rely on code that fires each time slice on the cpu currently
> hosting the evicted process to notice the eviction notice and
> serve it (migrate the process instead of giving it yet another
> slice).

well, how do you migrate the task from an IRQ context? This is the
fundamental question. A task can only be removed from the runqueue if it's
not running currently. (the only exception is schedule() itself.)

even if it worked, this adds a 1/(2*HZ) average latency to the migration.
I wanted to have something that is capable of migrating tasks 'instantly'.
(well, as 'instantly' as soon as the migrated task can be preempted.) I also
wanted to have a set_cpus_allowed() interface that guarantees that the
task has migrated to the valid set.

i'm doing one more variation wrt. migration threads: instead of the
current method of waking up the target CPU's migration thread and 'kicking
off' the migrated task from its current CPU, it's much cleaner to do the
following:

- every migration thread has maximum priority.
- in set_cpus_allowed() we wake up the migration thread on the *source*
CPU.
- the migration thread preempts the migrated thread on the source CPU -
and it can just remove the migrated task from the local runqueue. (if
the task is still there at that point.)

this is even simpler and does not change p->state in any way.
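
Roughly, the flow described above might be sketched as follows; the request
structure, the queue/wait helpers and the task_cpu() accessor are assumptions
for illustration, not the code Ingo actually posted:

/*
 * Illustrative sketch of the source-CPU migration-thread flow.
 * Helper names are assumptions; the real 2.5 code differs in detail.
 */
struct migration_req {
        struct list_head list;
        struct task_struct *task;
};

void set_cpus_allowed(struct task_struct *p, unsigned long new_mask)
{
        int src_cpu = task_cpu(p);              /* wherever the task last ran */
        struct migration_req req = { .task = p };

        p->cpus_allowed = new_mask;
        if (new_mask & (1UL << src_cpu))
                return;                         /* already on an allowed CPU */

        /*
         * Queue the request and wake the maximum-priority migration
         * thread on the *source* CPU; being highest priority, it
         * preempts p there, so p is no longer running and can simply
         * be dequeued and activated on an allowed CPU.
         */
        queue_migration_request(src_cpu, &req);              /* hypothetical */
        wake_up_process(per_cpu_migration_thread(src_cpu));  /* hypothetical lookup */
        wait_for_migration(&req);               /* hypothetical: the task has
                                                   moved by the time this returns */
}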

Ingo

2002-02-22 19:46:13

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Fri, 22 Feb 2002, Ingo Molnar wrote:
>
> well, how do you migrate the task from an IRQ context?

Let it remove itself, in due time.

> even if it worked, this adds a 1/(2*HZ) average latency to the migration.
> I wanted to have something that is capable of migrating tasks 'instantly'.

That's the key question, from my perspective.

I was not aware of any need for instant migration, and to
the best of my knowledge, a clock tick latency was just fine.
But if your intuition suggests it's worth supporting instant
migration, go for it.

The main concern I had with Erich Focht's patches that also
attempted instant migration was that Ingo wouldn't agree.
Somehow, I don't think that concern applies to your patches.

Thanks, Ingo! I'm delighted with the outcome.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2002-02-23 02:48:11

by Rusty Russell

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Wed, 20 Feb 2002 17:40:56 -0800
Kimio Suganuma <[email protected]> wrote:

> Hi all,
>
> On Wed, 20 Feb 2002 17:12:21 -0800
> Paul Jackson <[email protected]> wrote:
>
> > > Another problem is the right moment to change the cpu field of the
> > > task. ... The IPI to the target CPU is the same as in the
> > > initial design of Ingo. It has to wait for the task to unschedule and
> > > knows it will find it dequeued.
> >
> > How about not changing anything of the target task synchronously,
> > except for some new "proposed_cpus_allowed" field. Set that
> > field and get out of there. Let the target process run the
> > set_cpus_allowed() routine on itself, next time it passes through
> > the schedule() code. Leave it the case that the set_cpus_allowed()
> > routine can only be run on the current process.
> >
> > Perhaps others need this cpus_allowed change (and the migration
> > of a task to a different allowed cpu) to happen "right away".
> > But I don't, and I don't (yet) know that anyone else does.
>
> CPU hotplug needs to change cpus_allowed within a definite time.
> When a process is sleeping for 100000 seconds, how can we offline
> the CPU the process belongs to?

Hi All,

I've been working on hotplug with the new scheduler and preempt.
It still crashes, so no public code yet 8). It follows Linus's suggestion
that cpu_up() be called during the boot process, which is nicer.

But this particular problem did not hurt me, since I now run a
"suicide" thread on the dying CPU, making it safe and trivial to manually
re-queue processes. What to do with processes that cannot be scheduled on
any remaining CPU is another interesting question.

(Note: downing a CPU with a suicide thread introduces cleanup issues; I may
yet return to the original implementation where suicide thread == idle thread.)
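
For illustration only - Rusty's code was not public at this point - a
"suicide" thread along these lines might look roughly like this, with all
helper names being assumptions:

/*
 * Rough sketch of a "suicide" thread bound to the dying CPU.
 * All helper names are assumptions; this is not Rusty's actual code.
 */
static int suicide_thread(void *arg)
{
        int dying_cpu = (long)arg;
        struct task_struct *p;

        /*
         * Bind this thread to the dying CPU and run at high priority:
         * once it is running there, no other task on that CPU is
         * currently executing, so re-queueing them is safe.
         */
        bind_to_cpu(current, dying_cpu);                /* hypothetical */

        for_each_task(p) {                              /* 2.4/2.5-era task list walk */
                if (task_cpu(p) != dying_cpu || p == current)
                        continue;
                if (p->cpus_allowed & ~(1UL << dying_cpu))
                        requeue_on_allowed_cpu(p);      /* hypothetical */
                else
                        handle_stranded_task(p);        /* the open policy question */
        }

        mark_cpu_offline(dying_cpu);                    /* hypothetical */
        return 0;                                       /* the thread dies with the CPU */
}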

Hope that helps,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-02-25 12:30:40

by Dipankar Sarma

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Sat, Feb 23, 2002 at 01:47:43PM +1100, Rusty Russell wrote:
> But this particular problem did not hurt me, since I now run a
> "suicide" thread on the dying CPU, making it safe and trivial to manually
> re-queue processes. What to do with processes that cannot be scheduled on
> any remaining CPU is another interesting question.

If these are processes that are bound to the CPU to be shut down,
wouldn't it make sense to fail the CPU shutdown operation? If you
are giving the user enough control to make CPU affinity decisions,
they had better know how to clean up before shutting down a CPU.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-02-25 20:02:50

by Bill Davidsen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Wed, 20 Feb 2002, Paul Jackson wrote:

> I see three levels of synchronization possible here:
>
> 1) As Erich did, use IPI to get immediate application
>
> 2) Wakeup the target task, so that it will "soon" see the
> cpus_allowed change, but don't bother with the IPI, or
>
> 3) Make no effort to expedite notification, and let the
> target notice its changed cpus_allowed when (and if)
> it ever happens to be scheduled active again.

(3) looks good; if the process isn't running, it makes little difference
which CPU it doesn't run on. It would probably be enough to ensure that if
it is scheduled active the change is noted at that time. "Eventually" is a
good time to do this, since it passes through the schedule code anyway.
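
A sketch of option (3), purely lazy migration checked only when the task
next passes through the scheduler; the move_task_to_cpu() helper is an
assumption used here only for illustration:

/*
 * Sketch of lazy migration: checked on the way through schedule(),
 * just before 'p' would be given the CPU.
 */
static inline void lazy_migration_check(struct task_struct *p)
{
        int this_cpu = smp_processor_id();

        if (p->cpus_allowed & (1UL << this_cpu))
                return;                 /* still allowed here, nothing to do */

        /*
         * The task is about to run on a CPU it is no longer allowed on:
         * move it to the first allowed CPU before giving it the slice.
         */
        move_task_to_cpu(p, __ffs(p->cpus_allowed));    /* hypothetical helper */
}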

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-26 04:00:36

by Paul Jackson

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Mon, 25 Feb 2002, Dipankar Sarma wrote:
>
> If these are processes that are bound to the CPU to be shut down,
> wouldn't it make sense to fail the CPU shut down operation ? If you
> are giving enough control to the user to make CPU affinity decisions,
> they better know how to cleanup before shutting down a CPU.

I can imagine some users (applications) wanting to insist on
staying on a particular CPU (Pike's Peak or Bust), and some
content to be migrated automatically, and some wanting to
receive and act on requests to migrate.

One of these policies might be default, with others as options.

Some CPU shut down operations _can't_ fail ... if they are motivated
say by hardware about to fail.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2002-02-27 01:40:11

by Rusty Russell

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH] O(1) scheduler set_cpus_allowed for non-current tasks

On Mon, 25 Feb 2002 19:59:49 -0800
Paul Jackson <[email protected]> wrote:

> On Mon, 25 Feb 2002, Dipankar Sarma wrote:
> >
> > If these are processes that are bound to the CPU to be shut down,
> > wouldn't it make sense to fail the CPU shut down operation ? If you
> > are giving enough control to the user to make CPU affinity decisions,
> > they better know how to cleanup before shutting down a CPU.
>
> I can imagine some users (applications) wanting to insist on
> staying on a particular CPU (Pike's Peak or Bust), and some
> content to be migrated automatically, and some wanting to
> receive and act on requests to migrate.
>
> One of these policies might be default, with others as options.
>
> Some CPU shut down operations _can't_ fail ... if they are motivated
> say by hardware about to fail.

Exactly. If I run the RC5 challenge, one per cpu (using a mythical
oncpu(1) program, say), I'd be very upset if my whole machine died because
it wouldn't take down a faulty CPU!

I think SIGPWR is appropriate here. If that doesn't work, then SIGKILL.
Of course, ksoftirqd is a special case, etc.
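
As a rough sketch of that policy (the per-CPU thread test and the
grace-period check are assumptions, not existing kernel interfaces):

/*
 * Sketch of the signalling policy suggested above for tasks that can
 * only run on the CPU being taken down.
 */
static void evict_stranded_task(struct task_struct *p, int dying_cpu)
{
        /* tasks allowed on some other CPU can simply be migrated */
        if (p->cpus_allowed & ~(1UL << dying_cpu))
                return;

        /* per-CPU kernel threads such as ksoftirqd need special handling */
        if (is_per_cpu_kernel_thread(p))                /* hypothetical */
                return;

        /* ask politely first; a cooperating task can rebind or exit */
        send_sig(SIGPWR, p, 1);
        if (!exited_or_rebound_within_grace(p))         /* hypothetical */
                send_sig(SIGKILL, p, 1);
}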

Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.