LinuxLists.cc - [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

On 14 Jun 2002, Robert Love wrote:

> > Ummm what is with all of those switch_mm() hacks? Is this an attempt
> > to work around the locking problems? Please don't do that as it is
> > going to kill performance and having ifdef sparc64 sched.c changes is
> > ugly to say the least.
> >
> > Ingo posted the correct fix to the locking problem with the patch
> > he posted the other day, that is what should go into the -ac patches.
>
> I am explicitly refraining from sending Alan any code that is not
> well-tested in 2.5 and my machines first. As Ingo's new switch_mm()
> bits are not even in 2.5 yet, [...]

Linus applied them already, they will be in 2.5.22. They fix real bugs and
i've seen no problems on my testboxes. Those bits are a must for SMP x86
and Sparc64 as well, there is absolutely no reason to selectively delay
their backmerge. Besides the last task_rq_lock() optimization which got
undone in 2.5 already, all the recent scheduler bits i posted are needed.

Ingo

2002-06-16 17:03:19

[permalink] [raw]

Subject: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

the attached patch, sched-2.4.19-pre10-ac2-A3, is a backport of the
current 2.5 O(1) scheduler, against 2.4.19-pre10-ac2. The patch includes
all the recent fixes. (It should not break any architecture that was
working before on -ac. The most affected architecture is Sparc64, i added
the bits without testing them. David?)

The patch can also be downloaded from:

http://redhat.com/~mingo/O(1)-scheduler/sched-2.4.19-pre10-ac2-A3

Changes relative to 2.4.19-pre10-ac2:

Bugfixes:

- rq-frozen fixes, which closes SMP races on x86 and Sparc64 as well.

- O(1) scheduling sched_yield() fixes: do not starve CPU-intensive
processes.

- migration bugfix, do not fast-migrate the task incorrectly if the task
is in the middle of load_balance().

- sync wakeup reintroduction - this should fix the pipe latency problems
observed.

Feature backports:

- nr_uninterruptible optimization. (This is a fairly straightforward
and risk-less feature, and since it also made the backport easier, i
included it.)

- sched_setaffinity() & sched_getaffinity() syscalls on x86.

plus identity changes, comment updates, to bring sched.c in line with the
2.5 version.

the patch was tested on x86 UP and SMP boxes.

Ingo

Attachments:

sched-2.4.19-pre10-ac2-A3 (22.25 kB)

2002-06-16 23:51:02

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

On Sun, 2002-06-16 at 08:19, Ingo Molnar wrote:

> Linus applied them already, they will be in 2.5.22. They fix real bugs and
> i've seen no problems on my testboxes. Those bits are a must for SMP x86
> and Sparc64 as well, there is absolutely no reason to selectively delay
> their backmerge. Besides the last task_rq_lock() optimization which got
> undone in 2.5 already, all the recent scheduler bits i posted are needed.

I know they are fine (I looked over them) and I saw Linus took them, but
2.5.22 is not yet out and I did not see any reason to rush to new bits
to Alan for 2.4 when we could wait a bit and make sure 2.5 proves them
fine...

My approach thus far with 2.5 -> 2.4 O(1) backports has been one of
caution and it has worked fine thus far. I figure, what is the rush?

Robert Love

2002-06-16 23:57:44

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Sun, 2002-06-16 at 10:00, Ingo Molnar wrote:

> Feature backports:
>
> - nr_uninterruptible optimization. (This is a fairly straightforward
> and risk-less feature, and since it also made the backport easier, i
> included it.)

Yah, I agree - this is safe and good.

> - sched_setaffinity() & sched_getaffinity() syscalls on x86.

Do we want to introduce this into 2.4 now? I realize 2.4-ac is not 2.4
proper, but if there is a chance this interface could change...

> - BUG_ON(in_interrupt());
> -
> + if (unlikely(in_interrupt()))
> + BUG();

Eh, why do this? BUG_ON is the same effect and it is more readable to
me... seems better that 2.5 gets 2.4-ac's behavior instead of the other
way around.

> +int idle_cpu(int cpu)
> +{
> + return cpu_curr(cpu) == cpu_rq(cpu)->idle;
> +}
> +

I did not include this in my original O(1) backport update because
nothing in 2.4-ac seems to use it... so why include it?

> /*
> * Valid priorities for SCHED_FIFO and SCHED_RR are
> - * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_OTHER is 0.
> + * 1..MAX_USER_RT_PRIO, valid priority for SCHED_OTHER is 0.
> */

Another case of 2.4-ac being right: the priority range is
1..MAX_USER_RT_PRIO-1 (i.e. 1 to 99, inclusive).

> /*
> - * The first migration thread is started on CPU #0. This one can
> - * migrate the other migration threads to their destination CPUs.
> + * The first migration thread is started on CPU #0. This one can migrate
> + * the other migration threads to their destination CPUs.
> */
> if (cpu != 0) {
> while (!cpu_rq(cpu_logical_map(0))->migration_thread)
> yield();
> set_cpus_allowed(current, 1UL << cpu);
> }
> - printk("migration_task %d on cpu=%d\n", cpu, smp_processor_id());
> + printk("migration_task %d on cpu=%d\n",cpu,smp_processor_id());
> ret = setscheduler(0, SCHED_FIFO, &param);
> rq = this_rq();
> @@ -1632,5 +1813,4 @@
> while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
> schedule_timeout(2);
> }
> -
> -#endif /* CONFIG_SMP */
> +#endif

I think all three of these hunks look better in 2.4-ac... in all three
cases, the formatting seems better than in 2.5 IMO.

Robert Love

2002-06-17 00:14:00

by J.A. Magallon

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On 2002.06.17 Robert Love wrote:
>On Sun, 2002-06-16 at 10:00, Ingo Molnar wrote:
>
>> +int idle_cpu(int cpu)
>> +{
>> + return cpu_curr(cpu) == cpu_rq(cpu)->idle;
>> +}
>> +
>
>I did not include this in my original O(1) backport update because
>nothing in 2.4-ac seems to use it... so why include it?
>

Well, you asked...

- the irqbalance patch for p4 needs idle_cpu (and not sure about idle_task).
BTW, they were macros before...
- the bproc patch needs task_nice (you can be less interested in this, but
it does not hurt...)

So could I ask you, please
- to make public idle_[cpu,task], as macros or exported functions, here it
does not matter, irqbalance is not a module. Perhaps some other piece of code
could need them.
- to export all the set/get prio/nice interfaces

???

Thanks.

--
J.A. Magallon \ Software is like sex: It's better when it's free
mailto:[email protected] \ -- Linus Torvalds, FSF T-shirt
Linux werewolf 2.4.19-pre10-jam3, Mandrake Linux 8.3 (Cooker) for i586
gcc (GCC) 3.1.1 (Mandrake Linux 8.3 3.1.1-0.4mdk)

2002-06-17 00:15:41

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Sun, 2002-06-16 at 16:57, Robert Love wrote:

> Another case of 2.4-ac being right

Attached patch brings over the sane bits from 2.4-ac: i.e. if Linus
merges this and Alan merges your patch minus my complaints, the two
trees will be in sync...

Robert Love

diff -urN linux-2.5.21/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.21/kernel/sched.c Sat Jun 8 22:28:13 2002
+++ linux/kernel/sched.c Sun Jun 16 17:14:31 2002
@@ -762,8 +762,8 @@
list_t *queue;
int idx;

- if (unlikely(in_interrupt()))
- BUG();
+ BUG_ON(in_interrupt());
+
#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
#endif
@@ -1147,7 +1147,7 @@

/*
* Valid priorities for SCHED_FIFO and SCHED_RR are
- * 1..MAX_USER_RT_PRIO, valid priority for SCHED_OTHER is 0.
+ * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_OTHER is 0.
*/
retval = -EINVAL;
if (lp.sched_priority < 0 || lp.sched_priority > MAX_USER_RT_PRIO-1)
@@ -1710,15 +1710,15 @@
sigfillset(&current->blocked);
set_fs(KERNEL_DS);
/*
- * The first migration thread is started on CPU #0. This one can migrate
- * the other migration threads to their destination CPUs.
+ * The first migration thread is started on CPU #0. This one can
+ * migrate the other migration threads to their destination CPUs.
*/
if (cpu != 0) {
while (!cpu_rq(cpu_logical_map(0))->migration_thread)
yield();
set_cpus_allowed(current, 1UL << cpu);
}
- printk("migration_task %d on cpu=%d\n",cpu,smp_processor_id());
+ printk("migration_task %d on cpu=%d\n", cpu, smp_processor_id());
ret = setscheduler(0, SCHED_FIFO, &param);

rq = this_rq();
@@ -1790,4 +1790,4 @@
while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
schedule_timeout(2);
}
-#endif
+#endif /* CONFIG_SMP */

2002-06-17 03:26:33

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On 16 Jun 2002, Robert Love wrote:

> > +int idle_cpu(int cpu)
> > +{
> > + return cpu_curr(cpu) == cpu_rq(cpu)->idle;
> > +}
> > +
>
> I did not include this in my original O(1) backport update because
> nothing in 2.4-ac seems to use it... so why include it?

i have planned to submit the irqbalance patch for 2.4-ac real soon, which
needs this function - current IRQ distribution on P4 SMP boxes is a
showstopper.

> > - sched_setaffinity() & sched_getaffinity() syscalls on x86.
>
> Do we want to introduce this into 2.4 now? I realize 2.4-ac is not 2.4
> proper, but if there is a chance this interface could change...

the setaffinity()/getaffinity() interface looks pretty robust, i dont
expect any changes - there's just so many ways to set an affinity mask for
an opaque set of CPUs. And being able to set affinities is something that
was frequently asked for by application developers.

> > - BUG_ON(in_interrupt());
> > -
> > + if (unlikely(in_interrupt()))
> > + BUG();
>
> Eh, why do this? BUG_ON is the same effect and it is more readable to
> me... seems better that 2.5 gets 2.4-ac's behavior instead of the other
> way around.

IMO BUG_ON() is just an ugly way of doing an assert(), i dont like code
with magic conditionals embedded within. But, the main reason was that
2.5-mainline has the code so that's being used.

> > /*
> > * Valid priorities for SCHED_FIFO and SCHED_RR are
> > - * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_OTHER is 0.
> > + * 1..MAX_USER_RT_PRIO, valid priority for SCHED_OTHER is 0.
> > */
>
> Another case of 2.4-ac being right: the priority range is
> 1..MAX_USER_RT_PRIO-1 (i.e. 1 to 99, inclusive).

like above, 2.5 is the reference base. Especially for 100% nonfunctional
things like this it makes no sense to apply them to 2.4-ac only. But i
agree that existing comment fixes should be forward ported into 2.5, i've
applied them to my tree.

Ingo

2002-06-17 03:35:25

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Sun, 2002-06-16 at 20:24, Ingo Molnar wrote:

> On 16 Jun 2002, Robert Love wrote:
>
> > > +int idle_cpu(int cpu)
> > > +{
> > > + return cpu_curr(cpu) == cpu_rq(cpu)->idle;
> > > +}
> > > +
> >
> > I did not include this in my original O(1) backport update because
> > nothing in 2.4-ac seems to use it... so why include it?
>
> i have planned to submit the irqbalance patch for 2.4-ac real soon, which
> needs this function - current IRQ distribution on P4 SMP boxes is a
> showstopper.

Fair enough.

> > > - sched_setaffinity() & sched_getaffinity() syscalls on x86.
> >
> > Do we want to introduce this into 2.4 now? I realize 2.4-ac is not 2.4
> > proper, but if there is a chance this interface could change...
>
> the setaffinity()/getaffinity() interface looks pretty robust, i dont
> expect any changes - there's just so many ways to set an affinity mask for
> an opaque set of CPUs. And being able to set affinities is something that
> was frequently asked for by application developers.

I agree it seems robust and there have been no complaints, although
there could always be changes to the interface. Personally I'd like the
interfaces in 2.4/2.4-ac sooner rather than later too - I just want to
make sure we do not "etch it in stone" prematurely.

> IMO BUG_ON() is just an ugly way of doing an assert(), i dont like code
> with magic conditionals embedded within. But, the main reason was that
> 2.5-mainline has the code so that's being used.

Heh I like BUG_ON :-)

> like above, 2.5 is the reference base. Especially for 100% nonfunctional
> things like this it makes no sense to apply them to 2.4-ac only. But i
> agree that existing comment fixes should be forward ported into 2.5, i've
> applied them to my tree.

I agree the changes are nonfunctional and thus not a big deal...but I
didn't see a point in pushing erroneous changes onto 2.4-ac, whether
they are in 2.5 or not.

Although now it is all a moot point - Linus merged the patch I posted
earlier with the 2.4-ac bits against 2.5... so now a diff of 2.4-ac and
2.5 will be proper. ;-)

Robert Love

2002-06-17 03:51:22

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

i agree with the comment fixes, except these items:

> - if (unlikely(in_interrupt()))
> - BUG();
> + BUG_ON(in_interrupt());
> +

see the previous mail.

> @@ -1790,4 +1790,4 @@
> while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
> schedule_timeout(2);
> }
> -#endif
> +#endif /* CONFIG_SMP */

and this is just silly... I can see the point in doing #if comments in
include files, but the nesting here is just so obvious.

the rest looks fine. (patch of my current 2.5 scheduler tree attached,
against 2.5.22, with some more other nonfunctional bits added as well.)

Ingo

--- linux/kernel/sched.c.orig Mon Jun 17 05:43:53 2002
+++ linux/kernel/sched.c Mon Jun 17 05:42:03 2002
@@ -6,14 +6,14 @@
* Copyright (C) 1991-2002 Linus Torvalds
*
* 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and
- * make semaphores SMP safe
+ * make semaphores SMP safe
* 1998-11-19 Implemented schedule_timeout() and related stuff
* by Andrea Arcangeli
* 2002-01-04 New ultra-scalable O(1) scheduler by Ingo Molnar:
- * hybrid priority-list and round-robin design with
- * an array-switch method of distributing timeslices
- * and per-CPU runqueues. Additional code by Davide
- * Libenzi, Robert Love, and Rusty Russell.
+ * hybrid priority-list and round-robin design with
+ * an array-switch method of distributing timeslices
+ * and per-CPU runqueues. Additional code by Davide
+ * Libenzi, Robert Love, and Rusty Russell.
*/

#include <linux/mm.h>
@@ -797,7 +797,8 @@
list_t *queue;
int idx;

- BUG_ON(in_interrupt());
+ if (in_interrupt())
+ BUG();

#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
@@ -1392,25 +1393,35 @@

asmlinkage long sys_sched_yield(void)
{
- runqueue_t *rq;
- prio_array_t *array;
-
- rq = rq_lock(rq);
+ runqueue_t *rq = rq_lock(rq);
+ prio_array_t *array = current->array;

/*
- * Decrease the yielding task's priority by one, to avoid
- * livelocks. This priority loss is temporary, it's recovered
- * once the current timeslice expires.
+ * There are three levels of how a yielding task will give up
+ * the current CPU:
*
- * If priority is already MAX_PRIO-1 then we still
- * roundrobin the task within the runlist.
- */
- array = current->array;
- /*
- * If the task has reached maximum priority (or is a RT task)
- * then just requeue the task to the end of the runqueue:
+ * #1 - it decreases its priority by one. This priority loss is
+ * temporary, it's recovered once the current timeslice
+ * expires.
+ *
+ * #2 - once it has reached the lowest priority level,
+ * it will give up timeslices one by one. (We do not
+ * want to give them up all at once, it's gradual,
+ * to protect the casual yield()er.)
+ *
+ * #3 - once all timeslices are gone we put the process into
+ * the expired array.
+ *
+ * (special rule: RT tasks do not lose any priority, they just
+ * roundrobin on their current priority level.)
*/
- if (likely(current->prio == MAX_PRIO-1 || rt_task(current))) {
+ if (likely(current->prio == MAX_PRIO-1)) {
+ if (current->time_slice <= 1) {
+ dequeue_task(current, rq->active);
+ enqueue_task(current, rq->expired);
+ } else
+ current->time_slice--;
+ } else if (unlikely(rt_task(current))) {
list_del(&current->run_list);
list_add_tail(&current->run_list, array->queue + current->prio);
} else {
@@ -1836,15 +1847,14 @@
int cpu;

current->cpus_allowed = 1UL << cpu_logical_map(0);
- for (cpu = 0; cpu < smp_num_cpus; cpu++) {
+ for (cpu = 0; cpu < smp_num_cpus; cpu++)
if (kernel_thread(migration_thread, (void *) (long) cpu,
CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
BUG();
- }
current->cpus_allowed = -1L;

for (cpu = 0; cpu < smp_num_cpus; cpu++)
while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
schedule_timeout(2);
}
-#endif /* CONFIG_SMP */
+#endif

2002-06-17 03:57:33

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Sun, 2002-06-16 at 20:49, Ingo Molnar wrote:

> i agree with the comment fixes, except these items:
>
> > - if (unlikely(in_interrupt()))
> > - BUG();
> > + BUG_ON(in_interrupt());
> > +
>
> see the previous mail.

Shrug. Preference I guess... though this is _the_ case for BUG_ON.

> > @@ -1790,4 +1790,4 @@
> > while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
> > schedule_timeout(2);
> > }
> > -#endif
> > +#endif /* CONFIG_SMP */
>
> and this is just silly... I can see the point in doing #if comments in
> include files, but the nesting here is just so obvious.

I disagree, but OK. I like having the #if marked by the #endif if they
are not close... and elsewhere through the kernel mirrors this. While I
can scroll up and look - assuming the nesting is sane - a simple comment
makes that clear so what is the pain?

> the rest looks fine. (patch of my current 2.5 scheduler tree attached,
> against 2.5.22, with some more other nonfunctional bits added as well.)

Rest looks fine.

Then again, this is all invariants and comments so its really not a big
deal at all. I guess better this than we are fighting over real code,
eh? ;-)

Robert Love

2002-06-17 04:02:24

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Sun, 2002-06-16 at 20:49, Ingo Molnar wrote:
smlinkage long sys_sched_yield(void)
> {
> - runqueue_t *rq;
> - prio_array_t *array;
> -
> - rq = rq_lock(rq);
> + runqueue_t *rq = rq_lock(rq);
> + prio_array_t *array = current->array;

Question. I have always wondered what the C rules are here... is
rq_lock guaranteed to be evaluated before current->array? I.e., is the
above synonymous with:

runqueue_t *rq;
prio_array_t *array;
rq = rq_lock(rq);
array = current->array;

...guaranteed?

Robert Love

2002-06-17 04:03:15

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On 16 Jun 2002, Robert Love wrote:

> > like above, 2.5 is the reference base. Especially for 100% nonfunctional
> > things like this it makes no sense to apply them to 2.4-ac only. But i
> > agree that existing comment fixes should be forward ported into 2.5, i've
> > applied them to my tree.
>
> I agree the changes are nonfunctional and thus not a big deal...but I
> didn't see a point in pushing erroneous changes onto 2.4-ac, whether
> they are in 2.5 or not.

My method is that the less differences in a merge, the better. I dont mind
if a few comment fixes are lost temporarily, they'll be noticed and
forward ported the minute they get zapped by the backport. (and i have
reviewed -ac for ac-only functional fixes, none existed.) This way the
actual code creation part of the backport was a few minutes work only -
the real work mostly involved reviewing the functional parts of the
changes.

Ingo

2002-06-17 04:09:39

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On 16 Jun 2002, Robert Love wrote:

> > > @@ -1790,4 +1790,4 @@
> > > while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
> > > schedule_timeout(2);
> > > }
> > > -#endif
> > > +#endif /* CONFIG_SMP */
> >
> > and this is just silly... I can see the point in doing #if comments in
> > include files, but the nesting here is just so obvious.
>
> I disagree, but OK. I like having the #if marked by the #endif if they
> are not close... and elsewhere through the kernel mirrors this. While I
> can scroll up and look - assuming the nesting is sane - a simple comment
> makes that clear so what is the pain?

and in this specific sched.c case, are we going to put in magic comments
every 25 lines inbetween:

/* this is CONFIG_SMP conditional code */

just to save us some scrolling up? I dont think #endif is special wrt.
such comments.

in header files the #ifdef jungle often makes proper nesting hard. In
those cases putting comments to #else and #endif makes a real difference
in readability. But in sched.c there is not a single nested #ifdef. (and
that's very much intentional.)

Ingo

2002-06-17 04:30:21

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Mon, 17 Jun 2002, J.A. Magallon wrote:

> - the irqbalance patch for p4 needs idle_cpu (and not sure about idle_task).
> BTW, they were macros before...
> - the bproc patch needs task_nice (you can be less interested in this, but
> it does not hurt...)
>
> So could I ask you, please
> - to make public idle_[cpu,task], as macros or exported functions, here it
> does not matter, irqbalance is not a module. Perhaps some other piece of code
> could need them.
> - to export all the set/get prio/nice interfaces
>
> ???

sure.

Ingo

2002-06-17 04:28:42

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On 16 Jun 2002, Robert Love wrote:

> Question. I have always wondered what the C rules are here... is
> rq_lock guaranteed to be evaluated before current->array? I.e., is the
> above synonymous with:
>
> runqueue_t *rq;
> prio_array_t *array;
> rq = rq_lock(rq);
> array = current->array;
>
> ...guaranteed?

yes. We rely on this in the kernel quite frequently. Btw., i did a few
more coding style cleanups to the scheduler, you can see more such
examples in my patch.

Ingo

--- linux/kernel/sched.c.orig Mon Jun 17 05:43:53 2002
+++ linux/kernel/sched.c Mon Jun 17 06:21:32 2002
@@ -6,14 +6,14 @@
* Copyright (C) 1991-2002 Linus Torvalds
*
* 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and
- * make semaphores SMP safe
+ * make semaphores SMP safe
* 1998-11-19 Implemented schedule_timeout() and related stuff
* by Andrea Arcangeli
* 2002-01-04 New ultra-scalable O(1) scheduler by Ingo Molnar:
- * hybrid priority-list and round-robin design with
- * an array-switch method of distributing timeslices
- * and per-CPU runqueues. Additional code by Davide
- * Libenzi, Robert Love, and Rusty Russell.
+ * hybrid priority-list and round-robin design with
+ * an array-switch method of distributing timeslices
+ * and per-CPU runqueues. Additional code by Davide
+ * Libenzi, Robert Love, and Rusty Russell.
*/

#include <linux/mm.h>
@@ -180,11 +180,14 @@
/*
* rq_lock - lock a given runqueue and disable interrupts.
*/
-static inline runqueue_t *rq_lock(runqueue_t *rq)
+static inline runqueue_t *this_rq_lock(void)
{
+ runqueue_t *rq;
+
local_irq_disable();
rq = this_rq();
spin_lock(&rq->lock);
+
return rq;
}

@@ -388,9 +391,7 @@

void wake_up_forked_process(task_t * p)
{
- runqueue_t *rq;
-
- rq = rq_lock(rq);
+ runqueue_t *rq = this_rq_lock();

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -797,7 +798,8 @@
list_t *queue;
int idx;

- BUG_ON(in_interrupt());
+ if (in_interrupt())
+ BUG();

#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
@@ -1158,13 +1160,12 @@
static int setscheduler(pid_t pid, int policy, struct sched_param *param)
{
struct sched_param lp;
+ int retval = -EINVAL;
prio_array_t *array;
unsigned long flags;
runqueue_t *rq;
- int retval;
task_t *p;

- retval = -EINVAL;
if (!param || pid < 0)
goto out_nounlock;

@@ -1251,10 +1252,9 @@

asmlinkage long sys_sched_getscheduler(pid_t pid)
{
+ int retval = -EINVAL;
task_t *p;
- int retval;

- retval = -EINVAL;
if (pid < 0)
goto out_nounlock;

@@ -1271,11 +1271,10 @@

asmlinkage long sys_sched_getparam(pid_t pid, struct sched_param *param)
{
- task_t *p;
struct sched_param lp;
- int retval;
+ int retval = -EINVAL;
+ task_t *p;

- retval = -EINVAL;
if (!param || pid < 0)
goto out_nounlock;

@@ -1310,8 +1309,8 @@
unsigned long *user_mask_ptr)
{
unsigned long new_mask;
- task_t *p;
int retval;
+ task_t *p;

if (len < sizeof(new_mask))
return -EINVAL;
@@ -1361,13 +1360,12 @@
asmlinkage int sys_sched_getaffinity(pid_t pid, unsigned int len,
unsigned long *user_mask_ptr)
{
- unsigned long mask;
unsigned int real_len;
- task_t *p;
+ unsigned long mask;
int retval;
+ task_t *p;

real_len = sizeof(mask);
-
if (len < real_len)
return -EINVAL;

@@ -1392,25 +1390,35 @@

asmlinkage long sys_sched_yield(void)
{
- runqueue_t *rq;
- prio_array_t *array;
-
- rq = rq_lock(rq);
+ runqueue_t *rq = this_rq_lock();
+ prio_array_t *array = current->array;

/*
- * Decrease the yielding task's priority by one, to avoid
- * livelocks. This priority loss is temporary, it's recovered
- * once the current timeslice expires.
+ * There are three levels of how a yielding task will give up
+ * the current CPU:
*
- * If priority is already MAX_PRIO-1 then we still
- * roundrobin the task within the runlist.
- */
- array = current->array;
- /*
- * If the task has reached maximum priority (or is a RT task)
- * then just requeue the task to the end of the runqueue:
+ * #1 - it decreases its priority by one. This priority loss is
+ * temporary, it's recovered once the current timeslice
+ * expires.
+ *
+ * #2 - once it has reached the lowest priority level,
+ * it will give up timeslices one by one. (We do not
+ * want to give them up all at once, it's gradual,
+ * to protect the casual yield()er.)
+ *
+ * #3 - once all timeslices are gone we put the process into
+ * the expired array.
+ *
+ * (special rule: RT tasks do not lose any priority, they just
+ * roundrobin on their current priority level.)
*/
- if (likely(current->prio == MAX_PRIO-1 || rt_task(current))) {
+ if (likely(current->prio == MAX_PRIO-1)) {
+ if (current->time_slice <= 1) {
+ dequeue_task(current, rq->active);
+ enqueue_task(current, rq->expired);
+ } else
+ current->time_slice--;
+ } else if (unlikely(rt_task(current))) {
list_del(&current->run_list);
list_add_tail(&current->run_list, array->queue + current->prio);
} else {
@@ -1461,9 +1469,9 @@

asmlinkage long sys_sched_rr_get_interval(pid_t pid, struct timespec *interval)
{
+ int retval = -EINVAL;
struct timespec t;
task_t *p;
- int retval = -EINVAL;

if (pid < 0)
goto out_nounlock;
@@ -1758,8 +1766,8 @@

static int migration_thread(void * bind_cpu)
{
- int cpu = cpu_logical_map((int) (long) bind_cpu);
struct sched_param param = { sched_priority: MAX_RT_PRIO-1 };
+ int cpu = cpu_logical_map((int) (long) bind_cpu);
runqueue_t *rq;
int ret;

@@ -1836,15 +1844,14 @@
int cpu;

current->cpus_allowed = 1UL << cpu_logical_map(0);
- for (cpu = 0; cpu < smp_num_cpus; cpu++) {
+ for (cpu = 0; cpu < smp_num_cpus; cpu++)
if (kernel_thread(migration_thread, (void *) (long) cpu,
CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
BUG();
- }
current->cpus_allowed = -1L;

for (cpu = 0; cpu < smp_num_cpus; cpu++)
while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
schedule_timeout(2);
}
-#endif /* CONFIG_SMP */
+#endif

2002-06-17 04:51:22

by Stephen Satchell

[permalink] [raw]

Subject: Toshiba PCToPIC97 PC Card freeze in 2.4.18

All:

I'm at my wit's end. I have a Toshiba Satellite 2545XCDT which has a PC
Card adapter. I have been happily running this laptop with a 2.2.16 kernel
without problem. Today, when trying to upgrade to a 20GB hard disk and a
2.4.18 kernel, the box would freeze when trying to start the PCMCIA
service. Here is the message that I get on the screen:

PCI: No IRQ known for interrupt pin A of device 00:13:0. Please try using
pci=biosirq
PCI: No IRQ known for interrupt pin B of device 00:13.0. Please try using
pci=biosirq
Yenta IRQ list 06b8 PCI irq 0
Socket status: 30000007

and the system is completely frozen at that point -- even CTRL-ALT-DEL
doesn't work. (The soft power switch does, which tells me that NMI
interrupts get through, but nothing else.) As you might guess, SysRq
didn't work, either. Only powering off would allow me to restart the system.

When I recompile the kernel to not make PCMCIA a module, there is NO
message, just the system freeze.

Nothing interesting shows up in syslog.

Probing the /proc filesystem, I find that under 2.2.16 there is a character
device 254 labeled PCMCIA; in the 2.4.18 kernel I see no device 254 or any
device with the label PCMCIA. Granted, in the case of 2.2.16 the various
modules successfully loaded, so they may have advertised device 254,
whereas on the 2.4.18 kernel the failure kept the device from being advertised.

Dumping /proc/pci, I see device 19 (0x13) listed but completely different
capabilities advertised. Under 2.2.16, I see "Slow devsel. Fast
back-to-back capable. Master Capable. No bursts. Min Gnt=128.Max
lat=4." The same device under 2.4.18 reports "Non-prefetchable 32 bit
memory at 0x100000000 [0x100000fff]." Other PCI devices have reports that
differ in format but not significantly in the amount of and values in content.

I even went so far as to download the latest version
(pcmcia-cs-3.1.34.tar.gz) of the PCMCIA stuff from SourceForge, compiled it
all, and ended up with exactly the same results. So I'm beginning to
believe that it's not the PCMCIA/PCCard software.

I checked the kernel archives for any mention of this problem, and the
closest I could find was a complaint regarding an IBM ThinkPad. Ditto
checking the bug list for the project on SourceForge. Nothing on Toshiba

I put the old hard drive back into the laptop so I can get some work done,
but I still have all the stuff on the new drive.

The distributions involved are Red Hat 7.0 and Red Hat 7.3.

Where to "try using pci=biosirq"? I tried adding it to the boot sequence,
with no result.

I'm stumped. Any suggestions where to start looking?

Stephen Satchell

2002-06-17 04:52:07

[permalink] [raw]

Subject: [patch] 2.5.22 current scheduler bits #1.

one more fix: do not forced-migrate tasks in wakeup if it violates the
affinity mask. My current scheduler tree is attached, against 2.5.22.

Ingo

--- linux/kernel/sched.c.orig Mon Jun 17 05:43:53 2002
+++ linux/kernel/sched.c Mon Jun 17 06:46:01 2002
@@ -6,14 +6,14 @@
* Copyright (C) 1991-2002 Linus Torvalds
*
* 1996-12-23 Modified by Dave Grothe to fix bugs in semaphores and
- * make semaphores SMP safe
+ * make semaphores SMP safe
* 1998-11-19 Implemented schedule_timeout() and related stuff
* by Andrea Arcangeli
* 2002-01-04 New ultra-scalable O(1) scheduler by Ingo Molnar:
- * hybrid priority-list and round-robin design with
- * an array-switch method of distributing timeslices
- * and per-CPU runqueues. Additional code by Davide
- * Libenzi, Robert Love, and Rusty Russell.
+ * hybrid priority-list and round-robin design with
+ * an array-switch method of distributing timeslices
+ * and per-CPU runqueues. Additional code by Davide
+ * Libenzi, Robert Love, and Rusty Russell.
*/

#include <linux/mm.h>
@@ -180,11 +180,14 @@
/*
* rq_lock - lock a given runqueue and disable interrupts.
*/
-static inline runqueue_t *rq_lock(runqueue_t *rq)
+static inline runqueue_t *this_rq_lock(void)
{
+ runqueue_t *rq;
+
local_irq_disable();
rq = this_rq();
spin_lock(&rq->lock);
+
return rq;
}

@@ -358,12 +361,17 @@
rq = task_rq_lock(p, &flags);
old_state = p->state;
if (!p->array) {
- if (unlikely(sync && (rq->curr != p))) {
- if (p->thread_info->cpu != smp_processor_id()) {
- p->thread_info->cpu = smp_processor_id();
- task_rq_unlock(rq, &flags);
- goto repeat_lock_task;
- }
+ /*
+ * Fast-migrate the task if it's not running or runnable
+ * currently. Do not violate hard affinity.
+ */
+ if (unlikely(sync && (rq->curr != p) &&
+ (p->thread_info->cpu != smp_processor_id()) &&
+ (p->cpus_allowed & (1UL << smp_processor_id())))) {
+
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
}
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
@@ -388,9 +396,7 @@

void wake_up_forked_process(task_t * p)
{
- runqueue_t *rq;
-
- rq = rq_lock(rq);
+ runqueue_t *rq = this_rq_lock();

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -797,7 +803,8 @@
list_t *queue;
int idx;

- BUG_ON(in_interrupt());
+ if (unlikely(in_interrupt()))
+ BUG();

#if CONFIG_DEBUG_HIGHMEM
check_highmem_ptes();
@@ -1158,13 +1165,12 @@
static int setscheduler(pid_t pid, int policy, struct sched_param *param)
{
struct sched_param lp;
+ int retval = -EINVAL;
prio_array_t *array;
unsigned long flags;
runqueue_t *rq;
- int retval;
task_t *p;

- retval = -EINVAL;
if (!param || pid < 0)
goto out_nounlock;

@@ -1251,10 +1257,9 @@

asmlinkage long sys_sched_getscheduler(pid_t pid)
{
+ int retval = -EINVAL;
task_t *p;
- int retval;

- retval = -EINVAL;
if (pid < 0)
goto out_nounlock;

@@ -1271,11 +1276,10 @@

asmlinkage long sys_sched_getparam(pid_t pid, struct sched_param *param)
{
- task_t *p;
struct sched_param lp;
- int retval;
+ int retval = -EINVAL;
+ task_t *p;

- retval = -EINVAL;
if (!param || pid < 0)
goto out_nounlock;

@@ -1310,8 +1314,8 @@
unsigned long *user_mask_ptr)
{
unsigned long new_mask;
- task_t *p;
int retval;
+ task_t *p;

if (len < sizeof(new_mask))
return -EINVAL;
@@ -1361,13 +1365,12 @@
asmlinkage int sys_sched_getaffinity(pid_t pid, unsigned int len,
unsigned long *user_mask_ptr)
{
- unsigned long mask;
unsigned int real_len;
- task_t *p;
+ unsigned long mask;
int retval;
+ task_t *p;

real_len = sizeof(mask);
-
if (len < real_len)
return -EINVAL;

@@ -1392,25 +1395,35 @@

asmlinkage long sys_sched_yield(void)
{
- runqueue_t *rq;
- prio_array_t *array;
-
- rq = rq_lock(rq);
+ runqueue_t *rq = this_rq_lock();
+ prio_array_t *array = current->array;

/*
- * Decrease the yielding task's priority by one, to avoid
- * livelocks. This priority loss is temporary, it's recovered
- * once the current timeslice expires.
+ * There are three levels of how a yielding task will give up
+ * the current CPU:
*
- * If priority is already MAX_PRIO-1 then we still
- * roundrobin the task within the runlist.
- */
- array = current->array;
- /*
- * If the task has reached maximum priority (or is a RT task)
- * then just requeue the task to the end of the runqueue:
+ * #1 - it decreases its priority by one. This priority loss is
+ * temporary, it's recovered once the current timeslice
+ * expires.
+ *
+ * #2 - once it has reached the lowest priority level,
+ * it will give up timeslices one by one. (We do not
+ * want to give them up all at once, it's gradual,
+ * to protect the casual yield()er.)
+ *
+ * #3 - once all timeslices are gone we put the process into
+ * the expired array.
+ *
+ * (special rule: RT tasks do not lose any priority, they just
+ * roundrobin on their current priority level.)
*/
- if (likely(current->prio == MAX_PRIO-1 || rt_task(current))) {
+ if (likely(current->prio == MAX_PRIO-1)) {
+ if (current->time_slice <= 1) {
+ dequeue_task(current, rq->active);
+ enqueue_task(current, rq->expired);
+ } else
+ current->time_slice--;
+ } else if (unlikely(rt_task(current))) {
list_del(&current->run_list);
list_add_tail(&current->run_list, array->queue + current->prio);
} else {
@@ -1461,9 +1474,9 @@

asmlinkage long sys_sched_rr_get_interval(pid_t pid, struct timespec *interval)
{
+ int retval = -EINVAL;
struct timespec t;
task_t *p;
- int retval = -EINVAL;

if (pid < 0)
goto out_nounlock;
@@ -1758,8 +1771,8 @@

static int migration_thread(void * bind_cpu)
{
- int cpu = cpu_logical_map((int) (long) bind_cpu);
struct sched_param param = { sched_priority: MAX_RT_PRIO-1 };
+ int cpu = cpu_logical_map((int) (long) bind_cpu);
runqueue_t *rq;
int ret;

@@ -1836,15 +1849,14 @@
int cpu;

current->cpus_allowed = 1UL << cpu_logical_map(0);
- for (cpu = 0; cpu < smp_num_cpus; cpu++) {
+ for (cpu = 0; cpu < smp_num_cpus; cpu++)
if (kernel_thread(migration_thread, (void *) (long) cpu,
CLONE_FS | CLONE_FILES | CLONE_SIGNAL) < 0)
BUG();
- }
current->cpus_allowed = -1L;

for (cpu = 0; cpu < smp_num_cpus; cpu++)
while (!cpu_rq(cpu_logical_map(cpu))->migration_thread)
schedule_timeout(2);
}
-#endif /* CONFIG_SMP */
+#endif

2002-06-17 05:33:25

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

From: Robert Love <[email protected]>
Date: 16 Jun 2002 16:45:45 -0700

My approach thus far with 2.5 -> 2.4 O(1) backports has been one of
caution and it has worked fine thus far. I figure, what is the rush?

Your changes were pretty, that's part of the problem. Fixing things
correctly is 10 times more preferable to a 1 time hack "just for now".

2002-06-17 08:18:06

by Zwane Mwaikambo

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Mon, 17 Jun 2002, Ingo Molnar wrote:

> i have planned to submit the irqbalance patch for 2.4-ac real soon, which
> needs this function - current IRQ distribution on P4 SMP boxes is a
> showstopper.

Can we add a config time option for irqbalance? I consider it extra
overhead for setups which can do the interrupt distribution via hardware
properly, also irqbalance breaks NUMAQ horribly seeing as it assumes a
number of things like addressing modes.

Regards,
Zwane Mwaikambo

--
http://function.linuxpower.ca

2002-06-17 08:35:00

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Mon, 17 Jun 2002, Zwane Mwaikambo wrote:

> > i have planned to submit the irqbalance patch for 2.4-ac real soon, which
> > needs this function - current IRQ distribution on P4 SMP boxes is a
> > showstopper.
>
> Can we add a config time option for irqbalance? I consider it extra
> overhead for setups which can do the interrupt distribution via hardware
> properly, [...]

What x86 hardware do you have in mind?

My main issue with irqbalance is the lack of testing it has - eg. a
showstopper SMP-on-UP bug was found just two days ago.

> [...] also irqbalance breaks NUMAQ horribly seeing as it assumes a
> number of things like addressing modes.

exactly what does it assume that breaks NUMAQ?

Ingo

2002-06-17 08:50:55

by Zwane Mwaikambo

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Mon, 17 Jun 2002, Ingo Molnar wrote:

>
> On Mon, 17 Jun 2002, Zwane Mwaikambo wrote:
>
> > > i have planned to submit the irqbalance patch for 2.4-ac real soon, which
> > > needs this function - current IRQ distribution on P4 SMP boxes is a
> > > showstopper.
> >
> > Can we add a config time option for irqbalance? I consider it extra
> > overhead for setups which can do the interrupt distribution via hardware
> > properly, [...]
>
> What x86 hardware do you have in mind?

ye olde generic x86 SMP box, the interrupt handling imbalance came about
with the P4 and their newer APIC setup did it not? Although i am aware
that some x86 SMP boxes don't do the distribution properly too, thats why
i reckon config option would be best.

> My main issue with irqbalance is the lack of testing it has - eg. a
> showstopper SMP-on-UP bug was found just two days ago.

Understandable, i agree not many people run 2.5 and it would help if it
got into 2.4-ac for testing purposes.

> > [...] also irqbalance breaks NUMAQ horribly seeing as it assumes a
> > number of things like addressing modes.
>
> exactly what does it assume that breaks NUMAQ?

<Disclaimer>
I am not a NUMAQ expert and do not even have access to one for testing
</Disclaimer>

The addressing mode, irq_balance assumes that the addressing mode is
logical mode (when programming the IOREDTBL entries), whilst NUMAQ uses a
completely different addressing architecture. Also another thing is
consider this situation;

irqbalance programs IOAPIC#0 on node0 to deliver to CPU#6 on node1

Will that interrupt get delivered?

Regards,
Zwane Mwaikambo

--
http://function.linuxpower.ca

2002-06-17 09:02:44

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Mon, 17 Jun 2002, Zwane Mwaikambo wrote:

> > > Can we add a config time option for irqbalance? I consider it extra
> > > overhead for setups which can do the interrupt distribution via hardware
> > > properly, [...]
> >
> > What x86 hardware do you have in mind?
>
> ye olde generic x86 SMP box, the interrupt handling imbalance came about
> with the P4 and their newer APIC setup did it not? Although i am aware
> that some x86 SMP boxes don't do the distribution properly too, thats
> why i reckon config option would be best.

even generic x86 SMP boxes benefit from irqbalance due to better irq
affinity. Actually one could hardly find a worse way to distribute
interrupts than the IO-APIC + lowest priority delivery mode does... [in
fact there is one, the P4's do it ;-) ]

> > > [...] also irqbalance breaks NUMAQ horribly seeing as it assumes a
> > > number of things like addressing modes.
> >
> > exactly what does it assume that breaks NUMAQ?
>
> <Disclaimer>
> I am not a NUMAQ expert and do not even have access to one for testing
> </Disclaimer>
>
> The addressing mode, irq_balance assumes that the addressing mode is
> logical mode (when programming the IOREDTBL entries), whilst NUMAQ uses
> a completely different addressing architecture. [...]

irqbalance uses the set_ioapic_affinity() method to set affinity. The
clustered APIC code is broken if it doesnt handle this properly. (i dont
have such hardware so i cant tell, but it indeed doesnt appear to handle
this case properly.) By wrapping around at node boundary the irqbalance
code will work just fine.

> [...] Also another thing is consider this situation;
>
> irqbalance programs IOAPIC#0 on node0 to deliver to CPU#6 on node1
>
> Will that interrupt get delivered?

i agree that this could be a problem, but set_ioapic_affinity() can be
made dependent on the actual NUMA setup that is used. This is absolutely
needed anyway for a proper /proc/irq/*/smp_affinity feature.

Ingo

2002-06-17 10:02:00

by Zwane Mwaikambo

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

<Martin Bligh added to CC>

On Mon, 17 Jun 2002, Ingo Molnar wrote:

> irqbalance uses the set_ioapic_affinity() method to set affinity. The
> clustered APIC code is broken if it doesnt handle this properly. (i dont
> have such hardware so i cant tell, but it indeed doesnt appear to handle
> this case properly.) By wrapping around at node boundary the irqbalance
> code will work just fine.

I agree, Also we have to be careful about the usage of cpu_online_map in
balance_irq, there might need to be a bit of reworking of some of the
other parts to get this working e.g. being able to determine which node a
specific IOAPIC register is on (perhaps there might be 1 or 2 IOAPICs /
node) etc etc. Martin?

> i agree that this could be a problem, but set_ioapic_affinity() can be
> made dependent on the actual NUMA setup that is used. This is absolutely
> needed anyway for a proper /proc/irq/*/smp_affinity feature.

Agreed.

Thanks,
Zwane
--
http://function.linuxpower.ca

2002-06-17 16:59:10

by Rusty Russell

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Mon, 17 Jun 2002 05:24:30 +0200 (CEST)
Ingo Molnar <[email protected]> wrote:

> > > - sched_setaffinity() & sched_getaffinity() syscalls on x86.
> >
> > Do we want to introduce this into 2.4 now? I realize 2.4-ac is not 2.4
> > proper, but if there is a chance this interface could change...
>
> the setaffinity()/getaffinity() interface looks pretty robust, i dont
> expect any changes

There's one coming. In 2.5.soon, you'll need to handle the "CPU going away"
signal, otherwise your process will abort as someone downs a CPU.

The problem with backporting one and not the other, is that apps can't be
written correctly for 2.4 and 2.5 8(

Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2002-06-17 21:18:29

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

On Sun, 2002-06-16 at 22:28, David S. Miller wrote:

> Your changes were pretty, that's part of the problem. Fixing things
> correctly is 10 times more preferable to a 1 time hack "just for now".

*shrug* I think you are missing my point but that is OK - we really do
not need to fight over it.

The switch_mm patch touched _core_ bits - code that affects i386 which
works fine now in 2.4-ac. As 2.4-ac is stable and i386 is working fine,
I want to move changes into it slowly and with testing.

If you object to merging the "broken" sparc64 patch now but concede we
can wait for Ingo's patch, then I agree. In fact, in light of Ingo's
patch Alan should not merge what I sent. But my opinion would be to
hold off until the new bits saw some testing in 2.5 ... however trivial
they may be.

Robert Love

2002-06-18 07:17:21

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

On Mon, Jun 17, 2002 at 11:00:26AM +0200, Ingo Molnar wrote:
> irqbalance uses the set_ioapic_affinity() method to set affinity. The
> clustered APIC code is broken if it doesnt handle this properly. (i dont
> have such hardware so i cant tell, but it indeed doesnt appear to handle
> this case properly.) By wrapping around at node boundary the irqbalance
> code will work just fine.

Perhaps a brief look at the code will help. Please forgive my
non-preservation of whitespace as I cut and pasted it.

static inline void balance_irq(int irq)
{
#if CONFIG_SMP
irq_balance_t *entry = irq_balance + irq;
unsigned long now = jiffies;

if (unlikely(entry->timestamp != now)) {
unsigned long allowed_mask;
int random_number;

rdtscl(random_number);
random_number &= 1;

allowed_mask = cpu_online_map & irq_affinity[irq];
entry->timestamp = now;
entry->cpu = move(entry->cpu, allowed_mask, now, random_number);
set_ioapic_affinity(irq, 1 << entry->cpu);
}
#endif
}

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1 << entry->cpu

This could be problematic ...

static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
{
unsigned long flags;

/*
* Only the first 8 bits are valid.
*/
mask = mask << 24;
spin_lock_irqsave(&ioapic_lock, flags);
__DO_ACTION(1, = mask, )
spin_unlock_irqrestore(&ioapic_lock, flags);
}

According to this, nothing over 8 cpu's can work as the cpu id is used
as a shift into an 8-bit bitfield. Also,

#define __DO_ACTION(R, ACTION, FINAL) \
\
{ \
int pin; \
struct irq_pin_list *entry = irq_2_pin + irq; \
\
for (;;) { \
unsigned int reg; \
pin = entry->pin; \
if (pin == -1) \
break; \
reg = io_apic_read(entry->apic, 0x10 + R + pin*2); \
reg ACTION; \
io_apic_modify(entry->apic, reg); \
if (!entry->next) \
break; \
entry = irq_2_pin + entry->next; \
} \
FINAL; \
}

ACTION is supposed to be an assignment to reg; in clustered hierarchical
destination format this is not a bitmask as assumed by 1 << entry->cpu.

Matt, Mike, please comment.

Cheers,
Bill

2002-06-19 01:25:59

by Matthew Dobson

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

I'm looking at this right now, as it is definitely broken on our NUMA-Q
hardware when running in multiquad mode. It needs to respect clustered APIC
mode, so I'm working on it.

Cheers!

-Matt

William Lee Irwin III wrote:
> On Mon, Jun 17, 2002 at 11:00:26AM +0200, Ingo Molnar wrote:
>
>>irqbalance uses the set_ioapic_affinity() method to set affinity. The
>>clustered APIC code is broken if it doesnt handle this properly. (i dont
>>have such hardware so i cant tell, but it indeed doesnt appear to handle
>>this case properly.) By wrapping around at node boundary the irqbalance
>>code will work just fine.
>
>
> Perhaps a brief look at the code will help. Please forgive my
> non-preservation of whitespace as I cut and pasted it.
>
>
> static inline void balance_irq(int irq)
> {
> #if CONFIG_SMP
> irq_balance_t *entry = irq_balance + irq;
> unsigned long now = jiffies;
>
> if (unlikely(entry->timestamp != now)) {
> unsigned long allowed_mask;
> int random_number;
>
> rdtscl(random_number);
> random_number &= 1;
>
> allowed_mask = cpu_online_map & irq_affinity[irq];
> entry->timestamp = now;
> entry->cpu = move(entry->cpu, allowed_mask, now, random_number);
> set_ioapic_affinity(irq, 1 << entry->cpu);
> }
> #endif
> }
>
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 1 << entry->cpu
>
>
>
> This could be problematic ...
>
>
> static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
> {
> unsigned long flags;
>
> /*
> * Only the first 8 bits are valid.
> */
> mask = mask << 24;
> spin_lock_irqsave(&ioapic_lock, flags);
> __DO_ACTION(1, = mask, )
> spin_unlock_irqrestore(&ioapic_lock, flags);
> }
>
>
> According to this, nothing over 8 cpu's can work as the cpu id is used
> as a shift into an 8-bit bitfield. Also,
>
>
> #define __DO_ACTION(R, ACTION, FINAL) \
> \
> { \
> int pin; \
> struct irq_pin_list *entry = irq_2_pin + irq; \
> \
> for (;;) { \
> unsigned int reg; \
> pin = entry->pin; \
> if (pin == -1) \
> break; \
> reg = io_apic_read(entry->apic, 0x10 + R + pin*2); \
> reg ACTION; \
> io_apic_modify(entry->apic, reg); \
> if (!entry->next) \
> break; \
> entry = irq_2_pin + entry->next; \
> } \
> FINAL; \
> }
>
> ACTION is supposed to be an assignment to reg; in clustered hierarchical
> destination format this is not a bitmask as assumed by 1 << entry->cpu.
>
>
> Matt, Mike, please comment.
>
>
> Cheers,
> Bill
>

2002-06-20 19:42:05

by Alan Cox

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

> them... (I am currently putting together all the scheduler bits we have
> been working on for a 2.4-ac patch...)
>
> Your sparc64 kernel/sched.c bits have zero testing in any kernel.
> What point are you trying to make? It disables a very important
> optimization on SMP sparc64. It's simply unacceptable.

I don't care about Sparc64, especially as a short term item. Long term
yes you are right but for the -ac work, it can fall back for a while

2002-06-20 20:24:30

by Andrew Theurer

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

Ingo,

Could we also change "now" to a longer interval? In netbench, 2.4.18, O1,
irqbalance, I get the following results:

[4-way P4, 4 acenics]
now = jiffies 743 Mbps
now = jiffies*10 784 Mbps
now = jiffies*20 803 Mbps
now = jiffies*30 800 Mbps
now = jiffies*100 770 Mbps

[no irqbalance patch]
all IRQs on one CPU 809 Mbps
1 acenic per CPU 800 Mbps

Either the IRQs don't get to stick around long enough, or there is a high cost
for the IOAPIC programming? Anton may have some info on this as well....

-Andrew Theurer

On Tuesday 18 June 2002 20:05, Matthew Dobson wrote:
> I'm looking at this right now, as it is definitely broken on our NUMA-Q
> hardware when running in multiquad mode. It needs to respect clustered
> APIC mode, so I'm working on it.
>
> Cheers!
>
> -Matt
>
> William Lee Irwin III wrote:
> > On Mon, Jun 17, 2002 at 11:00:26AM +0200, Ingo Molnar wrote:
> >>irqbalance uses the set_ioapic_affinity() method to set affinity. The
> >>clustered APIC code is broken if it doesnt handle this properly. (i dont
> >>have such hardware so i cant tell, but it indeed doesnt appear to handle
> >>this case properly.) By wrapping around at node boundary the irqbalance
> >>code will work just fine.
> >
> > Perhaps a brief look at the code will help. Please forgive my
> > non-preservation of whitespace as I cut and pasted it.
> >
> >
> > static inline void balance_irq(int irq)
> > {
> > #if CONFIG_SMP
> > irq_balance_t *entry = irq_balance + irq;
> > unsigned long now = jiffies;
> >
> > if (unlikely(entry->timestamp != now)) {
> > unsigned long allowed_mask;
> > int random_number;
> >
> > rdtscl(random_number);
> > random_number &= 1;
> >
> > allowed_mask = cpu_online_map & irq_affinity[irq];
> > entry->timestamp = now;
> > entry->cpu = move(entry->cpu, allowed_mask, now, random_number);
> > set_ioapic_affinity(irq, 1 << entry->cpu);
> > }
> > #endif
> > }
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 1 << entry->cpu
> >
> >
> >
> > This could be problematic ...
> >
> >
> > static void set_ioapic_affinity (unsigned int irq, unsigned long mask)
> > {
> > unsigned long flags;
> >
> > /*
> > * Only the first 8 bits are valid.
> > */
> > mask = mask << 24;
> > spin_lock_irqsave(&ioapic_lock, flags);
> > __DO_ACTION(1, = mask, )
> > spin_unlock_irqrestore(&ioapic_lock, flags);
> > }
> >
> >
> > According to this, nothing over 8 cpu's can work as the cpu id is used
> > as a shift into an 8-bit bitfield. Also,
> >
> >
> > #define __DO_ACTION(R, ACTION, FINAL) \
> > \
> > { \
> > int pin; \
> > struct irq_pin_list *entry = irq_2_pin + irq; \
> > \
> > for (;;) { \
> > unsigned int reg; \
> > pin = entry->pin; \
> > if (pin == -1) \
> > break; \
> > reg = io_apic_read(entry->apic, 0x10 + R + pin*2); \
> > reg ACTION; \
> > io_apic_modify(entry->apic, reg); \
> > if (!entry->next) \
> > break; \
> > entry = irq_2_pin + entry->next; \
> > } \
> > FINAL; \
> > }
> >
> > ACTION is supposed to be an assignment to reg; in clustered hierarchical
> > destination format this is not a bitmask as assumed by 1 << entry->cpu.
> >
> >
> > Matt, Mike, please comment.
> >
> >
> > Cheers,
> > Bill
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2002-06-24 00:18:14

by Martin J. Bligh

[permalink] [raw]

Subject: Re: [patch] 2.4.19-pre10-ac2: O(1) scheduler merge, -A3.

> On Mon, Jun 17, 2002 at 11:00:26AM +0200, Ingo Molnar wrote:
>> irqbalance uses the set_ioapic_affinity() method to set affinity. The
>> clustered APIC code is broken if it doesnt handle this properly. (i dont
>> have such hardware so i cant tell, but it indeed doesnt appear to handle
>> this case properly.) By wrapping around at node boundary the irqbalance
>> code will work just fine.
>
> Perhaps a brief look at the code will help. Please forgive my
> non-preservation of whitespace as I cut and pasted it.

IIRC, I set up the IOAPICs to use physical mode broadcast
on all quads - physical broadcasts are quad-local, and thus
the interrupt is always processesed by a cpu on the quad
where it originated. Much simpler than trying to correctly
program clustered logical mode broadcasts differently for
every quad.

You also don't want to end up reprogramming the IO-APICs
cross-quad, you want a per-node thread to do this. We have
2 IO-APICs per node.

Whilst balancing of some form is definitely valuable for a P4,
I'm less convinced it's worthwhile for a P3 system. I presume
what you're trying to achieve is cache warmth for the interrupt
handling code at the expense of the cost of constantly reprogramming
the IO-APICs.

At the very least, we need to have a simple disable config option
in order to benchmark whether this change is worthwhile for each subarchitecture.

M.

2002-06-14 04:30:05

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

From: Robert Love <[email protected]>
Date: 13 Jun 2002 12:21:58 -0700

Patch is against 2.4.19-pre10-ac2, please apply.

Ummm what is with all of those switch_mm() hacks? Is this an attempt
to work around the locking problems? Please don't do that as it is
going to kill performance and having ifdef sparc64 sched.c changes is
ugly to say the least.

Ingo posted the correct fix to the locking problem with the patch
he posted the other day, that is what should go into the -ac patches.

2002-06-14 17:37:45

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

On Thu, 2002-06-13 at 21:25, David S. Miller wrote:

> Ummm what is with all of those switch_mm() hacks? Is this an attempt
> to work around the locking problems? Please don't do that as it is
> going to kill performance and having ifdef sparc64 sched.c changes is
> ugly to say the least.
>
> Ingo posted the correct fix to the locking problem with the patch
> he posted the other day, that is what should go into the -ac patches.

I am explicitly refraining from sending Alan any code that is not
well-tested in 2.5 and my machines first. As Ingo's new switch_mm()
bits are not even in 2.5 yet, I plan to wait a bit before sending
them... (I am currently putting together all the scheduler bits we have
been working on for a 2.4-ac patch...)

If you like, Alan can hold off on this and take it when the appropriate
patches are in.

Robert Love

2002-06-14 22:01:14

by Tom Duffy

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

begin David S. Miller quotation on Thu, 13 Jun 2002 21:32:11 -0700:

> From: Robert Love <[email protected]>
> Date: 13 Jun 2002 12:21:58 -0700
>
> Patch is against 2.4.19-pre10-ac2, please apply.
>
> Ummm what is with all of those switch_mm() hacks? Is this an attempt to
> work around the locking problems? Please don't do that as it is going
> to kill performance and having ifdef sparc64 sched.c changes is ugly to
> say the least.
>
> Ingo posted the correct fix to the locking problem with the patch he
> posted the other day, that is what should go into the -ac patches.

This part of the patch (the change to kernel/sched.c) can be safely
removed without making o1 stop working.

This hack (conservatively) fixes an issue where on bootup, the machine
would get into a page fault loop and hang. This only happens a very
small percentage of the time. I will investigate whether the patch
Ingo put out fixes this issue.

-tduffy

--
He who receives an idea from me, receives instruction himself without
lessening mine; as he who lights his taper at mine, receives light
without darkening me. -- Thomas Jefferson

2002-06-15 13:27:20

[permalink] [raw]

Subject: Re: [PATCH] 2.4-ac: sparc64 support for O(1) scheduler

From: Robert Love <[email protected]>
Date: 14 Jun 2002 10:32:32 -0700

I am explicitly refraining from sending Alan any code that is not
well-tested in 2.5 and my machines first. As Ingo's new switch_mm()
bits are not even in 2.5 yet, I plan to wait a bit before sending
them... (I am currently putting together all the scheduler bits we have
been working on for a 2.4-ac patch...)

Your sparc64 kernel/sched.c bits have zero testing in any kernel.
What point are you trying to make? It disables a very important
optimization on SMP sparc64. It's simply unacceptable.

Ingo's change which deletes the frozen locking bits has to be
installed with the patches which allow sparc64 to continue working
without the deadlock bug, they cannot be added seperately.

2002-06-15 13:39:51