the -J4 scheduler patch is available:
http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.3-pre2-J4.patch
http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-J4.patch
there are no open/reported bugs, and no new bugs have been found since -J2.
The scheduler appears to be stabilizing steadily.
-J4 includes two changes to further improve interactiveness:
1) the introduction of 'super-long' timeslices: the maximum timeslice is
now 500 msecs (it was 180 msecs before), and the new default timeslice
is 250 msecs (it was 90 msecs before).
the reason for super-long timeslices is that, IMO, we can now afford them.
The scheduler is pretty good at identifying true interactive tasks these
days, so we can increase timeslice length without risking the loss of good
interactive latencies. Long timeslices have a number of advantages:
- nice +19 CPU hogs take up less CPU time than they used to.
- interactive tasks can gather a bigger 'reserve' timeslice that they can
use up in bursts of processing.
- CPU hogs will get better cache affinity, due to longer timeslices
and less context-switching.
Long timeslices also have a disadvantage:
- under high load, if an interactive task manages to fall into the
CPU-bound hell, it will take longer to get its next slice of
processing.
i have measured the pros to beat the cons under the workloads i tried, but
YMMV - more testing by more people is needed, comparing -J4's interactive
feel (and nice behavior, and kernel compilation performance) against -J2's
interactive feel/performance.
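As a rough illustration of what these numbers imply (a sketch only, not
the patch's actual code): with a linear nice-to-timeslice mapping and an
assumed 10 msec minimum, the stated 500 msec maximum lands at nice -20 and
nice 0 comes out near the 250 msec default. All constant and function
names below are illustrative:

	/* userspace sketch; MIN_TIMESLICE_MS is an assumption */
	#include <stdio.h>

	#define MIN_TIMESLICE_MS   10	/* assumed minimum, at nice +19 */
	#define MAX_TIMESLICE_MS  500	/* stated maximum, at nice -20  */

	/* nice ranges from -20 (highest priority) to +19 (lowest) */
	static int nice_to_timeslice_ms(int nice)
	{
		return MIN_TIMESLICE_MS +
			(MAX_TIMESLICE_MS - MIN_TIMESLICE_MS) * (19 - nice) / 39;
	}

	int main(void)
	{
		int nice;

		for (nice = -20; nice <= 19; nice++)
			printf("nice %+3d -> %3d msecs\n",
				nice, nice_to_timeslice_ms(nice));
		/* nice 0 prints ~248 msecs, close to the 250 msec default */
		return 0;
	}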
2) slight shrinking of the bonus/penalty range a task can get.
i've shrunk the bonus/penalty range from +-19 priority levels to +-14
priority levels. (from 90% of the full range to 70% of the full range.)
The reason this can be done without hurting interactiveness is that it's
no longer necessary to use the maximum range of priorities - the
interactiveness information is stored in p->sleep_avg, which is not
sensitive to the range of priority levels.
The shrinking has two benefits:
- slightly denser priority arrays, slightly better cache utilization.
- more isolation of nice levels from each other. E.g. nice -20 tasks now
have a 6-priority-level 'buffer zone' which cannot be reached by
normal interactive tasks; nice -20 audio daemons should benefit from
this. Also, normal CPU hogs are better isolated from nice +19 CPU hogs,
with the same 6-priority-level 'buffer zone'.
(by shrinking the bonus/penalty range, the -3 rule in the TASK_INTERACTIVE
definition was shrunk as well, to -2.)
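Roughly, a task's effective priority is its static, nice-derived priority
adjusted by a bonus that is now clamped to +-14 levels and derived from
p->sleep_avg. The sketch below is illustrative only - the scaling, the
MAX_SLEEP_AVG value and the nice-0 static priority of 120 are assumptions,
not the patch's code:

	#include <stdio.h>

	#define MAX_BONUS	  14	/* +/-14 priority levels, was +/-19 */
	#define MAX_SLEEP_AVG	2000	/* assumed cap on p->sleep_avg      */

	/*
	 * static_prio: nice-derived priority (lower number == higher prio),
	 * sleep_avg:   0..MAX_SLEEP_AVG, larger means more interactive.
	 */
	static int effective_prio(int static_prio, int sleep_avg)
	{
		/* scale sleep_avg into a bonus in [-MAX_BONUS, +MAX_BONUS] */
		int bonus = (2 * MAX_BONUS * sleep_avg) / MAX_SLEEP_AVG - MAX_BONUS;

		/* interactive tasks (high sleep_avg) get a priority boost */
		return static_prio - bonus;
	}

	int main(void)
	{
		/* 120 is an assumed nice-0 static priority, for illustration */
		printf("fully interactive task: prio %d\n",
			effective_prio(120, MAX_SLEEP_AVG));
		printf("pure CPU hog:           prio %d\n",
			effective_prio(120, 0));
		return 0;
	}

With the clamp at 14 instead of 19, an interactive nice 0 task can no
longer climb all the way into the priority levels occupied by nice -20
tasks, which is where the 'buffer zone' described above comes from.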
Changelog:
- Erich Focht: optimize max_load, remove prev_max_load.
- Robert Love: simplify unlock_task_rq().
- Robert Love: fix the ->cpu offset value in x86's entry.S, used by the
preemption patch.
- me: interactiveness updates.
- me: sched_rr_get_interval() should return the timeslice value based on
->__nice, not based on ->prio.
Bug reports, comments, suggestions welcome. (any patch/fix that is not in
-J4 is lost and should be resent.)
Ingo
and due to popular demand there is also a patch against 2.4.18-pre4:
http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.18-pre4-J4.patch
Ingo
> and due to popular demand there is also a patch against 2.4.18-pre4:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.18-pre4-J4.patch
I was trying to test this in my 8 way NUMA box, but this patch
seems to have lost half of the wait_init_idle fix that I put
in a while back. I'm not sure if this is deliberate or not, but
I suspect not, as you only removed part of it, and from your
comment below (from a previous email), I think you understand
the reasoning behind it:
> the new rule is this: schedule() must not be called before all bits in
> wait_init_idle are clear. I'd suggest you add this to the top of
> schedule():
>
> 	if (wait_init_idle)
> 		BUG();
Anyway, the machine won't boot without this fix, so I tried adding
it back in, and now it boots just fine. Patch is attached below.
If the removal was accidental, please could you add it back in
as below ... if not, could we discuss why this was removed, and
maybe we can find another way to fix the problem?
Meanwhile, I'll try to knock out some benchmark figures with the
new scheduler code in place on the 8 way NUMA and a 16 way
NUMA ;-)
Martin.
diff -urN linux-2.4.18-pre4.old/init/main.c linux-2.4.18-pre4.new/init/main.c
--- linux-2.4.18-pre4.old/init/main.c Wed Jan 23 18:26:56 2002
+++ linux-2.4.18-pre4.new/init/main.c Wed Jan 23 18:27:04 2002
@@ -508,6 +508,14 @@
smp_threads_ready=1;
smp_commence();
+
+ /* Wait for the other cpus to set up their idle processes */
+ printk("Waiting on wait_init_idle (map = 0x%lx)\n", wait_init_idle);
+ while (wait_init_idle) {
+ cpu_relax();
+ barrier();
+ }
+ printk("All processors have done init_idle\n");
}
#endif
diff -urN linux-2.4.18-pre4.old/kernel/sched.c linux-2.4.18-pre4.new/kernel/sched.c
--- linux-2.4.18-pre4.old/kernel/sched.c Wed Jan 23 18:26:56 2002
+++ linux-2.4.18-pre4.new/kernel/sched.c Wed Jan 23 18:27:09 2002
@@ -1221,6 +1221,8 @@
spin_unlock(&rq2->lock);
}
+extern unsigned long wait_init_idle;
+
void __init init_idle(void)
{
runqueue_t *this_rq = this_rq(), *rq = current->array->rq;
@@ -1237,6 +1239,7 @@
current->state = TASK_RUNNING;
double_rq_unlock(this_rq, rq);
current->need_resched = 1;
+ clear_bit(cpu(), &wait_init_idle);
__restore_flags(flags);
}
On Wed, 23 Jan 2002, Martin J. Bligh wrote:
> I was trying to test this in my 8 way NUMA box, but this patch seems
> to have lost half of the wait_init_idle fix that I put in a while
> back. [...]
please check out the -J5 2.4.17/18 patch, that's the first 2.4 patch that
has the correct idle-thread fixes (which 2.5.3-pre3 has as well). Do you
still have booting problems?
Ingo
>> I was trying to test this in my 8 way NUMA box, but this patch seems
>> to have lost half of the wait_init_idle fix that I put in a while
>> back. [...]
>
> please check out the -J5 2.4.17/18 patch, that's the first 2.4 patch that
> has the correct idle-thread fixes (which 2.5.3-pre3 has as well). Do you
> still have booting problems?
Yes ... I tried J6 on 2.4.18-pre4. If you want the garbled panic, it's
attached below. What you're doing in J6 certainly looks different, but
still appears not to be correct. I'll look at it some more, and try to
send you a patch against J6 today.
On the upside, the performance of your J4 patch with the added fix I
sent yesterday seems to be a great improvement - before, I was getting
about 16% of my total system time spent in default_idle on a
make -j16 bzImage. Now it's 0% ... we're actually feeding those CPUs ;-)
Kernel compile time is under a minute (56s) for the first time ever on the
8 way ... more figures later.
Martin.
checking TSC synchronization across CPUs:
BIOS BUG: CPU#0 improperly initialized, has 52632 usecs TSC skew! FIXED.
BIOS BUG: CPU#1 improperly initialized, has 52632 usecs TSC skew! FIXED.
BIOS BUG: CPU#2 improperly initialized, has 52632 usecs TSC skew! FIXED.
BIOS BUG: CPU#3 improperly initialized, has 52632 usecs TSC skew! FIXED.
BIOS BUG: CPU#4 improperly initialized, has -52632 usecs TSC skew! FIXED.
BIOS BUG: CPU#5 improperly initialized, has -52632 usecs TSC skew! FIXED.
BIOS BUG: CPU#6 improperly initialized, has -52632 usecs TSC skew! FIXED.
BIOS BUG: CPU#7 improperly initialized, has -52633 usecs TSC skew! FIXED.
tecpu 1 has don0 init idl0, do246>c u_idot tai 0idle
ee doin0000u_idle().
00cpu 09 s done idat i0 edoinf cau_00
().
i: cpu 6 has done 00000d1e doin: f7d_idl8
dydinitieli ds sp:Cf7dadab0e
3CPU# slready initialiess
apperblp1>: 0, sta4,pagreaddlre00)presenack: >c01e<Uc0 l0000 to4ha0dl00ha1d po26
2900 ointe5f36 00000246 00000001 c021def0
00000282 00000001 00000015 00000000 c0295ef5 c0295ef7 c0118910 00000c2d
00000004 00000000 00000000 c023624f
Call Trace: [<c0118910>]
Code: Bad EIP value.
in0>Kernop randc: A00
pted to k 2
tEI i e ta10!
<c0I3 ed2e] k - noainteding
F AGS: 00010002
eax: 00000029 ebx: 00010007 ecx: c021df08 edx: 00003ffc
esi: c0233b1a edi: 0000001e ebp: f7db5fa8 esp: f7da9fb0
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, stackpage=f7da9000)
Stack: c01ee3c0 00000002 00000002 c0262a00 c0235f36 00000246 00000001 c021def0
00000282 00000001 00000015 00000000 c0295ef5 c0295ef7 c0118910 00000c35
00000006 00000000 00000000 c023624f
Call Trace: [<c0118910>]
Code: 0f 0b 83 c4 0c a1 84 74 2a c0 8d 90 c8 00 00 00 eb 0f 0f a3
<0>Kernel panic: Attempted to kill the idle task!
In idle task - not syncing
On Thu, 24 Jan 2002, Martin J. Bligh wrote:
> tecpu 1 has don0 init idl0, do246>c u_idot tai 0idle
> ee doin0000u_idle().
> 00cpu 09 s done idat i0 edoinf cau_00
just take out the TSC initialization messages from smpboot.c, that should
ungarble the output. And/or add this to printk.c:
	if (smp_processor_id())
		return;
this way you'll only see a single CPU's printk messages.
Ingo
Measuring the performance of a parallelized kernel compile with warm caches
on an 8 way NUMA-Q box. Highmem support is turned OFF, so I'm only using
the first 1GB or so of RAM (it's much faster without HIGHMEM).
prepare:
make -j16 dep; make -j16 bzImage; make mrproper; make -j16 dep;
measured:
time make -j16 bzImage
2.4.18-pre7
330.06user 99.92system 1:00.35elapsed 712%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (411135major+486026minor)pagefaults 0swaps
2.4.18-pre7 with J6 scheduler
307.19user 88.54system 0:57.63elapsed 686%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (399255major+484472minor)pagefaults 0swaps
This seems to give a significant improvement: not only a shorter
elapsed time, but also lower CPU load.
Martin.
On Thu, Jan 24, 2002 at 09:47:15AM -0800, Martin J. Bligh wrote:
> >> I was trying to test this in my 8 way NUMA box, but this patch seems
> >> to have lost half of the wait_init_idle fix that I put in a while
> >> back. [...]
We had this same trouble on 4 to 12 way Itanium machines, but finally
made them boot using a variation of your fix. That was with J7. It
looks like the boot CPU gets stuck waiting for wait_init_idle to
clear. We'll try to send out a patch on Monday.
Thanks,
Jesse
Measuring kernel compile times on a 16 way NUMA-Q,
adding Ingo's scheduler patch takes kernel compiles down
from 47 seconds to 31 seconds ... a pretty impressive benefit.
Mike Kravetz is working on NUMA additions to Ingo's scheduler,
which should give further improvements.
Martin.
On Thu, 7 Feb 2002, Martin J. Bligh wrote:
> Measuring kernel compile times on a 16 way NUMA-Q, adding Ingo's
> scheduler patch takes kernel compiles down from 47 seconds to 31
> seconds .... pretty impressive benefit.
cool! By the way, could you try a test-compile with a 'big' .config file?
The reason i'm asking is that with 31-second compiles, the final link-time
serialization has a significant effect, which makes the compile itself
less scalable. Adding lots of subsystems to the .config will create a
compilation that takes much longer, but which should also compare the two
schedulers better.
Ingo
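(As a rough back-of-the-envelope illustration of the link-serialization
point above - the numbers are hypothetical, the thread does not say how
long the final link takes:

	/* Amdahl's-law sketch: with T seconds of total work and a serial
	 * tail of s seconds, the best speedup on N CPUs is T/(s + (T-s)/N) */
	#include <stdio.h>

	int main(void)
	{
		double T = 400.0;	/* hypothetical single-CPU build time, secs  */
		double s = 8.0;		/* hypothetical serial final-link time, secs */
		int n;

		for (n = 1; n <= 16; n *= 2)
			printf("%2d CPUs: best-case speedup %.1fx\n",
				n, T / (s + (T - s) / n));
		return 0;
	}

A much longer compilation makes the serial tail a smaller fraction of the
whole, so the two schedulers can be compared more fairly.)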
On Fri, 2002-02-08 at 00:23, Ingo Molnar wrote:
> > Measuring kernel compile times on a 16 way NUMA-Q, adding Ingo's
> > scheduler patch takes kernel compiles down from 47 seconds to 31
> > seconds .... pretty impressive benefit.
> cool! By the way, could you try a test-compile with a 'big' .config file?
I'd assume that a 16-way machine still taking 31s to compile the kernel
already has a 'big' config file.
--
Servus,
Daniel
>> > Measuring kernel compile times on a 16 way NUMA-Q, adding Ingo's
>> > scheduler patch takes kernel compiles down from 47 seconds to 31
>> > seconds .... pretty impressive benefit.
>
>> cool! By the way, could you try a test-compile with a 'big' .config file?
>
> I'd assume that a 16-way machine still taking 31s to compile the kernel
> already has a 'big' config file.
It's a fairly normal config file, but the machine isn't feeling very
in touch with its NUMAness, so it scales badly. If I only use one
quad (4 processors), the same compile takes 47s.
M.