2002-01-07 18:26:58

by Ingo Molnar

Subject: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


-D1 is a quick update over -D0:

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D1.patch
http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-D1.patch

this should fix the child-inherits-parent-priority-boost issue that causes
interactivity problems during compilation.

Ingo


2002-01-07 19:04:39

by Brian Gerst

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

Ingo Molnar wrote:
>
> -D1 is a quick update over -D0:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D1.patch
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-D1.patch
>
> this should fix the child-inherits-parent-priority-boost issue that causes
> interactivity problems during compilation.
>
> Ingo

I noticed in this patch that you removed the rest_init() function. The
reason it was split from start_kernel() is that there was a race where
init memory could be freed before the call to cpu_idle(). Note that
start_kernel() is marked __init and rest_init() is not.
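
For reference, the split being described looks roughly like this (a simplified
sketch of the pattern, not the exact 2.5.2 code):

/*
 * Sketch: start_kernel() lives in the __init section, whose memory is
 * freed once userspace is up, so the function that becomes the idle
 * loop must not be __init itself.
 */
asmlinkage void __init start_kernel(void)
{
        /* ... trap_init(), sched_init(), mem_init(), etc. ... */

        /* Tail-call into non-__init code before init memory can go away. */
        rest_init();
}

static void rest_init(void)             /* deliberately NOT __init */
{
        kernel_thread(init, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);
        unlock_kernel();
        current->need_resched = 1;
        cpu_idle();                     /* never returns */
}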

--

Brian Gerst

2002-01-07 19:22:29

by Ingo Molnar

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Mon, 7 Jan 2002, Brian Gerst wrote:

> I noticed in this patch that you removed the rest_init() function.
> The reason it was split from start_kernel() is that there was a race
> where init memory could be freed before the call to cpu_idle(). Note
> that start_kernel() is marked __init and rest_init() is not.

you are right, i've missed that detail. I've fixed this in my tree
(reverted that part to the previous behavior), the fix will show up in the
next patch. Thanks,

Ingo

2002-01-07 19:39:29

by Ingo Molnar

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Mon, 7 Jan 2002, Linus Torvalds wrote:

> Ingo, looks true. A quick -D2?

yep, Brian is right. I've uploaded -D2:

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D2.patch

other changes:

- make rt_priority 99 map to p->prio 0, rt_priority 0 map to p->prio 99.

- display 'top' priorities correctly, 0-39 for normal processes, negative
values for RT tasks. (it works just fine it appears.) We did not use to
display the real priority of RT tasks, but now it's natural.

> Oh, and please move console_init() back, other consoles (sparc?) may
> depend on having PCI layers initialized.

(doh, done too, fix is in -D2.)

> Oh, and _I_ don't like "cpu()". What's wrong with the already
> existing "smp_processor_id()"?

nothing serious, my main problem with it is that it's often too long for
my 80 chars wide consoles, and it's also too long to type and i use it
quite often in SMP code.

IIRC we had a 'hard_smp_processor_id()' initially, partly to make it
harder to use. (it was very slow because it did an APIC read). But
these days smp_processor_id() is just as fast as (or even faster than)
'current'. So i wanted to use cpu() in new code to make it easier to read
and more compact. But if this is a problem i can remove it.
I've verified that there are no obvious namespace collisions.
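
For reference, a shorthand of the kind described here could be as trivial as
the following (an assumption about the definition, not a quote from the patch):

/* assumed shorthand only -- the O(1) patch may define it differently */
#define cpu()   smp_processor_id()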

(i've done a quick UP sanity compile + boot of 2.5.2-pre9 + D2, it all
works as expected.)

Ingo

2002-01-08 08:43:09

by FD Cami

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


Hi all

I'm joining the host of beta testers involved in that patch...

It's currently running on a production machine :
dual PII350 on ASUS P2B-DS
3 SCSI hard drives
512MB of RAM
3C905C
This is a network server running squid-cache www proxy with
a medium load (700 clients on a T3), mysqld, apache, proftpd.
kernel is stock 2.4.17 - and so far, so good.

Cheers,

François Cami


Ingo Molnar wrote:

> On Mon, 7 Jan 2002, Linus Torvalds wrote:
>
>
>>Ingo, looks true. A quick -D2?
>>
>
> yep, Brian is right. I've uploaded -D2:
>
> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D2.patch
>
> other changes:
>
> - make rt_priority 99 map to p->prio 0, rt_priority 0 map to p->prio 99.
>
> - display 'top' priorities correctly, 0-39 for normal processes, negative
> values for RT tasks. (it works just fine it appears.) We did not use to
> display the real priority of RT tasks, but now it's natural.
>
>
>>Oh, and please move console_init() back, other consoles (sparc?) may
>>depend on having PCI layers initialized.
>>
>
> (doh, done too, fix is in -D2.)
>
>
>>Oh, and _I_ don't like "cpu()". What's wrong with the already
>>existing "smp_processor_id()"?
>>
>
> nothing serious, my main problem with it is that it's often too long for
> my 80 chars wide consoles, and it's also too long to type and i use it
> quite often in SMP code.
>
> IIRC we had a 'hard_smp_processor_id()' initially, partly to make it
> harder to use it. (it was very slow because it did an APIC read). But
> these days smp_processor_id() is just as fast (or even faster) as
> 'current'. So i wanted to use cpu() in new code to make it easier to read
> and to make it more compact. But if this is a problem i can remove it.
> I've verified that there is no obvious namespace collisions.
>
> (i've done a quick UP sanity compile + boot of 2.5.2-pre9 + D2, it all
> works as expected.)
>
> Ingo



2002-01-08 11:37:13

by Anton Blanchard

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


Hi Ingo,

I tested 2.5.2-pre10 today. There is some bitop abuse that needs fixing
for big endian machines to work :)

At the moment we have:

#define BITMAP_SIZE ((MAX_PRIO+7)/8)
char bitmap[BITMAP_SIZE];

Which is initialised using:

memset(array->bitmap, 0xff, BITMAP_SIZE);
clear_bit(MAX_PRIO, array->bitmap);

This results in the following in memory (in ascending memory order):

ffffffffffffffff ffffffffffffffff fffffeffff000000

The problem here is that when we search the high word, we do so from
the right, therefore we get 128 all the time :)

The following patch fixes this. We need to define the bitmap to be in
terms of unsigned long, in this case its only lucky we have the correct
alignment. We also replace the memset of the bitmap with set_bit.

With the patch things look much better (and the kernel boots on my
ppc64 machine :)

ffffffffffffffff ffffffffffffffff 000000ffffffffff

Anton

diff -urN linuxppc_2_5/include/asm-i386/mmu_context.h linuxppc_2_5_work/include/asm-i386/mmu_context.h
--- linuxppc_2_5/include/asm-i386/mmu_context.h Tue Jan 8 17:09:47 2002
+++ linuxppc_2_5_work/include/asm-i386/mmu_context.h Tue Jan 8 22:06:35 2002
@@ -16,7 +16,7 @@
# error update this function.
#endif

-static inline int sched_find_first_zero_bit(char *bitmap)
+static inline int sched_find_first_zero_bit(unsigned long *bitmap)
{
unsigned int *b = (unsigned int *)bitmap;
unsigned int rt;
diff -urN linuxppc_2_5/kernel/sched.c linuxppc_2_5_work/kernel/sched.c
--- linuxppc_2_5/kernel/sched.c Tue Jan 8 17:09:47 2002
+++ linuxppc_2_5_work/kernel/sched.c Tue Jan 8 22:13:45 2002
@@ -20,15 +20,13 @@
#include <linux/interrupt.h>
#include <asm/mmu_context.h>

-#define BITMAP_SIZE ((MAX_PRIO+7)/8)
-
typedef struct runqueue runqueue_t;

struct prio_array {
int nr_active;
spinlock_t *lock;
runqueue_t *rq;
- char bitmap[BITMAP_SIZE];
+ unsigned long bitmap[3];
list_t queue[MAX_PRIO];
};

@@ -1306,11 +1304,12 @@
array = rq->arrays + j;
array->rq = rq;
array->lock = &rq->lock;
- for (k = 0; k < MAX_PRIO; k++)
+ for (k = 0; k < MAX_PRIO; k++) {
INIT_LIST_HEAD(array->queue + k);
- memset(array->bitmap, 0xff, BITMAP_SIZE);
+ __set_bit(k, array->bitmap);
+ }
// zero delimiter for bitsearch
- clear_bit(MAX_PRIO, array->bitmap);
+ __clear_bit(MAX_PRIO, array->bitmap);
}
}
/*

2002-01-08 11:47:43

by Anton Blanchard

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


> struct prio_array {
> int nr_active;
> spinlock_t *lock;
> runqueue_t *rq;
> - char bitmap[BITMAP_SIZE];
> + unsigned long bitmap[3];
> list_t queue[MAX_PRIO];
> };

Sorry, of course this is wrong where unsigned long is narrower than 64
bits. But you get the idea :)
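
One portable way to size the array for any word width is to round the bit
count up to whole longs, along these lines (a sketch, not necessarily the
exact fix that went into -E1):

/*
 * Enough unsigned longs for MAX_PRIO+1 bits: MAX_PRIO queue bits plus
 * the zero delimiter used by the bit search, on 32- and 64-bit alike.
 */
#define BITMAP_LONGS ((MAX_PRIO + 1 + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct prio_array {
        int nr_active;
        spinlock_t *lock;
        runqueue_t *rq;
        unsigned long bitmap[BITMAP_LONGS];
        list_t queue[MAX_PRIO];
};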

Anton

2002-01-08 12:35:53

by Ingo Molnar

Subject: [patch] O(1) scheduler, -E1, 2.5.2-pre10, 2.4.17


this is the latest update of the O(1) scheduler:

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-pre10-E1.patch

http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.4.17-E1.patch

now that Linus has put the -D2 patch into the 2.5.2-pre10 kernel, the
2.5.2-pre10-E1 patch has become quite small :-)

The patch compiles, boots & works just fine on my UP/SMP boxes.

Changes since -D2:

- make rq->bitmap big-endian safe. (Anton Blanchard)

- documented and cleaned up the load estimator bits, no functional
changes apart from small speedups.

- do init_idle() before starting up the init thread, this removes a race
where we'd run the init thread on CPU#0 before init_idle() has been
called.

Ingo

2002-01-08 12:37:43

by Ingo Molnar

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Tue, 8 Jan 2002, Anton Blanchard wrote:

> > - char bitmap[BITMAP_SIZE];
> > + unsigned long bitmap[3];
> > list_t queue[MAX_PRIO];
>
> Sorry, of course this is wrong if sizeof(unsigned long) < 64. But you
> get the idea :)

thanks, i've put the generic fix into the -E1 patch.

> With the patch things look much better (and the kernel boots on my
> ppc64 machine :)

hey it should not even compile, you forgot to send us the PPC definition
of sched_find_first_zero_bit() ;-)

Ingo

2002-01-08 18:45:05

by jjs

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

Excellent - I'm going to try this one on whatever
machines I have available for testing, and if I am
emboldened by success, I'll try it on some light
duty production servers as well -

- keep us in the loop, please!

Regards,

jjs

FD Cami wrote:

>
> Hi all
>
> I'm joining the host of beta testers involved in that patch...
>
> It's currently running on a production machine :
> dual PII350 on ASUS P2B-DS
> 3 SCSI hard drives
> 512MB of RAM
> 3C905C
> This is a network server running squid-cache www proxy with
> a medium load (700 clients on a T3), mysqld, apache, proftpd.
> kernel is stock 2.4.17 - and so far, so good.
>
> Cheers,
>
> François Cami
>
>
> Ingo Molnar wrote:
>
>> On Mon, 7 Jan 2002, Linus Torvalds wrote:
>>
>>
>>> Ingo, looks true. A quick -D2?
>>>
>>
>> yep, Brian is right. I've uploaded -D2:
>>
>> http://redhat.com/~mingo/O(1)-scheduler/sched-O1-2.5.2-D2.patch
>>
>> other changes:
>>
>> - make rt_priority 99 map to p->prio 0, rt_priority 0 map to p->prio
>> 99.
>>
>> - display 'top' priorities correctly, 0-39 for normal processes,
>> negative
>> values for RT tasks. (it works just fine it appears.) We did not
>> use to
>> display the real priority of RT tasks, but now it's natural.
>>
>>
>>> Oh, and please move console_init() back, other consoles (sparc?) may
>>> depend on having PCI layers initialized.
>>>
>>
>> (doh, done too, fix is in -D2.)
>>
>>
>>> Oh, and _I_ don't like "cpu()". What's wrong with the already
>>> existing "smp_processor_id()"?
>>>
>>
>> nothing serious, my main problem with it is that it's often too long for
>> my 80 chars wide consoles, and it's also too long to type and i use it
>> quite often in SMP code.
>>
>> IIRC we had a 'hard_smp_processor_id()' initially, partly to make it
>> harder to use it. (it was very slow because it did an APIC read). But
>> these days smp_processor_id() is just as fast (or even faster) as
>> 'current'. So i wanted to use cpu() in new code to make it easier to
>> read
>> and to make it more compact. But if this is a problem i can remove it.
>> I've verified that there is no obvious namespace collisions.
>>
>> (i've done a quick UP sanity compile + boot of 2.5.2-pre9 + D2, it all
>> works as expected.)
>>
>> Ingo


2002-01-09 04:40:02

by Mike Kravetz

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

Below are some benchmark results when running the D1 version
of the O(1) scheduler on 2.5.2-pre9. To add another data point,
I hacked together a half-a** multi-queue scheduler based on
the 2.5.2-pre9 scheduler. haMQ doesn't do load balancing or
anything fancy. However, it aggressively tries to not let CPUs
go idle (not always a good thing as has been previously discussed).
For reference, patch is at: lse.sourceforge.net/scheduling/2.5.2-pre9-hamq
I can't recommend this code for anything useful.

All benchmarks were run on an 8-way Pentium III 700 MHz 1MB caches.
Number of CPUs was altered via the maxcpus boot flag.

--------------------------------------------------------------------
mkbench - Time how long it takes to compile the kernel.
We use 'make -j 8' and increase the number of makes run
in parallel. Result is average build time in seconds.
Lower is better.
--------------------------------------------------------------------
# CPUs # Makes Vanilla O(1) haMQ
--------------------------------------------------------------------
2 1 188 192 184
2 2 366 372 362
2 4 730 742 600
2 6 1096 1112 853
4 1 102 101 95
4 2 196 198 186
4 4 384 386 374
4 6 576 579 487
8 1 58 57 58
8 2 109 108 105
8 4 209 213 186
8 6 309 312 280

Surprisingly, O(1) seems to do worse than the vanilla scheduler
in almost all cases.

--------------------------------------------------------------------
Chat - VolanoMark simulator. Result is a measure of throughput.
Higher is better.
--------------------------------------------------------------------
Configuration Parms # CPUs Vanilla O(1) haMQ
--------------------------------------------------------------------
10 rooms, 200 messages 2 162644 145915 137097
20 rooms, 200 messages 2 145872 136134 138646
30 rooms, 200 messages 2 124314 183366 144403
10 rooms, 200 messages 4 201745 258444 255415
20 rooms, 200 messages 4 177854 246032 263723
30 rooms, 200 messages 4 153506 302615 257170
10 rooms, 200 messages 8 121792 262804 310603
20 rooms, 200 messages 8 68697 248406 420157
30 rooms, 200 messages 8 42133 302513 283817

O(1) scheduler does better than Vanilla as load and number of
CPUs increase. Still need to look into why it does worse on
the less loaded 2 CPU runs.

--------------------------------------------------------------------
Reflex - lat_ctx(of LMbench) on steroids. Does token passing
to over emphasize scheduler paths. Allows loading of
the runqueue unlike lat_ctx. Result is microseconds
per round. Lower is better. All runs with 0 delay.
lse.sourceforge.net/scheduling/other/reflex/
Lower is better.
--------------------------------------------------------------------
#tasks # CPUs Vanilla O(1) haMQ
--------------------------------------------------------------------
2 2 6.594 14.388 15.996
4 2 6.988 3.787 4.686
8 2 7.322 3.757 5.148
16 2 7.234 3.737 7.244
32 2 7.651 5.135 7.182
64 2 9.462 3.948 7.553
128 2 13.889 4.584 7.918
2 4 6.019 14.646 15.403
4 4 10.997 6.213 6.755
8 4 9.838 2.160 2.838
16 4 10.595 2.154 3.080
32 4 11.870 2.917 3.400
64 4 15.280 2.890 3.131
128 4 19.832 2.685 3.307
2 8 6.338 9.064 15.474
4 8 11.454 7.020 8.281
8 8 13.354 4.390 5.816
16 8 14.976 1.502 2.018
32 8 16.757 1.920 2.240
64 8 19.961 2.264 2.358
128 8 25.010 2.280 2.260

I believe the poor showings for O(1) at the low end are the
result of having the 2 tasks run on 2 different CPUs. This
is the right thing to do in spite of the numbers. You
can see lock contention become a factor in the Vanilla scheduler
as load and number of CPUs increase. Having multiple runqueues
eliminates this problem.
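
The structural difference behind that observation is visible in the O(1)
patch itself; schematically (field names abridged from the patch quoted
elsewhere in this thread, simplified here):

/* Old scheme (sketch): one global lock serializes schedule() and
 * wakeups on every CPU, so contention grows with load and CPU count. */
extern spinlock_t runqueue_lock;

/* O(1) scheme (sketch): one runqueue per CPU, each with its own lock,
 * padded out to a cache line to avoid false sharing between CPUs. */
static struct runqueue {
        spinlock_t lock;
        unsigned long nr_running, nr_switches;
        task_t *curr, *idle;
        prio_array_t *active, *expired, arrays[2];
        char __pad[SMP_CACHE_BYTES];
} runqueues[NR_CPUS] __cacheline_aligned;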

--
Mike

2002-01-09 05:00:30

by Davide Libenzi

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Tue, 8 Jan 2002, Mike Kravetz wrote:

> Below are some benchmark results when running the D1 version
> of the O(1) scheduler on 2.5.2-pre9. To add another data point,
> I hacked together a half-a** multi-queue scheduler based on
> the 2.5.2-pre9 scheduler. haMQ doesn't do load balancing or
> anything fancy. However, it aggressively tries to not let CPUs
> go idle (not always a good thing as has been previously discussed).
> For reference, patch is at: lse.sourceforge.net/scheduling/2.5.2-pre9-hamq
> I can't recommend this code for anything useful.
>
> All benchmarks were run on an 8-way Pentium III 700 MHz 1MB caches.
> Number of CPUs was altered via the maxcpus boot flag.
>
> --------------------------------------------------------------------
> mkbench - Time how long it takes to compile the kernel.
> We use 'make -j 8' and increase the number of makes run
> in parallel. Result is average build time in seconds.
> Lower is better.
> --------------------------------------------------------------------
> # CPUs # Makes Vanilla O(1) haMQ
> --------------------------------------------------------------------
> 2 1 188 192 184
> 2 2 366 372 362
> 2 4 730 742 600
> 2 6 1096 1112 853
> 4 1 102 101 95
> 4 2 196 198 186
> 4 4 384 386 374
> 4 6 576 579 487
> 8 1 58 57 58
> 8 2 109 108 105
> 8 4 209 213 186
> 8 6 309 312 280
>
> Surprisingly, O(1) seems to do worse than the vanilla scheduler
> in almost all cases.
>
> --------------------------------------------------------------------
> Chat - VolanoMark simulator. Result is a measure of throughput.
> Higher is better.
> --------------------------------------------------------------------
> Configuration Parms # CPUs Vanilla O(1) haMQ
> --------------------------------------------------------------------
> 10 rooms, 200 messages 2 162644 145915 137097
> 20 rooms, 200 messages 2 145872 136134 138646
> 30 rooms, 200 messages 2 124314 183366 144403
> 10 rooms, 200 messages 4 201745 258444 255415
> 20 rooms, 200 messages 4 177854 246032 263723
> 30 rooms, 200 messages 4 153506 302615 257170
> 10 rooms, 200 messages 8 121792 262804 310603
> 20 rooms, 200 messages 8 68697 248406 420157
> 30 rooms, 200 messages 8 42133 302513 283817
>
> O(1) scheduler does better than Vanilla as load and number of
> CPUs increase. Still need to look into why it does worse on
> the less loaded 2 CPU runs.
>
> --------------------------------------------------------------------
> Reflex - lat_ctx(of LMbench) on steroids. Does token passing
> to over emphasize scheduler paths. Allows loading of
> the runqueue unlike lat_ctx. Result is microseconds
> per round. Lower is better. All runs with 0 delay.
> lse.sourceforge.net/scheduling/other/reflex/
> Lower is better.
> --------------------------------------------------------------------
> #tasks # CPUs Vanilla O(1) haMQ
> --------------------------------------------------------------------
> 2 2 6.594 14.388 15.996
> 4 2 6.988 3.787 4.686
> 8 2 7.322 3.757 5.148
> 16 2 7.234 3.737 7.244
> 32 2 7.651 5.135 7.182
> 64 2 9.462 3.948 7.553
> 128 2 13.889 4.584 7.918
> 2 4 6.019 14.646 15.403
> 4 4 10.997 6.213 6.755
> 8 4 9.838 2.160 2.838
> 16 4 10.595 2.154 3.080
> 32 4 11.870 2.917 3.400
> 64 4 15.280 2.890 3.131
> 128 4 19.832 2.685 3.307
> 2 8 6.338 9.064 15.474
> 4 8 11.454 7.020 8.281
> 8 8 13.354 4.390 5.816
> 16 8 14.976 1.502 2.018
> 32 8 16.757 1.920 2.240
> 64 8 19.961 2.264 2.358
> 128 8 25.010 2.280 2.260
>
> I believe the poor showings for O(1) at the low end are the
> result of having the 2 tasks run on 2 different CPUs. This
> is the right thing to do in spite of the numbers. You
> can see lock contention become a factor in the Vanilla scheduler
> as load and number of CPUs increase. Having multiple runqueues
> eliminates this problem.

Awesome job Mike. Ingo's O(1) scheduler is still 'young' and can be
improved, especially from a balancing point of view. I think that it's
here that the real challenge will take place ( even if Linus keeps saying
that it's easy :-) ).
Mike, can you try the patch listed below on a custom pre-10 ?
I've got 30-70% better performance with the chat_s/c test.




PS: next time we have lunch I'll tell you about a wonderful tool called
gnuplot :)



- Davide




diff -Nru linux-2.5.2-pre10.vanilla/include/linux/sched.h linux-2.5.2-pre10.mqo1/include/linux/sched.h
--- linux-2.5.2-pre10.vanilla/include/linux/sched.h Mon Jan 7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/include/linux/sched.h Mon Jan 7 21:45:19 2002
@@ -305,11 +305,7 @@
prio_array_t *array;

unsigned int time_slice;
- unsigned long sleep_timestamp, run_timestamp;
-
- #define SLEEP_HIST_SIZE 4
- int sleep_hist[SLEEP_HIST_SIZE];
- int sleep_idx;
+ unsigned long swap_cnt_last;

unsigned long policy;
unsigned long cpus_allowed;
diff -Nru linux-2.5.2-pre10.vanilla/kernel/fork.c linux-2.5.2-pre10.mqo1/kernel/fork.c
--- linux-2.5.2-pre10.vanilla/kernel/fork.c Mon Jan 7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/kernel/fork.c Mon Jan 7 18:49:34 2002
@@ -705,9 +705,6 @@
current->time_slice = 1;
expire_task(current);
}
- p->sleep_timestamp = p->run_timestamp = jiffies;
- memset(p->sleep_hist, 0, sizeof(p->sleep_hist[0])*SLEEP_HIST_SIZE);
- p->sleep_idx = 0;
__restore_flags(flags);

/*
diff -Nru linux-2.5.2-pre10.vanilla/kernel/sched.c linux-2.5.2-pre10.mqo1/kernel/sched.c
--- linux-2.5.2-pre10.vanilla/kernel/sched.c Mon Jan 7 17:12:45 2002
+++ linux-2.5.2-pre10.mqo1/kernel/sched.c Tue Jan 8 18:28:02 2002
@@ -48,6 +48,7 @@
spinlock_t lock;
unsigned long nr_running, nr_switches, last_rt_event;
task_t *curr, *idle;
+ unsigned long swap_cnt;
prio_array_t *active, *expired, arrays[2];
char __pad [SMP_CACHE_BYTES];
} runqueues [NR_CPUS] __cacheline_aligned;
@@ -91,115 +92,20 @@
p->array = array;
}

-/*
- * This is the per-process load estimator. Processes that generate
- * more load than the system can handle get a priority penalty.
- *
- * The estimator uses a 4-entry load-history ringbuffer which is
- * updated whenever a task is moved to/from the runqueue. The load
- * estimate is also updated from the timer tick to get an accurate
- * estimation of currently executing tasks as well.
- */
-#define NEXT_IDX(idx) (((idx) + 1) % SLEEP_HIST_SIZE)
-
-static inline void update_sleep_avg_deactivate(task_t *p)
-{
- unsigned int idx;
- unsigned long j = jiffies, last_sample = p->run_timestamp / HZ,
- curr_sample = j / HZ, delta = curr_sample - last_sample;
-
- if (unlikely(delta)) {
- if (delta < SLEEP_HIST_SIZE) {
- for (idx = 0; idx < delta; idx++) {
- p->sleep_idx++;
- p->sleep_idx %= SLEEP_HIST_SIZE;
- p->sleep_hist[p->sleep_idx] = 0;
- }
- } else {
- for (idx = 0; idx < SLEEP_HIST_SIZE; idx++)
- p->sleep_hist[idx] = 0;
- p->sleep_idx = 0;
- }
- }
- p->sleep_timestamp = j;
-}
-
-#if SLEEP_HIST_SIZE != 4
-# error update this code.
-#endif
-
-static inline unsigned int get_sleep_avg(task_t *p, unsigned long j)
-{
- unsigned int sum;
-
- sum = p->sleep_hist[0];
- sum += p->sleep_hist[1];
- sum += p->sleep_hist[2];
- sum += p->sleep_hist[3];
-
- return sum * HZ / ((SLEEP_HIST_SIZE-1)*HZ + (j % HZ));
-}
-
-static inline void update_sleep_avg_activate(task_t *p, unsigned long j)
-{
- unsigned int idx;
- unsigned long delta_ticks, last_sample = p->sleep_timestamp / HZ,
- curr_sample = j / HZ, delta = curr_sample - last_sample;
-
- if (unlikely(delta)) {
- if (delta < SLEEP_HIST_SIZE) {
- p->sleep_hist[p->sleep_idx] += HZ - (p->sleep_timestamp % HZ);
- p->sleep_idx++;
- p->sleep_idx %= SLEEP_HIST_SIZE;
-
- for (idx = 1; idx < delta; idx++) {
- p->sleep_idx++;
- p->sleep_idx %= SLEEP_HIST_SIZE;
- p->sleep_hist[p->sleep_idx] = HZ;
- }
- } else {
- for (idx = 0; idx < SLEEP_HIST_SIZE; idx++)
- p->sleep_hist[idx] = HZ;
- p->sleep_idx = 0;
- }
- p->sleep_hist[p->sleep_idx] = 0;
- delta_ticks = j % HZ;
- } else
- delta_ticks = j - p->sleep_timestamp;
- p->sleep_hist[p->sleep_idx] += delta_ticks;
- p->run_timestamp = j;
-}
-
static inline void activate_task(task_t *p, runqueue_t *rq)
{
prio_array_t *array = rq->active;
- unsigned long j = jiffies;
- unsigned int sleep, load;
- int penalty;

- if (likely(p->run_timestamp == j))
- goto enqueue;
- /*
- * Give the process a priority penalty if it has not slept often
- * enough in the past. We scale the priority penalty according
- * to the current load of the runqueue, and the 'load history'
- * this process has. Eg. if the CPU has 3 processes running
- * right now then a process that has slept more than two-thirds
- * of the time is considered to be 'interactive'. The higher
- * the load of the CPUs is, the easier it is for a process to
- * get an non-interactivity penalty.
- */
-#define MAX_PENALTY (MAX_USER_PRIO/3)
- update_sleep_avg_activate(p, j);
- sleep = get_sleep_avg(p, j);
- load = HZ - sleep;
- penalty = (MAX_PENALTY * load)/HZ;
if (!rt_task(p)) {
- p->prio = NICE_TO_PRIO(p->__nice) + penalty;
- if (p->prio > MAX_PRIO-1)
- p->prio = MAX_PRIO-1;
+ unsigned long prio_bonus = rq->swap_cnt - p->swap_cnt_last;
+
+ p->swap_cnt_last = rq->swap_cnt;
+ if (prio_bonus > MAX_PRIO)
+ prio_bonus = MAX_PRIO;
+ p->prio -= prio_bonus;
+ if (p->prio < MAX_RT_PRIO)
+ p->prio = MAX_RT_PRIO;
}
-enqueue:
enqueue_task(p, array);
rq->nr_running++;
}
@@ -209,7 +115,6 @@
rq->nr_running--;
dequeue_task(p, p->array);
p->array = NULL;
- update_sleep_avg_deactivate(p);
}

static inline void resched_task(task_t *p)
@@ -535,33 +440,16 @@
p->need_resched = 1;
if (rt_task(p))
p->time_slice = RT_PRIO_TO_TIMESLICE(p->prio);
- else
+ else {
p->time_slice = PRIO_TO_TIMESLICE(p->prio);
-
- /*
- * Timeslice used up - discard any possible
- * priority penalty:
- */
- dequeue_task(p, rq->active);
- /*
- * Tasks that have nice values of -20 ... -15 are put
- * back into the active array. If they use up too much
- * CPU time then they'll get a priority penalty anyway
- * so this can not starve other processes accidentally.
- * Otherwise this is pretty handy for sysadmins ...
- */
- if (p->prio <= MAX_RT_PRIO + MAX_PENALTY/2)
- enqueue_task(p, rq->active);
- else
+ /*
+ * Timeslice used up - discard any possible
+ * priority penalty:
+ */
+ dequeue_task(p, rq->active);
+ if (++p->prio >= MAX_PRIO)
+ p->prio = MAX_PRIO - 1;
enqueue_task(p, rq->expired);
- } else {
- /*
- * Deactivate + activate the task so that the
- * load estimator gets updated properly:
- */
- if (!rt_task(p)) {
- deactivate_task(p, rq);
- activate_task(p, rq);
}
}
load_balance(rq);
@@ -616,6 +504,7 @@
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
+ rq->swap_cnt++;
}

idx = sched_find_first_zero_bit(array->bitmap);
@@ -1301,6 +1190,7 @@
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
rq->cpu = i;
+ rq->swap_cnt = 0;

for (j = 0; j < 2; j++) {
array = rq->arrays + j;


2002-01-09 05:35:04

by Rusty Russell

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Tue, 8 Jan 2002 21:05:23 -0800 (PST)
Davide Libenzi <[email protected]> wrote:
> Mike can you try the patch listed below on custom pre-10 ?
> I've got 30-70% better performances with the chat_s/c test.

I'd encourage you to use hackbench, which is basically "the part of chat_c/s
that is interesting".

And I'd encourage you to come up with a better name, too 8)

Cheers,
Rusty.

/* Simple scheduler test. */
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <sys/poll.h>

static int use_pipes = 0;

static void barf(const char *msg)
{
        fprintf(stderr, "%s (error: %s)\n", msg, strerror(errno));
        exit(1);
}

static void fdpair(int fds[2])
{
        if (use_pipes) {
                if (pipe(fds) == 0)
                        return;
        } else {
                if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) == 0)
                        return;
        }
        barf("Creating fdpair");
}

/* Block until we're ready to go */
static void ready(int ready_out, int wakefd)
{
        char dummy;
        struct pollfd pollfd = { .fd = wakefd, .events = POLLIN };

        /* Tell them we're ready. */
        if (write(ready_out, &dummy, 1) != 1)
                barf("CLIENT: ready write");

        /* Wait for "GO" signal */
        if (poll(&pollfd, 1, -1) != 1)
                barf("poll");
}

static void reader(int ready_out, int wakefd, unsigned int loops, int fd)
{
        char dummy;
        unsigned int i;

        ready(ready_out, wakefd);

        for (i = 0; i < loops; i++) {
                if (read(fd, &dummy, 1) != 1)
                        barf("READER: read");
        }
}

/* Start the server */
static void server(int ready_out, int wakefd,
                   unsigned int loops, unsigned int num_fds)
{
        unsigned int i;
        int write_fds[num_fds];
        unsigned int counters[num_fds];

        for (i = 0; i < num_fds; i++) {
                int fds[2];

                fdpair(fds);
                switch (fork()) {
                case -1: barf("fork()");
                case 0:
                        close(fds[1]);
                        reader(ready_out, wakefd, loops, fds[0]);
                        exit(0);
                }
                close(fds[0]);
                write_fds[i] = fds[1];
                if (fcntl(write_fds[i], F_SETFL, O_NONBLOCK) != 0)
                        barf("fcntl NONBLOCK");

                counters[i] = 0;
        }

        ready(ready_out, wakefd);

        for (i = 0; i < loops * num_fds;) {
                unsigned int j;
                char dummy;

                for (j = 0; j < num_fds; j++) {
                        if (counters[j] < loops) {
                                if (write(write_fds[j], &dummy, 1) == 1) {
                                        counters[j]++;
                                        i++;
                                } else if (errno != EAGAIN)
                                        barf("write");
                        }
                }
        }

        /* Reap them all */
        for (i = 0; i < num_fds; i++) {
                int status;
                wait(&status);
                if (!WIFEXITED(status))
                        exit(1);
        }
        exit(0);
}

int main(int argc, char *argv[])
{
        unsigned int i;
        struct timeval start, stop, diff;
        unsigned int num_fds;
        int readyfds[2], wakefds[2];
        char dummy;
        int status;

        if (argv[1] && strcmp(argv[1], "-pipe") == 0) {
                use_pipes = 1;
                argc--;
                argv++;
        }

        if (argc != 2 || (num_fds = atoi(argv[1])) == 0)
                barf("Usage: hackbench2 [-pipe] <num pipes>\n");

        fdpair(readyfds);
        fdpair(wakefds);

        switch (fork()) {
        case -1: barf("fork()");
        case 0:
                server(readyfds[1], wakefds[0], 10000, num_fds);
                exit(0);
        }

        /* Wait for everyone to be ready */
        for (i = 0; i < num_fds+1; i++)
                if (read(readyfds[0], &dummy, 1) != 1)
                        barf("Reading for readyfds");

        gettimeofday(&start, NULL);

        /* Kick them off */
        if (write(wakefds[1], &dummy, 1) != 1)
                barf("Writing to start them");

        /* Reap server */
        wait(&status);
        if (!WIFEXITED(status))
                exit(1);

        gettimeofday(&stop, NULL);

        /* Print time... */
        timersub(&stop, &start, &diff);
        printf("Time: %lu.%03lu\n", diff.tv_sec, diff.tv_usec/1000);
        exit(0);
}
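
A typical invocation, assuming the source above is saved as hackbench.c
(the file name and group count are only examples):

$ gcc -O2 -Wall -o hackbench hackbench.c
$ ./hackbench 20        # 20 reader/writer groups over socketpairs
$ ./hackbench -pipe 20  # the same workload over pipes
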
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

2002-01-09 06:29:22

by Brian

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

Can this be correct?

Intuitively, I would expect several CPUs hammering away at the compile to
finish faster than one. Given these numbers, I would have to conclude
that is not just wrong, but absolutely wrong. Compile time increases
linearly with the number of jobs, regardless of the number of CPUs.

What would cause this? Severe memory bottlenecks?

-- Brian

On Tuesday 08 January 2002 10:39 pm, Mike Kravetz wrote:
> --------------------------------------------------------------------
> mkbench - Time how long it takes to compile the kernel.
> We use 'make -j 8' and increase the number of makes run
> in parallel. Result is average build time in seconds.
> Lower is better.
> --------------------------------------------------------------------
> # CPUs # Makes Vanilla O(1) haMQ
> --------------------------------------------------------------------
> 2 1 188 192 184
> 2 2 366 372 362
> 2 4 730 742 600
> 2 6 1096 1112 853
> 4 1 102 101 95
> 4 2 196 198 186
> 4 4 384 386 374
> 4 6 576 579 487
> 8 1 58 57 58
> 8 2 109 108 105
> 8 4 209 213 186
> 8 6 309 312 280

2002-01-09 06:42:04

by Jeffrey W. Baker

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Wed, 9 Jan 2002, Brian wrote:

> Can this be correct?
>
> Intuitively, I would expect several CPUs hammering away at the compile to
> finish faster than one. Given these numbers, I would have to conclude
> that is not just wrong, but absolutely wrong. Compile time increases
> linearly with the number of jobs, regardless of the number of CPUs.
>
> What would cause this? Severe memory bottlenecks?

Mike ran make -j 8 which means 8 compiler processes for each "# Makes" in
the table. Thus, the first row has 8 parallel processes on a 2-way and
the last row has 48 processes on an 8-way. The best ratio is 8 processes
on an 8-way which not incidentally also has the lowest time: 57 seconds.

-jwb

>
> -- Brian
>
> On Tuesday 08 January 2002 10:39 pm, Mike Kravetz wrote:
> > --------------------------------------------------------------------
> > mkbench - Time how long it takes to compile the kernel.
> > We use 'make -j 8' and increase the number of makes run
> > in parallel. Result is average build time in seconds.
> > Lower is better.
> > --------------------------------------------------------------------
> > # CPUs # Makes Vanilla O(1) haMQ
> > --------------------------------------------------------------------
> > 2 1 188 192 184
> > 2 2 366 372 362
> > 2 4 730 742 600
> > 2 6 1096 1112 853
> > 4 1 102 101 95
> > 4 2 196 198 186
> > 4 4 384 386 374
> > 4 6 576 579 487
> > 8 1 58 57 58
> > 8 2 109 108 105
> > 8 4 209 213 186
> > 8 6 309 312 280

2002-01-09 06:45:53

by Ryan Cumming

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On January 8, 2002 22:29, Brian wrote:
> Can this be correct?
>
> Intuitively, I would expect several CPUs hammering away at the compile to
> finish faster than one. Given these numbers, I would have to conclude
> that is not just wrong, but absolutely wrong. Compile time increases
> linearly with the number of jobs, regardless of the number of CPUs.

In the charts in the original message, he's not increasing the number of
jobs, but the number of concurrent 'make -j8's. Two makes should really
finish in half the time one make does. I don't see any problem with the
results.

-Ryan

2002-01-09 06:48:23

by Ryan Cumming

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On January 8, 2002 22:45, Ryan Cumming wrote:
> In the charts in the original message, he's not increasing the number of
> jobs, but the number of concurrent 'make -j8's. Two makes should really
> finish in half the time one make does. I don't see any problem with the
> results.
Er, I meant finish in twice the time one make does... really... ;)

-Ryan

2002-01-09 08:28:38

by Ingo Molnar

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Tue, 8 Jan 2002, Mike Kravetz wrote:

> --------------------------------------------------------------------
> Chat - VolanoMark simulator. Result is a measure of throughput.
> Higher is better.

very interesting numbers, nice work Mike! I'd suggest the following
additional test: please also run tests like VolanoMark with 'nice -n 19'.
The O(1) scheduler's task-penalty method works in our favor in this case,
since we know the test is CPU-bound we can move all processes to nice
level 19.

Ingo

2002-01-09 09:40:41

by Ingo Molnar

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Tue, 8 Jan 2002, Davide Libenzi wrote:

> Mike can you try the patch listed below on custom pre-10 ?
> I've got 30-70% better performances with the chat_s/c test.

i've compared this patch of yours (which changes the way interactivity is
detected and timeslices are distributed), to 2.5.2-pre10-vanilla on a
2-way 466 MHz Celeron box:

davide-patch-2.5.2-pre10 running at default priority:

# ./chat_s 127.0.0.1
# ./chat_c 127.0.0.1 10 1000

Average throughput : 123103 messages per second
Average throughput : 105122 messages per second
Average throughput : 112901 messages per second

[ system is *unusable* interactively, during the whole test. ]

davide-patch-2.5.2-pre10 running at nice level 19:

# nice -n 19 ./chat_s 127.0.0.1
# nice -n 19 ./chat_c 127.0.0.1 10 1000

Average throughput : 109337 messages per second
Average throughput : 122077 messages per second
Average throughput : 105296 messages per second

[ system is *unusable* interactively, despite renicing. ]

2.5.2-pre10-vanilla running the test at the default priority level:

# ./chat_s 127.0.0.1
# ./chat_c 127.0.0.1 10 1000

Average throughput : 124676 messages per second
Average throughput : 102244 messages per second
Average throughput : 115841 messages per second

[ system is unresponsive at the start of the test, but
once the 2.5.2-pre10 load-estimator establishes which task is
interactive and which one is not, the system becomes usable.
Load can be felt and there are frequent delays in commands. ]

2.5.2-pre10-vanilla running at nice level 19:

# nice -n 19 ./chat_s 127.0.0.1
# nice -n 19 ./chat_c 127.0.0.1 10 1000

Average throughput : 214626 messages per second
Average throughput : 220876 messages per second
Average throughput : 225529 messages per second

[ system is usable from the beginning - nice levels are working as
expected. Load can be felt while executing shell commands, but the
system is usable. Load cannot be felt in truly interactive
applications like editors. ]

Summary of throughput results: 2.5.2-pre10-vanilla is equivalent
throughput-wise to your patched kernel in this test, but the vanilla
kernel is about 100% faster than your patched kernel when running reniced.

but the interactivity observations are the real showstoppers in my
opinion. With your patch applied the system became *unbearably* slow
during the test.

i have three observations about why your patch causes these effects (we had
email discussions about this topic in private already, so you probably
know my position):

- your patch adds the 'recalculation based priority distribution
method' that is in 2.5.2-pre9 to the O(1) scheduler. (2.5.2-pre9's
priority distribution scheme is an improved but conceptually
equivalent version of the priority distribution scheme of 2.4.17 -
a scheme that was basically unchanged since 1991. )

originally the O(1) patch was using the priority distribution scheme of
2.5.2-pre9 (it's very easy to switch between the two methods), but i
have changed it because:

there is a flaw in the recalculation-based (array-switch based in O(1)
scheduler terms) priority distribution scheme: interactive tasks will
get new timeslices depending on the frequency of recalculations. But
*exactly under load*, the frequency of recalculations gets very, very
low - it can be more than 10 seconds. In the above test this property
causes shell interactivity to degrade so dramatically. Interactive
tasks might accumulate up to 64 timeslices, but it's easy for them to
use up this reserve in such high load situations and they'll never get
back any new timeslices. Mike, do you agree with this analysis? [if
anyone wants to look at the new estimator code then please apply the
-E1 patch to -pre10, which cleans up the estimator code and comments
it, without changing functionality.]

- your patch in essence makes the scheduler ignore things like nice
level +19. We *used to* ignore nice levels, but with the new load
estimator this has changed, and personally i dont think i want to go
back to the old behavior.

- the system i tested has a more than twice as slow CPU as yours. So i'd
suggest for you to repeat those exact tests but increase the number of
'rooms' to something like 40 (i know you tried 20 rooms, i dont think
it's enough), and increase the number of messages sent, from 1000 to
5000 or something like that.

your patch indeed decreases the load estimation and interactivity
detection overhead and code complexity - but as the above tests have
shown, at the price of interactivity, and in some cases even at the price
of throughput.

Ingo

2002-01-09 11:20:43

by Rene Rebe

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

Hi.

From: Ingo Molnar <[email protected]>
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17
Date: Wed, 9 Jan 2002 12:37:46 +0100 (CET)

[...]

> 2.5.2-pre10-vanilla running the test at the default priority level:
>
> # ./chat_s 127.0.0.1
> # ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 124676 messages per second
> Average throughput : 102244 messages per second
> Average throughput : 115841 messages per second
>
> [ system is unresponsive at the start of the test, but
> once the 2.5.2-pre10 load-estimator establishes which task is
> interactive and which one is not, the system becomes usable.
> Load can be felt and there are frequent delays in commands. ]
>
> 2.5.2-pre10-vanilla running at nice level 19:
>
> # nice -n 19 ./chat_s 127.0.0.1
> # nice -n 19 ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 214626 messages per second
> Average throughput : 220876 messages per second
> Average throughput : 225529 messages per second
>
> [ system is usable from the beginning - nice levels are working as
> expected. Load can be felt while executing shell commands, but the
> system is usable. Load cannot be felt in truly interactive
> applications like editors.
>
> Summary of throughput results: 2.5.2-pre10-vanilla is equivalent
> throughput-wise in the test with your patched kernel, but the vanilla
> kernel is about 100% faster than your patched kernel when running reniced.

Could someone tell a non-kernel-hacker why this benchmark is nearly
twice as fast when running reniced??? Shouldn't it be slower when it
runs with lower priority (And you execute / type some commands during
it)?

[...]

> Ingo

k33p h4ck1n6
René

--
René Rebe (Registered Linux user: #248718 <http://counter.li.org>)

eMail: [email protected]
[email protected]

Homepage: http://www.tfh-berlin.de/~s712059/index.html

Anyone sending unwanted advertising e-mail to this address will be
charged $25 for network traffic and computing time. By extracting my
address from this message or its header, you agree to these terms.

2002-01-09 15:35:06

by Ryan Cumming

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On January 9, 2002 03:19, Rene Rebe wrote:
> Could someone tell a non-kernel-hacker why this benchmark is nearly
> twice as fast when running reniced??? Shouldn't it be slower when it
> runs with lower priority (And you execute / type some commands during
> it)?

In addition to using the nice level as a priority hint, the new scheduler
also uses it as a hint of how "CPU-bound" a process is. Negative (higher
priority) nice levels give the process short, frequent timeslices. Positive
nice levels give the process long, infrequent timeslices. On an otherwise
(mostly) idle system, both processes will get the same amount of CPU time,
but distributed in a different way.

In applications that really don't care about interactivity, the long time
slice will increase their efficiency greatly. In addition to having fewer
context switches (and therefore less context switch overhead), the longer
time slices give them more time to warm up the cache. This has been referred
to as "batching", as the process is executing at once what would normally
take many shorter timeslices to complete.

So, what you're actually seeing is the reniced task not taking up more CPU
time (it's probably actually using slightly less), just using the CPU time
more efficiently.
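
As an illustration of that idea only (the constants are invented here, and the
actual PRIO_TO_TIMESLICE() in the O(1) patches works on p->prio rather than on
the raw nice value):

/* Illustrative linear mapping: positive nice -> longer, rarer slices. */
#define MIN_TIMESLICE   ( 10 * HZ / 1000)       /* roughly 10 ms  */
#define MAX_TIMESLICE   (300 * HZ / 1000)       /* roughly 300 ms */

static inline unsigned int nice_to_timeslice(int nice)  /* -20 .. +19 */
{
        return MIN_TIMESLICE +
                (MAX_TIMESLICE - MIN_TIMESLICE) * (nice + 20) / 39;
}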

<worships Ingo>

-Ryan

2002-01-09 17:57:06

by Davide Libenzi

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Wed, 9 Jan 2002, Rusty Russell wrote:

> On Tue, 8 Jan 2002 21:05:23 -0800 (PST)
> Davide Libenzi <[email protected]> wrote:
> > Mike can you try the patch listed below on custom pre-10 ?
> > I've got 30-70% better performances with the chat_s/c test.
>
> I'd encourage you to use hackbench, which is basically "the part of chat_c/s
> that is interesting".
>
> And I'd encourage you to come up with a better name, too 8)

Got it. I'll try.



- Davide


2002-01-09 18:20:18

by Davide Libenzi

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Wed, 9 Jan 2002, Ingo Molnar wrote:

>
> On Tue, 8 Jan 2002, Davide Libenzi wrote:
>
> > Mike can you try the patch listed below on custom pre-10 ?
> > I've got 30-70% better performances with the chat_s/c test.
>
> i've compared this patch of yours (which changes the way interactivity is
> detected and timeslices are distributed), to 2.5.2-pre10-vanilla on a
> 2-way 466 MHz Celeron box:
>
> davide-patch-2.5.2-pre10 running at default priority:
>
> # ./chat_s 127.0.0.1
> # ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 123103 messages per second
> Average throughput : 105122 messages per second
> Average throughput : 112901 messages per second
>
> [ system is *unusable* interactively, during the whole test. ]
>
> davide-patch-2.5.2-pre10 running at nice level 19:
>
> # nice -n 19 ./chat_s 127.0.0.1
> # nice -n 19 ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 109337 messages per second
> Average throughput : 122077 messages per second
> Average throughput : 105296 messages per second
>
> [ system is *unusable* interactively, despite renicing. ]
>
> 2.5.2-pre10-vanilla running the test at the default priority level:
>
> # ./chat_s 127.0.0.1
> # ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 124676 messages per second
> Average throughput : 102244 messages per second
> Average throughput : 115841 messages per second
>
> [ system is unresponsive at the start of the test, but
> once the 2.5.2-pre10 load-estimator establishes which task is
> interactive and which one is not, the system becomes usable.
> Load can be felt and there are frequent delays in commands. ]
>
> 2.5.2-pre10-vanilla running at nice level 19:
>
> # nice -n 19 ./chat_s 127.0.0.1
> # nice -n 19 ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 214626 messages per second
> Average throughput : 220876 messages per second
> Average throughput : 225529 messages per second
>
> [ system is usable from the beginning - nice levels are working as
> expected. Load can be felt while executing shell commands, but the
> system is usable. Load cannot be felt in truly interactive
> applications like editors.
>
> Summary of throughput results: 2.5.2-pre10-vanilla is equivalent
> throughput-wise in the test with your patched kernel, but the vanilla
> kernel is about 100% faster than your patched kernel when running reniced.
>
> but the interactivity observations are the real showstoppers in my
> opinion. With your patch applied the system became *unbearably* slow
> during the test.

Ingo, this is not the picture that i've got from my machine.
-------------------------------------------------------------------
AMD Athlon 1GHz 256 Mb RAM, swap_cnt patch :

# nice -n 19 chat_s 127.0.0.1 &
# nice -n 19 chat_c 127.0.0.1 20 1000

125236
123988
128048

with :

   r  b  w   swpd   free   buff  cache  si  so  bi   bo   in     cs  us  sy  id
 198  0  0   1476  28996   8024  89408   0   0   0  108  812  19424  12  87   1
 216  0  1   1476  32388   8024  89412   0   0   0    0  523  56344   9  91   0
 134  0  1   1476  32812   8024  89412   0   0   0    0  578  32374   9  91   0
  96  1  1   1476  33540   8024  89412   0   0   0    0  114   7910  13  87   0
  81  0  0   1476  35412   8024  89420   0   0   0   12  657  54034  12  88   0


pre-10 :

135684
127456
132420

the niced -20 vmstat did not get to run for the whole test time, and the
system seemed quite bad ( a personal feeling, not for the whole test time
but for 1-2 sec spots ) compared with the previous test. The whole point,
Ingo, is that during the test we've had 200 tasks on the run queue with a
cs of 8000..50000 !!?

AMD Athlon 1GHz, swap_cnt patch :

# chat_s 127.0.0.1 &
# chat_c 127.0.0.1 20 1000

118386
114464
117972


pre-10 :

90066
88234
92612

I was not able to identify any interactive feel difference here.
----------------------------------------------------------------------

Today i'll try the same on both my dual cpu system ( PIII 733 and PIII 1GHz )
I really fail to understand why you're asking everyone to run your test reniced ?!?



> - your patch in essence makes the scheduler ignore things like nice
> level +19. We *used to* ignore nice levels, but with the new load
> estimator this has changed, and personally i dont think i want to go
> back to the old behavior.

Ingo, for the duration of the test the `nice -n 20 vmstat -n 1` never got
to run for about 20 seconds.
With the swap_cnt correction it ran 5-6 times.



> - the system i tested has a more than twice as slow CPU as yours. So i'd
> suggest for you to repeat those exact tests but increase the number of
> 'rooms' to something like 40 (i know you tried 20 rooms, i dont think
> it's enough), and increase the number of messages sent, from 1000 to
> 5000 or something like that.

Ingo, with 20 rooms my system was loaded with more than 200 tasks on the
run queue and was switching at 50000 times/sec.
Don't you think that it's enough for a single cpu system ??!!



> your patch indeed decreases the load estimation and interactivity
> detection overhead and code complexity - but as the above tests have
> shown, at the price of interactivity, and in some cases even at the price
> of throughput.

Ingo, I tried to be as impartial as possible, and during the test I was
not able to identify any difference in system usability.
As I wrote you in private, the only spot of system unusability I've had
was running with stock pre10 ( but this could have happened occasionally ).




- Davide





2002-01-09 19:26:53

by Ingo Molnar

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Wed, 9 Jan 2002, Davide Libenzi wrote:

> the niced -20 vmstat has not been run for the whole test time and the
[...]

> Ingo for the duration of the test the `nice -n 20 vmstat -n 1` never
> run for about the 20 seconds. With the swap_cnt correction it ran for
> 5-6 times.

no wonder, it should be 'nice -n -20 vmstat -n 1'. And you should also do
a 'renice -20 $$ $PPID' before running vmstat. (if you are about to run
comparisons, i'd suggest the -G1 patch so you'll have all the recent
fixes.)

Ingo

2002-01-09 19:39:23

by Mike Kravetz

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Wed, Jan 09, 2002 at 10:24:00PM +0100, Ingo Molnar wrote:
> (if you are about to run
> comparisons, i'd suggest the -G1 patch so you'll have all the recent
> fixes.)

I just kicked off another benchmark run to compare pre10, pre10 & G1
patch, pre10 & Davide's patch. chat and make will be run as before
with the addition of chat reniced. I won't attempt to make any claims
about interactive responsiveness. Simple throughput numbers. Results
should be available in about 24 hours.

--
Mike

2002-01-09 20:18:26

by Linus Torvalds

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Wed, 9 Jan 2002, Ingo Molnar wrote:
>
> 2.5.2-pre10-vanilla running the test at the default priority level:
>
> # ./chat_s 127.0.0.1
> # ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 124676 messages per second
> Average throughput : 102244 messages per second
> Average throughput : 115841 messages per second
>
> [ system is unresponsive at the start of the test, but
> once the 2.5.2-pre10 load-estimator establishes which task is
> interactive and which one is not, the system becomes usable.
> Load can be felt and there are frequent delays in commands. ]
>
> 2.5.2-pre10-vanilla running at nice level 19:
>
> # nice -n 19 ./chat_s 127.0.0.1
> # nice -n 19 ./chat_c 127.0.0.1 10 1000
>
> Average throughput : 214626 messages per second
> Average throughput : 220876 messages per second
> Average throughput : 225529 messages per second
>
> [ system is usable from the beginning - nice levels are working as
> expected. Load can be felt while executing shell commands, but the
> system is usable. Load cannot be felt in truly interactive
> applications like editors.

Ingo, there's something wrong there.

Not a way in hell should "nice 19" cause the throughput to improve like
that. It looks like this is a result of "nice 19" simply doing _different_
scheduling, possibly more batch-like, and as such those numbers cannot
sanely be compared to anything else.

(And if they _are_ comparable, then you should be able to get the good
numbers even without "nice 19". Quite frankly it sounds to me like the
whole chat benchmark is another "dbench", ie doing unbalanced scheduling
_helps_ it performance-wise, which implies that it's probably a bad
benchmark to look at numbers for).

Linus

2002-01-09 21:05:26

by Ingo Molnar

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Wed, 9 Jan 2002, Linus Torvalds wrote:

> Not a way in hell should "nice 19" cause the throughput to improve
> like that. It looks like this is a result of "nice 19" simply doing
> _different_ scheduling, possibly more batch-like, and as such those
> numbers cannot sanely be compared to anything else.

yes, this is what happens. The difference is that the load estimator
'punishes' tasks by giving them a lower priority, while the recalc-based method
gives a 'bonus'. If run with nice +19 then the process cannot be punished
anymore, all the tasks will run on the same priority level - and none can
cause a preemption of the other one. The priority limit is set right at
the nice +19 level.

is this an intended thing with nice +19 tasks? I think so, at least for
some usages. It could be fixed by adding some more priority space (+13
levels) they could explore into (but which couldnt be set as the default
priority). So by having a ceiling it really behaves differently, very
batch-like - but that's what such benchmarks are asking for anyway ... I
think it's an intended effect for CPU hogs as well - we do not want them
to preempt each other, they should each use up their timeslices fully and
roundrobin nicely.
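
The mechanics described here boil down to a strictly-less-than priority check
at wakeup time; roughly (a sketch, not the exact wakeup code):

/* A newly woken task preempts the running one only if its priority is
 * strictly better (numerically lower). Two tasks sitting at the same
 * nice +19 ceiling therefore never preempt each other and round-robin
 * on full timeslices. */
static inline void check_preempt(runqueue_t *rq, task_t *p)
{
        if (p->prio < rq->curr->prio)
                resched_task(rq->curr);
}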

> (And if they _are_ comparable, then you should be able to get the good
> numbers even without "nice 19". Quite frankly it sounds to me like the
> whole chat benchmark is another "dbench", ie doing unbalanced
> scheduling _helps_ it performance-wise, which implies that it's
> probably a bad benchmark to look at numbers for).

yes, agreed. It's not really unbalanced scheduling, the scheduler is still
fair. What doesnt happen is priority based preemption.

i think it could be a bonus to have such a scheduler mode - people dont
run shells at +19 niceness level, it's the known CPU hogs that get started
up with nice +19. It's a kind of SCHED_IDLE - everything can preempt it
and it will preempt nothing, without the priority inheritance problems of
SCHED_IDLE.

Ingo

2002-01-09 22:33:51

by Mark Hahn

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

> no wonder, it should be 'nice -n -20 vmstat -n 1'. And you should also do

I keep a suid setrealtime wrapper around (UNSAFE!) for this kind of use:

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sched.h>

int main(int argc, char *argv[]) {
        static struct sched_param sched_parms;
        int pid, wrapper=0;

        if (argc <= 1)
                return 1;

        pid = atoi(argv[1]);

        if (!pid || argc != 2) {
                wrapper = 1;
                pid = getpid();
        }

        sched_parms.sched_priority = sched_get_priority_min(SCHED_FIFO);
        if (sched_setscheduler(pid, SCHED_FIFO, &sched_parms) == -1) {
                perror("cannot set realtime scheduling policy");
                return 1;
        }
        if (wrapper) {
                setuid(getuid());
                execvp(argv[1],&argv[1]);
                perror("exec failed");
                return 1;
        }
        return 0;
}
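
Typical usage, assuming it is built as 'setrealtime' and installed setuid
root (the names here are only examples):

$ gcc -O2 -o setrealtime setrealtime.c
# chown root:root setrealtime; chmod u+s setrealtime
$ setrealtime 1234              # put an existing pid into SCHED_FIFO
$ setrealtime vmstat -n 1       # or run a command under SCHED_FIFO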

regards, mark hahn.

2002-01-09 23:15:54

by Anton Blanchard

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


> > With the patch things look much better (and the kernel boots on my
> > ppc64 machine :)
>
> hey it should not even compile, you forgot to send us the PPC definition
> of sched_find_first_zero_bit() ;-)

Good point, but it's ppc64 so the patch would include all of
include/asm-ppc64 and arch/ppc64 :)

I expect most architectures have a reasonably fast find_first_zero_bit
so they can simply do:

static inline int sched_find_first_zero_bit(unsigned long *bitmap)
{
        return find_first_zero_bit(bitmap, MAX_PRIO);
}

2002-01-10 01:09:59

by Richard Henderson

Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Thu, Jan 10, 2002 at 10:15:14AM +1100, Anton Blanchard wrote:
> I expect most architectures have a reasonably fast find_first_zero_bit
> so they can simply do:
>
> static inline int sched_find_first_zero_bit(unsigned long *bitmap)
> {
> return find_first_zero_bit(bitmap, MAX_PRIO);
> }

Careful. The following is really quite a bit better on Alpha:

static inline int
sched_find_first_zero_bit(unsigned long *bitmap)
{
        unsigned long b0 = bitmap[0];
        unsigned long b1 = bitmap[1];
        unsigned long b2 = bitmap[2];
        unsigned long ofs = MAX_RT_PRIO;

        if (unlikely(~(b0 & b1) != 0)) {
                b2 = (~b0 == 0 ? b0 : b1);
                ofs = (~b0 == 0 ? 0 : 64);
        }

        return ffz(b2) + ofs;
}

It compiles down to

	ldq $2,0($16)
	ldq $3,8($16)
	lda $5,128($31)
	ldq $0,16($16)
	and $2,$3,$1
	ornot $31,$2,$4
	ornot $31,$1,$1
	bne $1,$L8
$L2:
	ornot $31,$0,$0
	cttz $0,$0
	addl $0,$5,$0
	ret $31,($26),1
$L8:
	mov $2,$0
	cmpult $31,$4,$5
	cmovne $4,$3,$0
	sll $5,6,$5
	br $31,$L2

which is a fair bit better than find_first_zero_bit, if for no other
reason than that all the memory accesses are collected right up at the
beginning.

While we're on the subject of sched_find_first_zero_bit, I'd
like to complain about Ingo's choice of header file. Why in
the world did you choose mmu_context.h? Invent a new asm/sched.h
if you must, but please don't choose headers at random.


r~

2002-01-10 12:07:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Wed, 9 Jan 2002, Mark Hahn wrote:

> > no wonder, it should be 'nice -n -20 vmstat -n 1'. And you should also do
>
> I keep a suid setrealtime wrapper around (UNSAFE!) for this kind of use:

nice -20 is an equivalent but safe version of the same (if you use my
patches). I made priority levels -20 ... -16 'super-high priority',
i.e. such tasks never expire. (They can still drop above prio -16 if they
use up too much CPU time, so they cannot lock up systems accidentally the
way RT tasks can.) So it's in essence an 'admin priority', for
super-important shells. I'm using it with great success.
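
A minimal sketch of the 'never expires' effect described above, taking the
nice -20 ... -16 band exactly as stated (the dynamic-priority drop for
CPU-heavy tasks is not modelled here):

#include <stdio.h>

/* illustrative only: tasks in the 'admin' nice band go straight back onto
   the active array when their timeslice runs out, instead of waiting on
   the expired array - which is what "never expire" means here */
static const char *on_expiry(int nice)
{
	return nice <= -16 ? "re-enqueued on active (never expires)"
			   : "moved to expired";
}

int main(void)
{
	printf("nice -20 shell: %s\n", on_expiry(-20));
	printf("nice -16 task : %s\n", on_expiry(-16));
	printf("nice   0 task : %s\n", on_expiry(0));
	return 0;
}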

Ingo

2002-01-10 17:05:27

by Ivan Kokshaysky

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Wed, Jan 09, 2002 at 05:09:28PM -0800, Richard Henderson wrote:
> Careful. The following is really quite a bit better on Alpha:
>
> static inline int
> sched_find_first_zero_bit(unsigned long *bitmap)
> {
> unsigned long b0 = bitmap[0];
> unsigned long b1 = bitmap[1];
> unsigned long b2 = bitmap[2];
> unsigned long ofs = MAX_RT_PRIO;
>
> if (unlikely(~(b0 & b1) != 0)) {
> b2 = (~b0 == 0 ? b0 : b1);
> ofs = (~b0 == 0 ? 0 : 64);
> }
>
> return ffz(b2) + ofs;
> }

True. Minor correction:
- b2 = (~b0 == 0 ? b0 : b1);
- ofs = (~b0 == 0 ? 0 : 64);
+ b2 = (~b0 ? b0 : b1);
+ ofs = (~b0 ? 0 : 64);
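
A quick user-space check of the corrected condition - illustrative only:
ffz64() below is a stand-in for the kernel's ffz(), built on a GCC builtin,
and the bitmap is assumed to always contain at least one zero bit:

#include <assert.h>

#define MAX_RT_PRIO 128

static inline unsigned long ffz64(unsigned long word)
{
	/* index of the first zero bit; word must not be all ones */
	return __builtin_ctzl(~word);
}

static unsigned long find_first_zero_168(const unsigned long *bitmap)
{
	unsigned long b0 = bitmap[0], b1 = bitmap[1], b2 = bitmap[2];
	unsigned long ofs = MAX_RT_PRIO;

	if (~(b0 & b1) != 0) {		/* some zero bit below bit 128 */
		b2 = (~b0 ? b0 : b1);	/* pick b0 only if it actually has a zero */
		ofs = (~b0 ? 0 : 64);
	}
	return ffz64(b2) + ofs;
}

int main(void)
{
	/* first zero bit at position 70: b0 all ones, b1 missing bit 6 -
	   the uncorrected '~b0 == 0 ? b0 : b1' would have picked b0 here */
	unsigned long bm[3] = { ~0UL, ~(1UL << 6), ~0UL };

	assert(find_first_zero_168(bm) == 70);
	return 0;
}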

Note that the comment for this function is a bit confusing:
* ... It's the fastest
* way of searching a 168-bit bitmap where the first 128 bits are
* unlikely to be set.

s/set/cleared/

> While we're on the subject of sched_find_first_zero_bit, I'd
> like to complain about Ingo's choice of header file. Why in
> the world did you choose mmu_context.h? Invent a new asm/sched.h
> if you must, but please don't choose headers at random.

Agreed. Perhaps asm/bitops.h?

Ivan.

2002-01-10 18:51:09

by Mike Kravetz

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Wed, Jan 09, 2002 at 11:38:33AM -0800, Mike Kravetz wrote:
>
> I just kicked off another benchmark run to compare pre10, pre10 & G1
> patch, pre10 & Davide's patch.

It wasn't a good night for benchmarking. I had a typo in the
script to run chat reniced and as a result didn't collect any
numbers for this. In addition, the kernel with Davide's patch
failed to boot with 8 CPUs enabled. Can't see any '# CPU specific'
mods in the patch. In any case, here is what I do have.


--------------------------------------------------------------------
mkbench - Time how long it takes to compile the kernel. On this
8-CPU system we use 'make -j 8' and increase the number
of makes run in parallel. Result is average build time in
seconds. Lower is better.
--------------------------------------------------------------------
# CPUs   # Makes    pre10   pre10-G1   pre10-Davide
--------------------------------------------------------------------
   2         1        189      190         185
   2         2        370      376         362
   2         4        733      726*        717
   2         6       1102     1082*       1077
   4         1        101       99         101
   4         2        199      192         195
   4         4        387      382         381
   4         6        581      551         568
   8         1         58       56           -
   8         2        110      104           -
   8         4        214      204           -
   8         6        314      305           -

* Most likely statistically invalid results. I run these things 3 times
to make sure results are at least consistent. With pre10-G1, results
varied more than with the others. Items marked with * had extremely high
variation.

--------------------------------------------------------------------
Chat - VolanoMark simulator. Result is a measure of throughput.
Higher is better.
--------------------------------------------------------------------
Configuration Parms       # CPUs    pre10    pre10-G1   pre10-Davide
--------------------------------------------------------------------
10 rooms, 200 messages       2      143041    107718      181556
20 rooms, 200 messages       2      147335    147151      166048
30 rooms, 200 messages       2      179370    190413      173135
10 rooms, 200 messages       4      264033    287076      272597
20 rooms, 200 messages       4      243873    241855      273219
30 rooms, 200 messages       4      303228    301175      278513
10 rooms, 200 messages       8      304754    306891         -
20 rooms, 200 messages       8      241077    301414         -
30 rooms, 200 messages       8      309485    333660         -

--
Mike

2002-01-10 19:03:29

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Thu, 10 Jan 2002, Mike Kravetz wrote:

> On Wed, Jan 09, 2002 at 11:38:33AM -0800, Mike Kravetz wrote:
> >
> > I just kicked off another benchmark run to compare pre10, pre10 & G1
> > patch, pre10 & Davide's patch.
>
> It wasn't a good night for benchmarking. I had a typo in the
> script to run chat reniced and as a result didn't collect any
> numbers for this. In addition, the kernel with Davide's patch
> failed to boot with 8 CPUs enabled. Can't see any '# CPU specific'
> mods in the patch. In any case, here is what I do have.

Doh !! Do you have a panic dump Mike ?




- Davide


2002-01-10 19:10:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Thu, 10 Jan 2002, Davide Libenzi wrote:
> >
> > It wasn't a good night for benchmarking. I had a typo in the
> > script to run chat reniced and as a result didn't collect any
> > numbers for this. In addition, the kernel with Davide's patch
> > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific'
> > mods in the patch. In any case, here is what I do have.
>
> Doh !! Do you have a panic dump Mike ?

I bet it's just the placement of "init_idle()" in init/main.c, which is
unrelated to the scheduling proper, but if the kernel thread is started
before the boot CPU has done its "init_idle()", then the scheduler state
isn't really set up fully yet.

(Old bug, I think it's been there for a long time, I just think that the
old scheduler didn't much care, and the "child runs first" logic in
particular of the new scheduler probably just showed it more clearly)

Linus

2002-01-10 19:19:09

by Mike Kravetz

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Thu, Jan 10, 2002 at 11:08:21AM -0800, Davide Libenzi wrote:
> On Thu, 10 Jan 2002, Mike Kravetz wrote:
> > >
> > > I just kicked off another benchmark run to compare pre10, pre10 & G1
> > > patch, pre10 & Davide's patch.
> >
> > It wasn't a good night for benchmarking. I had a typo in the
> > script to run chat reniced and as a result didn't collect any
> > numbers for this. In addition, the kernel with Davide's patch
> > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific'
> > mods in the patch. In any case, here is what I do have.
>
> Doh !! Do you have a panic dump Mike ?

It didn't panic, but hung during the boot process. After
reading other mail, this may be caused by the out-of-order
locking bug/deadlock that existed in this version of the
O(1) scheduler. I may be able to try to verify that later today.
Right now the machine is being used for something else.

--
Mike

2002-01-10 20:00:32

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Thu, 10 Jan 2002, Mike Kravetz wrote:

> Right now the machine is being used for something else.

Do they know at IBM that you're using 8-way SMP systems to run
Counter-Strike servers ? :-)



- Davide


2002-01-10 20:44:41

by George Anzinger

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

Ivan Kokshaysky wrote:
>
> On Wed, Jan 09, 2002 at 05:09:28PM -0800, Richard Henderson wrote:
> > Careful. The following is really quite a bit better on Alpha:
> >
> > static inline int
> > sched_find_first_zero_bit(unsigned long *bitmap)
> > {
> > unsigned long b0 = bitmap[0];
> > unsigned long b1 = bitmap[1];
> > unsigned long b2 = bitmap[2];
> > unsigned long ofs = MAX_RT_PRIO;
> >
> > if (unlikely(~(b0 & b1) != 0)) {
> > b2 = (~b0 == 0 ? b0 : b1);
> > ofs = (~b0 == 0 ? 0 : 64);
> > }
> >
> > return ffz(b2) + ofs;
> > }
>
> True. Minor correction:
> - b2 = (~b0 == 0 ? b0 : b1);
> - ofs = (~b0 == 0 ? 0 : 64);
> + b2 = (~b0 ? b0 : b1);
> + ofs = (~b0 ? 0 : 64);
>
> Note that comment for this function is a bit confusing:
> * ... It's the fastest
> * way of searching a 168-bit bitmap where the first 128 bits are
> * unlikely to be set.

What if we want a 2048-bit bitmap???
>
> s/set/cleared/
>
> > While we're on the subject of sched_find_first_zero_bit, I'd
> > like to complain about Ingo's choice of header file. Why in
> > the world did you choose mmu_context.h? Invent a new asm/sched.h
> > if you must, but please don't choose headers at random.
>
> Agreed. Apparently asm/bitops.h?
>
> Ivan.

--
George [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/

2002-01-10 21:04:03

by Davide Libenzi

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Thu, 10 Jan 2002, Linus Torvalds wrote:

>
> On Thu, 10 Jan 2002, Davide Libenzi wrote:
> > >
> > > It wasn't a good night for benchmarking. I had a typo in the
> > > script to run chat reniced and as a result didn't collect any
> > > numbers for this. In addition, the kernel with Davide's patch
> > > failed to boot with 8 CPUs enabled. Can't see any '# CPU specific'
> > > mods in the patch. In any case, here is what I do have.
> >
> > Doh !! Do you have a panic dump Mike ?
>
> I bet it's just the placement of "init_idle()" in init/main.c, which is
> unrelated to the scheduling proper, but if the kernel thread is started
> before the boot CPU has done its "init_idle()", then the scheduler state
> isn't really set up fully yet.
>
> (Old bug, I think it's been there for a long time, I just think that the
> old scheduler didn't much care, and the "child runs first" logic in
> particular of the new scheduler probably just showed it more clearly)

Uhm, it seems fixed in pre11. Did you fix it in the pre10->pre11 stage ?



- Davide


2002-01-10 21:59:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17


On Thu, 10 Jan 2002, Ivan Kokshaysky wrote:

> Note that comment for this function is a bit confusing:
> * ... It's the fastest
> * way of searching a 168-bit bitmap where the first 128 bits are
> * unlikely to be set.
>
> s/set/cleared/

no, it's really 'cleared'. The bits are inverted right now.

Ingo

2002-01-11 12:04:40

by Mike Kravetz

[permalink] [raw]
Subject: Re: [patch] O(1) scheduler, -D1, 2.5.2-pre9, 2.4.17

On Wed, Jan 09, 2002 at 11:25:43AM +0100, Ingo Molnar wrote:
>
> On Tue, 8 Jan 2002, Mike Kravetz wrote:
>
> > --------------------------------------------------------------------
> > Chat - VolanoMark simulator. Result is a measure of throughput.
> > Higher is better.
>
> very interesting numbers, nice work Mike! I'd suggest the following
> additional test: please also run tests like VolanoMark with 'nice -n 19'.
> The O(1) scheduler's task-penalty method works in our favor in this case,
> since we know the test is CPU-bound we can move all processes to nice
> level 19.
>
> Ingo

I'll do that in the next go around. Right now, I'm trying to get some
TPC-H results.

--
Mike