2002-11-06 16:28:28

by Erich Focht

Subject: NUMA scheduler BK tree

Michael, Martin,

in order to make it easier to keep up with the main Linux tree I've
set up a bitkeeper repository with our NUMA scheduler at
bk://numa-ef.bkbits.net/numa-sched
(Web view: http://numa-ef.bkbits.net/)
This used to contain my node affine NUMA scheduler; I'll add extra
trees when the additional patches for that are tested on top of our
NUMA scheduler.

Is it ok for you to have it this way or would you prefer having the
core and the initial load balancer separate?

The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
I'll try to keep it that way.

Regards,
Erich


2002-11-06 18:08:33

by Michael Hohnbaum

Subject: Re: NUMA scheduler BK tree

On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> Michael, Martin,
>
> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
> bk://numa-ef.bkbits.net/numa-sched
> (Web view: http://numa-ef.bkbits.net/)
> This used to contain my node affine NUMA scheduler; I'll add extra
> trees when the additional patches for that are tested on top of our
> NUMA scheduler.
>
> Is it ok for you to have it this way or would you prefer having the
> core and the initial load balancer separate?
>
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.

Erich,

This is fine with me. Can't the core changes and load
balancer be maintained as separate changesets within the bk
tree?

Michael

> Regards,
> Erich
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-11-07 22:59:23

by Erich Focht

Subject: Re: NUMA scheduler BK tree

On Wednesday 06 November 2002 19:10, Michael Hohnbaum wrote:
> > Is it ok for you to have it this way or would you prefer having the
> > core and the initial load balancer separate?
>
> This is fine with me. Can't the core changes and load
> balancer be maintained as separate changesets within the bk
> tree?

OK, I'll do that. Any idea how I can apply a changeset which has
a name other than mine attached to it?

Regards,
Erich

2002-11-07 23:41:07

by Michael Hohnbaum

Subject: Re: NUMA scheduler BK tree

On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> Michael, Martin,
>
> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
> bk://numa-ef.bkbits.net/numa-sched
> (Web view: http://numa-ef.bkbits.net/)
>
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep so.
>
Tested this on a 4 node NUMAQ. Worked fine. Results:

$reportbench stock46 sched46
Kernbench:
Elapsed User System CPU
stock46 20.66s 194.062s 53.39s 1197.4%
sched46 19.988s 191.302s 50.692s 1210.4%

Schedbench 4:
Elapsed TotalUser TotalSys AvgUser
stock46 27.27 40.64 109.13 0.85
sched46 23.10 41.32 92.42 0.76

Schedbench 8:
Elapsed TotalUser TotalSys AvgUser
stock46 39.18 55.12 313.56 1.68
sched46 34.45 51.24 275.63 2.28

Schedbench 16:
Elapsed TotalUser TotalSys AvgUser
stock46 56.39 72.44 902.45 5.12
sched46 56.73 71.31 907.88 4.19

Schedbench 32:
Elapsed TotalUser TotalSys AvgUser
stock46 90.47 203.28 2895.41 10.39
sched46 60.95 143.21 1950.72 10.31

Schedbench 64:
Elapsed TotalUser TotalSys AvgUser
stock46 105.00 439.04 6720.90 25.02
sched46 59.07 262.98 3781.06 19.59

The schedbench runs were run once each. Kernbench is the average of
5 runs.

Michael

> Regards,
> Erich
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-11-08 16:51:18

by Erich Focht

Subject: Re: NUMA scheduler BK tree

Michael,

thanks for the results, they look really good!

There's a problem with your script for Schedbench: the column labels
are permuted! Instead of
> Elapsed TotalUser TotalSys AvgUser
you should have
> AvgUser Elapsed TotalUser TotalSys

The numbers make more sense like that ;-)
Could you please send me your script for schedbench?

BTW: the bk repository now has two distinct changesets, one for the
"core" and the other for the initial balancing. Larry McVoy showed me
how to do things right with the names assigned to the changesets.

Thanks!

Regards,
Erich


On Friday 08 November 2002 00:46, Michael Hohnbaum wrote:
> On Wed, 2002-11-06 at 08:34, Erich Focht wrote:
> > in order to make it easier to keep up with the main Linux tree I've
> > set up a bitkeeper repository with our NUMA scheduler at
> > bk://numa-ef.bkbits.net/numa-sched
> > (Web view: http://numa-ef.bkbits.net/)
...
> Tested this on a 4 node NUMAQ. Worked fine. Results:
>
> $reportbench stock46 sched46
> Kernbench:
> Elapsed User System CPU
> stock46 20.66s 194.062s 53.39s 1197.4%
> sched46 19.988s 191.302s 50.692s 1210.4%
>
> Schedbench 4:
> Elapsed TotalUser TotalSys AvgUser
> stock46 27.27 40.64 109.13 0.85
> sched46 23.10 41.32 92.42 0.76
>
> Schedbench 8:
> Elapsed TotalUser TotalSys AvgUser
> stock46 39.18 55.12 313.56 1.68
> sched46 34.45 51.24 275.63 2.28
>
> Schedbench 16:
> Elapsed TotalUser TotalSys AvgUser
> stock46 56.39 72.44 902.45 5.12
> sched46 56.73 71.31 907.88 4.19
>
> Schedbench 32:
> Elapsed TotalUser TotalSys AvgUser
> stock46 90.47 203.28 2895.41 10.39
> sched46 60.95 143.21 1950.72 10.31
>
> Schedbench 64:
> Elapsed TotalUser TotalSys AvgUser
> stock46 105.00 439.04 6720.90 25.02
> sched46 59.07 262.98 3781.06 19.59
>
> The schedbench runs were run once each. Kernbench is the average of
> 5 runs.
>
> Michael

2002-11-11 15:07:23

by Erich Focht

Subject: [PATCH 2.5.47] NUMA scheduler (1/2)

Hi,

here come the NUMA scheduler patches for 2.5.47 which emerged from
Michael's and my work on NUMA schedulers. As usual in two parts:

01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer

02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
least loaded node & CPU at exec()
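
To illustrate what "initial load balancing at exec()" means, here is a
minimal sketch of the idea. It is not the code in the patch: node_load()
and best_cpu_in_node() are assumed helper names, and a real implementation
would restore the task's original cpus_allowed mask after the move.

/* Sketch only: pick the least loaded node, then a CPU in it, at exec() time. */
void sched_balance_exec(void)
{
        int n, cpu, best_node = 0;
        unsigned long load, min_load = ~0UL;

        for (n = 0; n < numnodes; n++) {
                load = node_load(n);           /* assumed helper: sum of nr_running on node n */
                if (load < min_load) {
                        min_load = load;
                        best_node = n;
                }
        }
        cpu = best_cpu_in_node(best_node);     /* assumed helper: least loaded CPU in that node */
        set_cpus_allowed(current, 1UL << cpu); /* a real version restores the old mask afterwards */
}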

An up-to-date BK tree containing the whole NUMA scheduler can be found
at bk://numa-ef.bkbits.net/numa-sched .

Regards,
Erich


Attachments:
01-numa-sched-core-2.5.47-21.patch (12.27 kB)

2002-11-11 15:08:40

by Erich Focht

Subject: [PATCH 2.5.47] NUMA scheduler (2/2)

And here's the second part.

Erich

On Monday 11 November 2002 16:13, Erich Focht wrote:
> here come the NUMA scheduler patches for 2.5.47 which emerged from
> Michael's and my work on NUMA schedulers. As usual in two parts:
>
> 01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer
>
> 02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
> least loaded node & CPU at exec()


Attachments:
02-numa-sched-ilb-2.5.47-21.patch (4.55 kB)

2002-11-12 00:19:38

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.47] NUMA scheduler (1/2)

On Mon, 2002-11-11 at 07:13, Erich Focht wrote:
> Hi,
>
> here come the NUMA scheduler patches for 2.5.47 which emerged from
> Michael's and my work on NUMA schedulers. As usual in two parts:
>
> 01-numa-sched-core-2.5.47-21.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer
>
> 02-numa-sched-ilb-2.5.47-21.patch: initial load balancing, selects
> least loaded node & CPU at exec()

Built, booted, tested on NUMAQ. Results:

$ reportbench sched47 stock47
Kernbench:
Elapsed User System CPU
sched47 20.39s 191.824s 50.358s 1187.6%
stock47 20.608s 194.536s 53s 1201.4%

Schedbench 4:
Elapsed TotalUser TotalSys AvgUser
sched47 22.45 34.79 89.84 0.69
stock47 29.05 42.58 116.25 0.94

Schedbench 8:
Elapsed TotalUser TotalSys AvgUser
sched47 39.58 60.98 316.72 1.78
stock47 43.30 56.84 346.49 2.44

Schedbench 16:
Elapsed TotalUser TotalSys AvgUser
sched47 58.83 72.58 941.56 5.30
stock47 60.76 82.05 972.31 4.82

Schedbench 32:
Elapsed TotalUser TotalSys AvgUser
sched47 55.87 122.75 1788.32 11.02
stock47 77.93 179.68 2494.06 11.35

Schedbench 64:
Elapsed TotalUser TotalSys AvgUser
sched47 85.77 360.49 5490.28 21.13
stock47 94.31 398.85 6036.95 23.47


Michael

> An up-to-date BK tree containing the whole NUMA scheduler can be found
> at bk://numa-ef.bkbits.net/numa-sched .
>
> Regards,
> Erich

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-11-18 18:39:15

by Martin J. Bligh

Subject: Re: NUMA scheduler BK tree

> in order to make it easier to keep up with the main Linux tree I've
> set up a bitkeeper repository with our NUMA scheduler at
> bk://numa-ef.bkbits.net/numa-sched
> (Web view: http://numa-ef.bkbits.net/)
> This used to contain my node affine NUMA scheduler; I'll add extra
> trees when the additional patches for that are tested on top of our
> NUMA scheduler.
>
> Is it ok for you to have it this way or would you prefer having the
> core and the initial load balancer separate?
>
> The tree is currently in sync with bk://linux.bkbits.net/linux-2.5 and
> I'll try to keep it that way.

BTW, can you keep producing normal patches too, when you do an update?
I don't use bitkeeper ...

Thanks,

Martin.


2002-11-19 16:20:51

by Erich Focht

Subject: [PATCH 2.5.48] NUMA scheduler (1/2)

As requested in another email, I attach the 2.5.48 patches for the
NUMA scheduler which emerged from Michael's and my work:

01-numa-sched-core-2.5.48-22.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer. Rediffed and fixed one
external declaration.

02-numa-sched-ilb-2.5.48-21.patch: initial load balancing, selects
least loaded node & CPU at exec(). Rediffed.

We are curious about any benchmark results and, of course, about running
on platforms other than NUMAQ & Azusa/TX7, too.

Regards,
Erich

On Monday 18 November 2002 20:40, Martin J. Bligh wrote:
>
> BTW, can you keep producing normal patches too, when you do an update?
> I don't use bitkeeper ...


Attachments:
01-numa-sched-core-2.5.48-22.patch (12.28 kB)

2002-11-19 16:21:51

by Erich Focht

Subject: [PATCH 2.5.48] NUMA scheduler (2/2)

And here is the second patch...

Erich

On Tuesday 19 November 2002 17:26, Erich Focht wrote:
> As requested in another email, I attach the 2.5.48 patches for the
> NUMA scheduler which emerged from Michael's and my work:
>
> 01-numa-sched-core-2.5.48-22.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer. Rediffed and fixed one
> external declaration.
>
> 02-numa-sched-ilb-2.5.48-21.patch: initial load balancing, selects
> least loaded node & CPU at exec(). Rediffed.
>
> We are curious about any benchmark results and, of course, about running
> on platforms other than NUMAQ & Azusa/TX7, too.
>
> Regards,
> Erich
>
> On Monday 18 November 2002 20:40, Martin J. Bligh wrote:
> > BTW, can you keep producing normal patches too, when you do an update?
> > I don't use bitkeeper ...


Attachments:
02-numa-sched-ilb-2.5.48-21.patch (4.55 kB)

2002-12-02 15:23:34

by Erich Focht

Subject: [PATCH 2.5.50] NUMA scheduler (1/2)

Here come the NUMA scheduler patches rediffed for 2.5.50. No
functional changes since last version (
http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).

Regards,
Erich


Attachments:
01-numa-sched-core-2.5.50-22.patch (12.38 kB)

2002-12-02 15:23:46

by Erich Focht

Subject: [PATCH 2.5.50] NUMA scheduler (2/2)

... and the second part ...

On Monday 02 December 2002 16:29, Erich Focht wrote:
> Here come the NUMA scheduler patches rediffed for 2.5.50. No
> functional changes since last version (
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).
>
> Regards,
> Erich


Attachments:
02-numa-sched-ilb-2.5.50-21.patch (4.64 kB)

2002-12-06 17:30:42

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.50] NUMA scheduler (1/2)

On Mon, 2002-12-02 at 07:29, Erich Focht wrote:
> Here come the NUMA scheduler patches rediffed for 2.5.50. No
> functional changes since last version (
> http://marc.theaimsgroup.com/?l=linux-kernel&m=103772346430798&w=2 ).

Tested on NUMAQ. Ran into the problem of the missing CPU stats from
/proc/self/cpu, so initial test results were worthless. Applied Erich's
patch that restored this information and reran on NUMAQ. As with previous
versions, the result was a modest performance gain on kernbench and a more
significant performance gain on Erich's memory-intensive test (fondly
referred to here as schedbench).

Kernbench:
Elapsed User System CPU
stock50a 20.92s 194.12s 53.15s 1182%
sched50 20.002s 191.976s 51.17s 1215%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
stock50a 29.50 43.11 118.02 0.83
sched50 33.93 47.17 135.76 0.74

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
stock50a 46.88 62.51 375.12 1.78
sched50 33.90 47.31 271.30 1.98

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
stock50a 56.26 71.17 900.32 6.23
sched50 57.22 73.34 915.76 4.53

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
stock50a 84.86 191.48 2715.76 10.93
sched50 57.36 130.88 1835.76 10.46

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
stock50a 114.91 474.55 7355.45 24.95
sched50 80.20 338.04 5133.56 22.70

In other testing we are seeing unexpectedly high idle times. Andrea has
patches against the 2.4 port of the O(1) scheduler that are supposed to help
with this. I plan to try those out to see if they reduce our idle time.

Michael

>
> Regards,
> Erich
> ----

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2002-12-18 16:15:03

by Erich Focht

Subject: [PATCH 2.5.52] NUMA scheduler (2/2)

And here comes patch #2, work done by Michael Hohnbaum.

On Wednesday 18 December 2002 17:21, Erich Focht wrote:
> This is the first part of the minimal NUMA scheduler built on top of
> the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
> bugfix in build_node_data().
>
> The two patches are:
>
> 01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer.
>
> 02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> Regards,
> Erich


Attachments:
02-numa-sched-ilb-2.5.52-21.patch (4.64 kB)

2002-12-18 16:13:44

by Erich Focht

Subject: [PATCH 2.5.52] NUMA scheduler (1/2)

This is the first part of the minimal NUMA scheduler built on top of
the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
bugfix in build_node_data().

The two patches are:

01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer.

02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
least loaded node & CPU at exec().

Regards,
Erich


Attachments:
01-numa-sched-core-2.5.52-23.patch (12.41 kB)

2002-12-20 14:41:03

by Erich Focht

Subject: [PATCH 2.5.52] NUMA scheduler: cputimes stats

The attached patch is needed for evaluating and understanding
scheduler and load balancer activity. It adds back to the kernel per
CPU user and system time statistics for each process in /proc/PID/cpu,
a feature which had been around for ages and was lost in 2.5.50.
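
As a rough sketch of what such a /proc/PID/cpu read handler looks like
(illustrative only, not the attached patch; the per_cpu_utime/per_cpu_stime
field names and the output format are assumptions, values in jiffies):

int proc_pid_cpu(struct task_struct *task, char *buffer)
{
        int i, len;

        /* totals first, then one line per online CPU */
        len = sprintf(buffer, "cpu  %lu %lu\n", task->utime, task->stime);
        for (i = 0; i < NR_CPUS; i++) {
                if (!cpu_online(i))
                        continue;
                len += sprintf(buffer + len, "cpu%d %lu %lu\n", i,
                               task->per_cpu_utime[i], task->per_cpu_stime[i]);
        }
        return len;
}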

Regards,
Erich

On Wednesday 18 December 2002 17:21, Erich Focht wrote:
> This is the first part of the minimal NUMA scheduler built on top of
> the O(1) load balancer. It is rediffed for 2.5.52 and contains a small
> bugfix in build_node_data().
>
> The two patches are:
>
> 01-numa-sched-core-2.5.52-23.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer.
>
> 02-numa-sched-ilb-2.5.52-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> Regards,
> Erich


Attachments:
03-cputimes_stat-2.5.52.patch (3.24 kB)

2002-12-20 15:09:32

by Christoph Hellwig

Subject: Re: [PATCH 2.5.52] NUMA scheduler (1/2)


diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
--- a/arch/i386/kernel/smpboot.c 2002-12-16 03:07:56.000000000 +0100
+++ b/arch/i386/kernel/smpboot.c 2002-12-18 16:53:12.000000000 +0100
@@ -1202,6 +1202,9 @@
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
+#ifdef CONFIG_NUMA
+ build_node_data();
+#endif

I think it would be much nicer if you had a proper stub for !CONFIG_NUMA..
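
Something along these lines (illustrative sketch only, not taken from the
patch) would keep the call site free of #ifdefs:

#ifdef CONFIG_NUMA
extern void build_node_data(void);
#else
#define build_node_data() do { } while (0)
#endif

so that smp_cpus_done() can call build_node_data() unconditionally.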

@@ -20,6 +20,7 @@
#include <asm/mmu.h>

#include <linux/smp.h>
+#include <asm/topology.h>

Another header in sched.h?? And it doesn't look like it's used at all.

#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -446,6 +447,9 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif

+#ifdef CONFIG_NUMA
+extern void build_node_data(void);
+#endif

The ifdef is superfluous.

diff -urN a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2002-12-16 03:08:14.000000000 +0100
+++ b/kernel/sched.c 2002-12-18 16:53:13.000000000 +0100
@@ -158,6 +158,10 @@
struct list_head migration_queue;

atomic_t nr_iowait;
+
+ unsigned long wait_time;
+ int wait_node;
+

Here OTOH a #if CONFIG_NUMA could help to avoid a little bit of bloat.

} ____cacheline_aligned;

+#define cpu_to_node(cpu) __cpu_to_node(cpu)

I wonder why we don't have a proper cpu_to_node() yet, but as long
as it doesn't exist please use __cpu_to_node() directly.

+#define LOADSCALE 128

Any explanation?

+#ifdef CONFIG_NUMA

sched.c uses #if CONFIG_FOO, not #ifdef CONFIG_FOO, it would be cool
if you could follow the style of existing source files.

+/* Number of CPUs per node: sane values until all CPUs are up */
+int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_ncpus(node) _node_nr_cpus[node]

Parametrized macros and variables aren't in the same namespace, what about
just node_nr_cpus for the macro, too. And should these be static?

+
+#define NODE_DELAY_IDLE (1*HZ/1000)
+#define NODE_DELAY_BUSY (20*HZ/1000)

Comments, please..

+/* the following macro implies that logical CPUs are sorted by node number */
+#define loop_over_node(cpu,node) \
+ for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)

Move to asm/topology.h?

+ ptr=0;
+ for (n=0; n<numnodes; n++) {

You need to add lots of space to match Documentation/CodingStyle.. :)

+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ if (!cpu_online(cpu)) continue;

And linebreaks..

Btw, what about a for_each_cpu that has the cpu_online check in topology.h?
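
For illustration, such a helper could look like this (name and exact form
are only a suggestion, nothing of this sort exists in the patch):

#define for_each_online_cpu(cpu)                          \
        for ((cpu) = 0; (cpu) < NR_CPUS; (cpu)++)         \
                if (cpu_online(cpu))

used as for_each_online_cpu(cpu) { ... } with a braced body, so the
cpu_online() check isn't repeated at every call site.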

+ /* Wait longer before stealing if own node's load is average. */
+ if (NODES_BALANCED(avg_load,nd[this_node].average_load))

Shouldn't NODES_BALANCED shout less and be an inline called nodes_balanced?
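
E.g. something like (sketch only, same check as the macro):

static inline int nodes_balanced(int comp_load, int this_load)
{
        return (comp_load - this_load) < LOADSCALE / 2;
}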

+ this_rq->wait_node = busiest_node;
+ this_rq->wait_time = jiffies;
+ goto out;
+ } else
+ /* old most loaded node: check if waited enough */
+ if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;

That indentation looks really strange, why not just

/* old most loaded node: check if waited enough */
} else if (jiffies - this_rq->wait_time < steal_delay)
goto out;

+
+ if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {

Ditto, the name shouts a bit too much

+#endif

#endif /* CONFIG_NUMA */

-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)

Any explanation why you changed that constant?

p = req->task;
- cpu_dest = __ffs(p->cpus_allowed);
+ cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
+

This looks like a bugfix valid without the rest of the patch.

2002-12-20 17:36:23

by Erich Focht

Subject: Re: [PATCH 2.5.52] NUMA scheduler (1/2)

Hi Christoph,

thanks for the comments. I'll try to fix the things you mentioned when
preparing the next patch and add some more comments.

Regards,
Erich

On Friday 20 December 2002 16:17, Christoph Hellwig wrote:
> diff -urN a/arch/i386/kernel/smpboot.c b/arch/i386/kernel/smpboot.c
> --- a/arch/i386/kernel/smpboot.c 2002-12-16 03:07:56.000000000 +0100
> +++ b/arch/i386/kernel/smpboot.c 2002-12-18 16:53:12.000000000 +0100
> @@ -1202,6 +1202,9 @@
> void __init smp_cpus_done(unsigned int max_cpus)
> {
> zap_low_mappings();
> +#ifdef CONFIG_NUMA
> + build_node_data();
> +#endif
>
> I think it would be much nicer if you had a proper stub for !CONFIG_NUMA..
>
> @@ -20,6 +20,7 @@
> #include <asm/mmu.h>
>
> #include <linux/smp.h>
> +#include <asm/topology.h>
>
> Another header in sched.h?? And it doesn't look like it's used at all.
>
> #include <linux/sem.h>
> #include <linux/signal.h>
> #include <linux/securebits.h>
> @@ -446,6 +447,9 @@
> # define set_cpus_allowed(p, new_mask) do { } while (0)
> #endif
>
> +#ifdef CONFIG_NUMA
> +extern void build_node_data(void);
> +#endif
>
> The ifdef is superfluous.
>
> diff -urN a/kernel/sched.c b/kernel/sched.c
> --- a/kernel/sched.c 2002-12-16 03:08:14.000000000 +0100
> +++ b/kernel/sched.c 2002-12-18 16:53:13.000000000 +0100
> @@ -158,6 +158,10 @@
> struct list_head migration_queue;
>
> atomic_t nr_iowait;
> +
> + unsigned long wait_time;
> + int wait_node;
> +
>
> Here OTOH a #if CONFIG_NUMA could help to avoid a little bit of bloat.
>
> } ____cacheline_aligned;
>
> +#define cpu_to_node(cpu) __cpu_to_node(cpu)
>
> I wonder why we don't have a proper cpu_to_node() yet, but as long
> as it doesn't exist please use __cpu_to_node() directly.
>
> +#define LOADSCALE 128
>
> Any explanation?
>
> +#ifdef CONFIG_NUMA
>
> sched.c uses #if CONFIG_FOO, not #ifdef CONFIG_FOO, it would be cool
> if you could follow the style of existing source files.
>
> +/* Number of CPUs per node: sane values until all CPUs are up */
> +int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
> +int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
> +#define node_ncpus(node) _node_nr_cpus[node]
>
> Parametrized macros and variables aren't in the same namespace, what about
> just node_nr_cpus for the macro, too. And should these be static?
>
> +
> +#define NODE_DELAY_IDLE (1*HZ/1000)
> +#define NODE_DELAY_BUSY (20*HZ/1000)
>
> Comments, please..
>
> +/* the following macro implies that logical CPUs are sorted by node number */
> +#define loop_over_node(cpu,node) \
> + for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
>
> Move to asm/topology.h?
>
> + ptr=0;
> + for (n=0; n<numnodes; n++) {
>
> You need to add lots of space to match Documentation/CodingStyle.. :)
>
> + for (cpu = 0; cpu < NR_CPUS; cpu++) {
> + if (!cpu_online(cpu)) continue;
>
> And linebreaks..
>
> Btw, what about a for_each_cpu that has the cpu_online check in topology.h?
>
> + /* Wait longer before stealing if own node's load is average. */
> + if (NODES_BALANCED(avg_load,nd[this_node].average_load))
>
> Shouldn't NODES_BALANCED shout less and be an inline called nodes_balanced?
>
> + this_rq->wait_node = busiest_node;
> + this_rq->wait_time = jiffies;
> + goto out;
> + } else
> + /* old most loaded node: check if waited enough */
> + if (jiffies - this_rq->wait_time < steal_delay)
> + goto out;
>
> That indentation looks really strange, why not just
>
> /* old most loaded node: check if waited enough */
> } else if (jiffies - this_rq->wait_time < steal_delay)
> goto out;
>
> +
> + if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
>
> Ditto, the name shouts a bit too much
>
> +#endif
>
> #endif /* CONFIG_NUMA */
>
> -#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
> +#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
>
> Any explanation why you changed that constant?
>
> p = req->task;
> - cpu_dest = __ffs(p->cpus_allowed);
> + cpu_dest = __ffs(p->cpus_allowed & cpu_online_map);
> +
>
> This looks like a bugfix valid without the rest of the patch.

2002-12-31 13:20:42

by Erich Focht

Subject: [PATCH 2.5.53] NUMA scheduler (1/3)

Here comes the minimal NUMA scheduler built on top of the O(1) load
balancer rediffed for 2.5.53 with some changes in the core part. As
suggested by Michael, I added the cputimes_stat patch, as it is
absolutely needed for understanding the scheduler behavior.

The three patches:
01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
providing a node aware load_balancer. Cosmetic changes + more comments.

02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
least loaded node & CPU at exec().

03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
and system time statistics for each process in /proc/PID/cpu. Needed
for evaluating scheduler behavior and performance of tasks running
on SMP and NUMA systems.

Regards,
Erich


Attachments:
01-numa-sched-core-2.5.53-24.patch (12.17 kB)

2002-12-31 13:24:33

by Erich Focht

Subject: [PATCH 2.5.53] NUMA scheduler (3/3)

... and the third patch ...

On Tuesday 31 December 2002 14:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.
>
> The three patches:
> 01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer. Cosmetic changes + more comments.
>
> 02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> 03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
> and system time statistics for each process in /proc/PID/cpu. Needed
> for evaluating scheduler behavior and performance of tasks running
> on SMP and NUMA systems.
>
> Regards,
> Erich


Attachments:
03-cputimes_stat-2.5.53.patch (3.24 kB)

2002-12-31 13:21:59

by Erich Focht

Subject: [PATCH 2.5.53] NUMA scheduler (2/3)

... the second patch ...

On Tuesday 31 December 2002 14:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.
>
> The three patches:
> 01-numa-sched-core-2.5.53-24.patch: core NUMA scheduler infrastructure
> providing a node aware load_balancer. Cosmetic changes + more comments.
>
> 02-numa-sched-ilb-2.5.53-21.patch: initial load balancing, selects
> least loaded node & CPU at exec().
>
> 03-cputimes_stat-2.5.53.patch: adds back to the kernel per CPU user
> and system time statistics for each process in /proc/PID/cpu. Needed
> for evaluating scheduler behavior and performance of tasks running
> on SMP and NUMA systems.
>
> Regards,
> Erich


Attachments:
02-numa-sched-ilb-2.5.53-21.patch (4.76 kB)

2003-01-04 01:50:53

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.53] NUMA scheduler (1/3)

On Tue, 2002-12-31 at 05:29, Erich Focht wrote:
> Here comes the minimal NUMA scheduler built on top of the O(1) load
> balancer rediffed for 2.5.53 with some changes in the core part. As
> suggested by Michael, I added the cputimes_stat patch, as it is
> absolutely needed for understanding the scheduler behavior.

Thanks for this latest patch. I've managed to cobble together
a 4 node NUMAQ system (16 x 700 MHz PIII procs, 16GB memory) and
ran kernbench and schedbench on this, along with the 2.5.50 and
2.5.52 versions. Results remain fairly consistent with what
we have been obtaining on earlier versions.

Kernbench:
Elapsed User System CPU
sched50 29.96s 288.308s 83.606s 1240.8%
sched52 29.836s 285.832s 84.464s 1240.4%
sched53 29.364s 284.808s 83.174s 1252.6%
stock50 31.074s 303.664s 89.194s 1264.2%
stock53 31.204s 306.224s 87.776s 1263.2%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
sched50 22.00 35.50 88.04 0.86
sched52 22.04 34.77 88.18 0.75
sched53 22.16 34.39 88.66 0.74
stock50 27.18 45.99 108.75 0.87
stock53 0.00 41.96 133.85 1.07

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
sched50 32.05 46.81 256.45 1.81
sched52 32.75 50.33 262.07 1.63
sched53 30.56 46.58 244.59 1.67
stock50 44.75 63.68 358.09 2.55
stock53 0.00 59.64 318.63 2.40

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
sched50 55.34 70.74 885.61 4.09
sched52 50.64 62.30 810.50 4.31
sched53 54.04 70.01 864.84 3.68
stock50 65.36 80.72 1045.94 5.92
stock53 0.00 78.01 947.06 6.70

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
sched50 59.47 139.77 1903.37 10.24
sched52 58.37 132.90 1868.30 7.45
sched53 57.35 132.30 1835.49 9.29
stock50 84.51 194.89 2704.83 12.44
stock53 0.00 182.95 2612.83 12.46

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
sched50 63.55 276.81 4067.65 21.58
sched52 66.75 293.31 4272.72 21.06
sched53 62.38 276.99 3992.97 19.90
stock50 99.68 422.06 6380.04 25.92
stock53 0.00 441.42 6723.01 26.83

Note that the 0.00 in the AvgUser column for stock53 was due to
my not applying the cputime patch (03-cputimes_stat-2.5.53.patch).
Not sure how I managed to forget that one. I'll rerun this on
a kernel with that patch on Monday.

--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-05 05:27:02

by Martin J. Bligh

Subject: Re: [PATCH 2.5.53] NUMA scheduler (1/3)

>> Here comes the minimal NUMA scheduler built on top of the O(1) load
>> balancer rediffed for 2.5.53 with some changes in the core part. As
>> suggested by Michael, I added the cputimes_stat patch, as it is
>> absolutely needed for understanding the scheduler behavior.
>
> Thanks for this latest patch. I've managed to cobble together
> a 4 node NUMAQ system (16 x 700 MHz PIII procs, 16GB memory) and
> ran kernbench and schedbench on this, along with the 2.5.50 and
> 2.5.52 versions. Results remain fairly consistent with what
> we have been obtaining on earlier versions.
>
> Kernbench:
> Elapsed User System CPU
> sched50 29.96s 288.308s 83.606s 1240.8%
> sched52 29.836s 285.832s 84.464s 1240.4%
> sched53 29.364s 284.808s 83.174s 1252.6%
> stock50 31.074s 303.664s 89.194s 1264.2%
> stock53 31.204s 306.224s 87.776s 1263.2%

Not sure what you're correlating here because your rows are all named
the same thing. However, the new version seems to be much slower
on systime (about 7-8% for me), which roughly correlates with your
last two rows above. Me no like.

Erich, can you separate out the cleanups from the functional changes?
The cleanups look cool. The only thing that I can see easily was
BUSY_REBALANCE_TICK - I tried changing that back to the old version
but it made no difference. Here's a diff going from your old scheduler
(that I've been keeping update in my tree) to your new one on 2.5.54.

M.

diff -urpN -X /home/fletch/.diff.exclude oldsched/arch/i386/kernel/smpboot.c newsched/arch/i386/kernel/smpboot.c
--- oldsched/arch/i386/kernel/smpboot.c Sat Jan 4 14:57:43 2003
+++ newsched/arch/i386/kernel/smpboot.c Sat Jan 4 14:58:25 2003
@@ -1198,9 +1198,7 @@ int __devinit __cpu_up(unsigned int cpu)
void __init smp_cpus_done(unsigned int max_cpus)
{
zap_low_mappings();
-#ifdef CONFIG_NUMA
build_node_data();
-#endif
}

void __init smp_intr_init()
diff -urpN -X /home/fletch/.diff.exclude oldsched/arch/ia64/kernel/smpboot.c newsched/arch/ia64/kernel/smpboot.c
--- oldsched/arch/ia64/kernel/smpboot.c Sat Jan 4 14:57:43 2003
+++ newsched/arch/ia64/kernel/smpboot.c Sat Jan 4 14:58:25 2003
@@ -528,9 +528,7 @@ smp_cpus_done (unsigned int dummy)

printk(KERN_INFO"Total of %d processors activated (%lu.%02lu
BogoMIPS).\n",
num_online_cpus(), bogosum/(500000/HZ), (bogosum/(5000/HZ))%100);
-#ifdef CONFIG_NUMA
build_node_data();
-#endif
}

int __devinit
diff -urpN -X /home/fletch/.diff.exclude oldsched/arch/ppc64/kernel/smp.c newsched/arch/ppc64/kernel/smp.c
--- oldsched/arch/ppc64/kernel/smp.c Sat Jan 4 14:57:43 2003
+++ newsched/arch/ppc64/kernel/smp.c Sat Jan 4 14:58:25 2003
@@ -685,7 +685,5 @@ void __init smp_cpus_done(unsigned int m

/* XXX fix this, xics currently relies on it - Anton */
smp_threads_ready = 1;
-#ifdef CONFIG_NUMA
build_node_data();
-#endif
}
diff -urpN -X /home/fletch/.diff.exclude oldsched/include/linux/sched.h newsched/include/linux/sched.h
--- oldsched/include/linux/sched.h Sat Jan 4 14:57:50 2003
+++ newsched/include/linux/sched.h Sat Jan 4 14:58:28 2003
@@ -20,7 +20,6 @@
#include <asm/mmu.h>

#include <linux/smp.h>
-#include <asm/topology.h>
#include <linux/sem.h>
#include <linux/signal.h>
#include <linux/securebits.h>
@@ -160,19 +159,6 @@ extern void update_one_process(struct ta
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
-#ifdef CONFIG_NUMA
-extern void sched_balance_exec(void);
-extern void node_nr_running_init(void);
-#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
- rq->nr_running++
-#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
- rq->nr_running--
-#else
-#define sched_balance_exec() {}
-#define node_nr_running_init() {}
-#define nr_running_inc(rq) rq->nr_running++
-#define nr_running_dec(rq) rq->nr_running--
-#endif

#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
@@ -458,8 +444,21 @@ extern void set_cpus_allowed(task_t *p,
#endif

#ifdef CONFIG_NUMA
-extern void build_pools(void);
+extern void build_node_data(void);
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--
+#else
+#define build_node_data() {}
+#define sched_balance_exec() {}
+#define node_nr_running_init() {}
+#define nr_running_inc(rq) rq->nr_running++
+#define nr_running_dec(rq) rq->nr_running--
#endif
+
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
diff -urpN -X /home/fletch/.diff.exclude oldsched/kernel/sched.c newsched/kernel/sched.c
--- oldsched/kernel/sched.c Sat Jan 4 14:57:50 2003
+++ newsched/kernel/sched.c Sat Jan 4 14:58:28 2003
@@ -159,10 +159,10 @@ struct runqueue {
struct list_head migration_queue;

atomic_t nr_iowait;
-
+#ifdef CONFIG_NUMA
unsigned long wait_time;
int wait_node;
-
+#endif
} ____cacheline_aligned;

static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -182,24 +182,33 @@ static struct runqueue runqueues[NR_CPUS
# define task_running(rq, p) ((rq)->curr == (p))
#endif

-#define cpu_to_node(cpu) __cpu_to_node(cpu)
+/*
+ * Node loads are scaled with LOADSCALE. This way:
+ * - we avoid zeros in integer divisions (dividing by node CPU number),
+ * - loads of nodes with different numbers of CPUs are comparable.
+ */
#define LOADSCALE 128

#ifdef CONFIG_NUMA
/* Number of CPUs per node: sane values until all CPUs are up */
-int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
-int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
-#define node_ncpus(node) _node_nr_cpus[node]
+static int _node_nr_cpus[MAX_NUMNODES] = { [0 ... MAX_NUMNODES-1] = NR_CPUS };
+static int node_ptr[MAX_NUMNODES+1]; /* first cpu of node (logical cpus are sorted!)*/
+#define node_nr_cpus(node) _node_nr_cpus[node]

+/*
+ * Delay stealing from remote node when own queue is idle/busy. These delays
+ * tend to equalize the node loads. NODE_DELAY_IDLE is nonzero because we
+ * want to give idle CPUs in the busiest node a chance to steal first.
+ */
#define NODE_DELAY_IDLE (1*HZ/1000)
#define NODE_DELAY_BUSY (20*HZ/1000)

/* the following macro implies that logical CPUs are sorted by node number */
#define loop_over_node(cpu,node) \
- for(cpu=node_ptr[node]; cpu<node_ptr[node+1]; cpu++)
+ for(cpu = node_ptr[node]; cpu < node_ptr[node+1]; cpu++)

/*
- * Build node_ptr and node_ncpus data after all CPUs have come up. This
+ * Build node_ptr and node_nr_cpus data after all CPUs have come up. This
* expects that the logical CPUs are sorted by their node numbers! Check
* out the NUMA API for details on why this should be that way. [EF]
*/
@@ -209,24 +218,25 @@ void build_node_data(void)
unsigned long mask;

ptr=0;
- for (n=0; n<numnodes; n++) {
+ for (n = 0; n < numnodes; n++) {
mask = __node_to_cpu_mask(n) & cpu_online_map;
node_ptr[n] = ptr;
- for (cpu=0; cpu<NR_CPUS; cpu++)
+ for (cpu = 0; cpu < NR_CPUS; cpu++)
if (mask & (1UL << cpu))
ptr++;
- node_ncpus(n) = ptr - node_ptr[n];;
+ node_nr_cpus(n) = ptr - node_ptr[n];
}
+ node_ptr[numnodes] = ptr;
printk("CPU nodes : %d\n",numnodes);
- for (n=0; n<numnodes; n++)
+ for (n = 0; n < numnodes; n++)
printk("node %d : %d .. %d\n",n,node_ptr[n],node_ptr[n+1]-1);
}

#else
-#define node_ncpus(node) num_online_cpus()
+#define node_nr_cpus(node) num_online_cpus()
#define NODE_DELAY_IDLE 0
#define NODE_DELAY_BUSY 0
-#define loop_over_node(cpu,n) for(cpu=0; cpu<NR_CPUS; cpu++)
+#define loop_over_node(cpu,n) for(cpu = 0; cpu < NR_CPUS; cpu++)
#endif


@@ -740,18 +750,18 @@ struct node_queue_data {
* Check if CPUs are balanced. The check is more involved than the O(1) original
* because that one is simply wrong in certain cases (integer arithmetic !!!) EF
*/
-#define CPUS_BALANCED(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
+#define cpus_balanced(m,t) (((m)<=1) || (((m)-(t))/2 < (((m)+(t))/2 + 3)/4))
/*
* Check if nodes are imbalanced. "this" node is balanced (compared to the "comp"
* node) when it's load is not smaller than "comp"'s load - LOADSCALE/2.
*/
-#define NODES_BALANCED(comp,this) (((comp)-(this)) < LOADSCALE/2)
+#define nodes_balanced(comp,this) (((comp)-(this)) < LOADSCALE/2)

static inline runqueue_t *find_busiest_queue(int this_cpu, int idle, int *nr_running)
{
runqueue_t *busiest_rq = NULL, *this_rq = cpu_rq(this_cpu), *src_rq;
int busiest_cpu, busiest_node=0, cpu, load, max_avg_load, avg_load;
- int n, steal_delay, system_load = 0, this_node=cpu_to_node(this_cpu);
+ int n, steal_delay, system_load = 0, this_node=__cpu_to_node(this_cpu);
struct node_queue_data nd[MAX_NUMNODES], *node;

if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
@@ -766,8 +776,9 @@ static inline runqueue_t *find_busiest_q

/* compute all node loads and save their max cpu loads */
for (cpu = 0; cpu < NR_CPUS; cpu++) {
- if (!cpu_online(cpu)) continue;
- node = &nd[cpu_to_node(cpu)];
+ if (!cpu_online(cpu))
+ continue;
+ node = &nd[__cpu_to_node(cpu)];
src_rq = cpu_rq(cpu);
if (idle || (src_rq->nr_running < this_rq->prev_nr_running[cpu]))
load = src_rq->nr_running;
@@ -783,7 +794,7 @@ static inline runqueue_t *find_busiest_q

busiest_cpu = nd[this_node].busiest_cpu;
if (busiest_cpu != this_cpu) {
- if (!CPUS_BALANCED(nd[this_node].busiest_cpu_load,*nr_running)) {
+ if (!cpus_balanced(nd[this_node].busiest_cpu_load,*nr_running)) {
busiest_rq = cpu_rq(busiest_cpu);
this_rq->wait_node = -1;
goto out;
@@ -795,7 +806,7 @@ static inline runqueue_t *find_busiest_q
for (n = 0; n < numnodes; n++) {
node = &nd[n];
system_load += node->total_load;
- node->average_load = node->total_load*LOADSCALE/node_ncpus(n);
+ node->average_load = node->total_load*LOADSCALE/node_nr_cpus(n);
if (node->average_load > max_avg_load) {
max_avg_load = node->average_load;
busiest_node = n;
@@ -804,7 +815,7 @@ static inline runqueue_t *find_busiest_q

/* Exit if not enough imbalance on any remote node. */
if ((busiest_node == this_node) || (max_avg_load <= LOADSCALE) ||
- NODES_BALANCED(max_avg_load,nd[this_node].average_load)) {
+ nodes_balanced(max_avg_load,nd[this_node].average_load)) {
this_rq->wait_node = -1;
goto out;
}
@@ -812,7 +823,7 @@ static inline runqueue_t *find_busiest_q
busiest_cpu = nd[busiest_node].busiest_cpu;
avg_load = system_load*LOADSCALE/num_online_cpus();
/* Wait longer before stealing if own node's load is average. */
- if (NODES_BALANCED(avg_load,nd[this_node].average_load))
+ if (nodes_balanced(avg_load,nd[this_node].average_load))
steal_delay = NODE_DELAY_BUSY;
else
steal_delay = NODE_DELAY_IDLE;
@@ -821,12 +832,11 @@ static inline runqueue_t *find_busiest_q
this_rq->wait_node = busiest_node;
this_rq->wait_time = jiffies;
goto out;
- } else
/* old most loaded node: check if waited enough */
- if (jiffies - this_rq->wait_time < steal_delay)
- goto out;
+ } else if (jiffies - this_rq->wait_time < steal_delay)
+ goto out;

- if ((!CPUS_BALANCED(nd[busiest_node].busiest_cpu_load,*nr_running))) {
+ if ((!cpus_balanced(nd[busiest_node].busiest_cpu_load,*nr_running))) {
busiest_rq = cpu_rq(busiest_cpu);
this_rq->wait_node = -1;
}
@@ -950,10 +960,10 @@ out:
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
*/
-#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)

static inline void idle_tick(runqueue_t *rq)

2003-01-06 03:52:29

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.53] NUMA scheduler (1/3)

On Sat, 2003-01-04 at 21:35, Martin J. Bligh wrote:
> >> Here comes the minimal NUMA scheduler built on top of the O(1) load
> >> balancer rediffed for 2.5.53 with some changes in the core part. As
> >> suggested by Michael, I added the cputimes_stat patch, as it is
> >> absolutely needed for understanding the scheduler behavior.
> >
> > Thanks for this latest patch. I've managed to cobble together
> > a 4 node NUMAQ system (16 x 700 MHz PIII procs, 16GB memory) and
> > ran kernbench and schedbench on this, along with the 2.5.50 and
> > 2.5.52 versions. Results remain fairly consistent with what
> > we have been obtaining on earlier versions.
> >
> > Kernbench:
> > Elapsed User System CPU
> > sched50 29.96s 288.308s 83.606s 1240.8%
> > sched52 29.836s 285.832s 84.464s 1240.4%
> > sched53 29.364s 284.808s 83.174s 1252.6%
> > stock50 31.074s 303.664s 89.194s 1264.2%
> > stock53 31.204s 306.224s 87.776s 1263.2%
>
> Not sure what you're correlating here because your rows are all named
> the same thing. However, the new version seems to be much slower
> on systime (about 7-8% for me), which roughly correlates with your
> last two rows above. Me no like.

Sorry, I forgot to include a better description of what the
row labels mean.

sched50 = linux 2.5.50 with the NUMA scheduler
sched52 = linux 2.5.52 with the NUMA scheduler
sched53 = linux 2.5.53 with the NUMA scheduler
stock50 = linux 2.5.50 without the NUMA scheduler
stock53 = linux 2.5.53 without the NUMA scheduler

Thus, this shows that the NUMA scheduler drops systime by ~5.5 secs,
or roughly 8%. So, my testing is not showing an increase in systime
like you apparently are seeing.

Michael
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-06 05:59:14

by Martin J. Bligh

Subject: Re: [PATCH 2.5.53] NUMA scheduler (1/3)

>> > Kernbench:
>> > Elapsed User System CPU
>> > sched50 29.96s 288.308s 83.606s 1240.8%
>> > sched52 29.836s 285.832s 84.464s 1240.4%
>> > sched53 29.364s 284.808s 83.174s 1252.6%
>> > stock50 31.074s 303.664s 89.194s 1264.2%
>> > stock53 31.204s 306.224s 87.776s 1263.2%
>>
>> Not sure what you're correlating here because your rows are all named
>> the same thing. However, the new version seems to be much slower
>> on systime (about 7-8% for me), which roughly correlates with your
>> last two rows above. Me no like.
>
> Sorry, I forgot to include a bit better description of what the
> row labels mean.
>
> sched50 = linux 2.5.50 with the NUMA scheduler
> sched52 = linux 2.5.52 with the NUMA scheduler
> sched53 = linux 2.5.53 with the NUMA scheduler
> stock50 = linux 2.5.50 without the NUMA scheduler
> stock53 = linux 2.5.53 without the NUMA scheduler
>
> Thus, this shows that the NUMA scheduler drops systime by ~5.5 secs,
> or roughly 8%. So, my testing is not showing an increase in systime
> like you apparently are seeing.

Sorry, the row names weren't that bad if I actually read them carefully ;-)

I was doing a slightly different test - Erich's old sched code vs the new
both on 2.5.54, and seem to have a degradation.

M.

2003-01-07 02:15:58

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.53] NUMA scheduler (1/3)

On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> >> > Kernbench:
> >> > Elapsed User System CPU
> >> > sched50 29.96s 288.308s 83.606s 1240.8%
> >> > sched52 29.836s 285.832s 84.464s 1240.4%
> >> > sched53 29.364s 284.808s 83.174s 1252.6%
> >> > stock50 31.074s 303.664s 89.194s 1264.2%
> >> > stock53 31.204s 306.224s 87.776s 1263.2%
> >
> > sched50 = linux 2.5.50 with the NUMA scheduler
> > sched52 = linux 2.5.52 with the NUMA scheduler
> > sched53 = linux 2.5.53 with the NUMA scheduler
> > stock50 = linux 2.5.50 without the NUMA scheduler
> > stock53 = linux 2.5.53 without the NUMA scheduler
>
> I was doing a slightly different test - Erich's old sched code vs the new
> both on 2.5.54, and seem to have a degradation.
>
> M.

Martin,

I ran 2.5.54 with an older version of Erich's NUMA scheduler and
with the version sent out for 2.5.53. Results were similar:

Kernbench:
Elapsed User System CPU
sched54 29.112s 283.888s 82.84s 1259.4%
oldsched54 29.436s 286.942s 82.722s 1256.2%

sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler

The numbers for the new version are actually a touch better, but
close enough to be within a reasonable margin of error.

I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.

Michael

--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-07 11:18:09

by Erich Focht

Subject: Re: [PATCH 2.5.53] NUMA scheduler (1/3)

Hi Michael and Martin,

thanks a lot for the testing!

I rechecked the changes and really don't see any reason for a
slowdown. Michael's measurements seem to confirm that this is just a
statistical effect. I suggest: when in doubt, do 10 kernel compiles
instead of 5. A simple statistical error estimation as I did for
schedbench might help, too. I guess I sent you the script a while
ago.
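
For reference, a tiny stand-alone example of the kind of error estimate
meant here (mean and standard error of the mean over N elapsed times; the
input values are placeholders, not measurements):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double t[] = { 29.9, 29.8, 29.4, 29.1, 29.4 }; /* placeholder elapsed times (s) */
        int i, n = sizeof(t) / sizeof(t[0]);
        double sum = 0.0, var = 0.0, mean, sem;

        for (i = 0; i < n; i++)
                sum += t[i];
        mean = sum / n;
        for (i = 0; i < n; i++)
                var += (t[i] - mean) * (t[i] - mean);
        var /= n - 1;                   /* sample variance */
        sem = sqrt(var / n);            /* standard error of the mean */
        printf("mean %.3f s +/- %.3f s\n", mean, sem);
        return 0;
}

Going from 5 to 10 runs shrinks that error bar by about a factor of
sqrt(2), which helps when the differences are only a percent or two.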

I understand from your emails that the 2.5.53 patches apply and work
for 2.5.54, therefore I'll wait for 2.5.55 with a rediff.

Regards,
Erich


On Tuesday 07 January 2003 03:23, Michael Hohnbaum wrote:
> On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> > >> > Kernbench:
> > >> > Elapsed User System CPU
> > >> > sched50 29.96s 288.308s 83.606s 1240.8%
> > >> > sched52 29.836s 285.832s 84.464s 1240.4%
> > >> > sched53 29.364s 284.808s 83.174s 1252.6%
> > >> > stock50 31.074s 303.664s 89.194s 1264.2%
> > >> > stock53 31.204s 306.224s 87.776s 1263.2%
> > >
> > > sched50 = linux 2.5.50 with the NUMA scheduler
> > > sched52 = linux 2.5.52 with the NUMA scheduler
> > > sched53 = linux 2.5.53 with the NUMA scheduler
> > > stock50 = linux 2.5.50 without the NUMA scheduler
> > > stock53 = linux 2.5.53 without the NUMA scheduler
> >
> > I was doing a slightly different test - Erich's old sched code vs the new
> > both on 2.5.54, and seem to have a degradation.
> >
> > M.
>
> Martin,
>
> I ran 2.5.54 with an older version of Erich's NUMA scheduler and
> with the version sent out for 2.5.53. Results were similar:
>
> Kernbench:
> Elapsed User System CPU
> sched54 29.112s 283.888s 82.84s 1259.4%
> oldsched54 29.436s 286.942s 82.722s 1256.2%
>
> sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
> oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler
>
> The numbers for the new version are actually a touch better, but
> close enough to be within a reasonable margin of error.
>
> I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.
>
> Michael

2003-01-07 23:25:42

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.53] NUMA scheduler (1/3)

On Tue, 2003-01-07 at 03:27, Erich Focht wrote:
> Hi Michael and Martin,
>
> thanks a lot for the testing!
>
> I rechecked the changes and really don't see any reason for a
> slowdown. Michael's measurements seem to confirm that this is just a
> statistical effect. I suggest: when in doubt, do 10 kernel compiles
> instead of 5. A simple statistical error estimation as I did for
> schedbench might help, too. Guess I've sent you the script a while
> ago.
>
One more set of test results, this time including schedbench. Previous
runs did not have the cputimes_stat patch, so the schedbench numbers
were not particularly useful.

Kernbench:
Elapsed User System CPU
oldsched54 29.75s 287.02s 82.876s 1242.8%
sched54 29.112s 283.888s 82.84s 1259.4%
stock54 31.348s 303.134s 87.824s 1247.2%
sched50 29.96s 288.308s 83.606s 1240.8%
stock50 31.074s 303.664s 89.194s 1264.2%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
oldsched54 20.44 37.41 81.81 0.93
sched54 22.03 34.90 88.15 0.75
stock54 49.35 57.53 197.45 0.86
sched50 22.00 35.50 88.04 0.86
stock50 27.18 45.99 108.75 0.87

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
oldsched54 31.98 51.59 255.88 1.93
sched54 27.95 37.12 223.66 1.50
stock54 43.14 62.97 345.18 2.12
sched50 32.05 46.81 256.45 1.81
stock50 44.75 63.68 358.09 2.55

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
oldsched54 58.39 80.29 934.38 4.30
sched54 55.37 69.58 886.10 3.79
stock54 66.00 81.25 1056.25 7.12
sched50 55.34 70.74 885.61 4.09
stock50 65.36 80.72 1045.94 5.92

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
oldsched54 54.53 119.71 1745.37 11.70
sched54 57.93 132.11 1854.01 10.74
stock54 77.81 173.26 2490.31 12.37
sched50 59.47 139.77 1903.37 10.24
stock50 84.51 194.89 2704.83 12.44

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
oldsched54 79.57 339.88 5093.31 20.74
sched54 72.91 308.87 4667.03 21.06
stock54 86.68 368.55 5548.57 25.73
sched50 63.55 276.81 4067.65 21.58
stock50 99.68 422.06 6380.04 25.92

Row names are the same as before (see below). Numbers seem to be
fairly consistent.

> I understand from your emails that the 2.5.53 patches apply and work
> for 2.5.54, therefore I'll wait for 2.5.55 with a rediff.

It is fine with me to wait for 2.5.55 to rediff. I'm getting patch
errors now with the cputime_stat patch, so a rediff of that with
2.5.55 would be handy (although I suppose I should quit being lazy
and just do that myself).
>
> Regards,
> Erich

Michael
>
>
> On Tuesday 07 January 2003 03:23, Michael Hohnbaum wrote:
> > On Sun, 2003-01-05 at 22:07, Martin J. Bligh wrote:
> > > >> > Kernbench:
> > > >> > Elapsed User System CPU
> > > >> > sched50 29.96s 288.308s 83.606s 1240.8%
> > > >> > sched52 29.836s 285.832s 84.464s 1240.4%
> > > >> > sched53 29.364s 284.808s 83.174s 1252.6%
> > > >> > stock50 31.074s 303.664s 89.194s 1264.2%
> > > >> > stock53 31.204s 306.224s 87.776s 1263.2%
> > > >
> > > > sched50 = linux 2.5.50 with the NUMA scheduler
> > > > sched52 = linux 2.5.52 with the NUMA scheduler
> > > > sched53 = linux 2.5.53 with the NUMA scheduler
> > > > stock50 = linux 2.5.50 without the NUMA scheduler
> > > > stock53 = linux 2.5.53 without the NUMA scheduler
> > >
> > > I was doing a slightly different test - Erich's old sched code vs the new
> > > both on 2.5.54, and seem to have a degradation.
> > >
> > > M.
> >
> > Martin,
> >
> > I ran 2.5.54 with an older version of Erich's NUMA scheduler and
> > with the version sent out for 2.5.53. Results were similar:
> >
> > Kernbench:
> > Elapsed User System CPU
> > sched54 29.112s 283.888s 82.84s 1259.4%
> > oldsched54 29.436s 286.942s 82.722s 1256.2%
> >
> > sched54 = linux 2.5.54 with the 2.5.53 version of the NUMA scheduler
> > oldsched54 = linux 2.5.54 with an earlier version of the NUMA scheduler
> >
> > The numbers for the new version are actually a touch better, but
> > close enough to be within a reasonable margin of error.
> >
> > I'll post numbers against stock 2.5.54 and include schedbench, tomorrow.
> >
> > Michael
>
>
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486