2002-10-25 17:31:58

by Erich Focht

Subject: [PATCH] NUMA scheduler 1/2

Here come the rediffed (for 2.5.44) patches for my version of the
NUMA scheduler extensions. I'm only sending the first two parts of
the complete set of 5 patches (which make the node affine NUMA scheduler
with dynamic homenode selection). The two patches lead to a pooling
NUMA scheduler with initial load balancing at exec().

The balancing strategy so far is:
- try to balance within a task's own node first;
- when balancing across nodes, try to avoid big differences in node loads;
- when doing an exec(), move the task to the least loaded node.

On a 16-CPU NEC Azusa these two patches lead to roughly a factor of 4
improvement in Rusty's "hackbench" test (which he now calls
schedbench).

Patch descriptions are as in my previous post:
01-numa_sched_core-2.5.44-10a.patch :
Provides basic NUMA functionality. It implements CPU pools
with all the mess needed to initialize them. It also has a
node-aware find_busiest_queue() which first scans the task's
own node for more loaded CPUs. If no steal candidate is found
there, it finds the most loaded node and tries to steal a
task from it. By imposing delays on remote-node steals it
tries to achieve equal node loads. These delays can be extended
to cope with multi-level node hierarchies (that patch is not
included).
02-numa_sched_ilb-2.5.44-10.patch :
This patch provides simple initial load balancing during exec().
It is node aware and will select the least loaded node. Also it
does a round-robin initial node selection to distribute the load
better across the nodes.

The patches should run on ia32 NUMAQ and ia64 Azusa & TX7. Other
architectures just need the build_node() call similar to
arch/i386/kernel/smpboot.c. Be careful to REALLY initialize
cache_decay_ticks (which shouldn't be zero on an SMP machine anyway).

The first patch provides important infrastructure for any following
NUMA scheduler patches. It introduces CPU pools and a way to loop over
single CPU pools. The pool data initialization is somewhat messy and
sensitive. I'm trying to rewrite it to use RCU; the problem is
that we have to initialize the pool data to something reasonable before
we know how many CPUs will be up and before the cpu_to_node() macro
delivers reasonable numbers. Later on, the pool data must be initialized
(and could be changed by CPU hotplug) in a way that goes unnoticed by
the load balancer...

Regards,
Erich


Attachments:
01-numa_sched_core-2.5.44-10a.patch (18.39 kB)

2002-10-25 17:33:17

by Erich Focht

Subject: [PATCH] NUMA scheduler 2/2

The second part of the patch:

> 02-numa_sched_ilb-2.5.44-10.patch :
> This patch provides simple initial load balancing during exec().
> It is node aware and will select the least loaded node. Also it
> does a round-robin initial node selection to distribute the load
> better across the nodes.

Regards,
Erich


Attachments:
02-numa_sched_ilb-2.5.44-10.patch (3.13 kB)

2002-10-26 00:24:45

by Michael Hohnbaum

Subject: Re: [PATCH] NUMA scheduler 1/2

On Fri, 2002-10-25 at 10:37, Erich Focht wrote:
> Here come the rediffed (for 2.5.44) patches for my version of the
> NUMA scheduler extensions. I'm only sending the first two parts of
> the complete set of 5 patches (which make the node affine NUMA scheduler
> with dynamic homenode selection). The two patches lead to a pooling
> NUMA scheduler with initial load balancing at exec().
>

These patches produced a kernel that built and booted first try for
me. Thanks. I ran kernbench and your numa_test (schedbench) on
this numa scheduler (erich44), my simple numa scheduler (hbaum44), and
a stock kernel (stock44).

Kernbench:
                Elapsed       User     System        CPU
stock44          21.08s    196.80s     58.14s    1208.8%
hbaum44          20.49s    192.57s     50.32s    1184.8%
erich44          21.01s    193.47s     56.71s    1191.0%

Schedbench 4:
                Elapsed  TotalUser   TotalSys    AvgUser
stock44           39.47      49.99     157.94       0.96
hbaum44           38.43      48.76     153.77       1.12
erich44           24.28      36.10      97.15       0.79

Schedbench 8:
                Elapsed  TotalUser   TotalSys    AvgUser
stock44           49.46      71.07     395.77       1.92
hbaum44           37.52      57.99     300.25       2.17
erich44           30.67      47.93     245.48       2.59

Schedbench 16:
                Elapsed  TotalUser   TotalSys    AvgUser
stock44           64.17      81.48    1026.94       6.41
hbaum44           52.23      73.23     835.81       5.18
erich44           52.25      61.20     836.12       4.69

Schedbench 32:
                Elapsed  TotalUser   TotalSys    AvgUser
stock44           72.45     165.86    2318.84      12.78
hbaum44           56.74     137.58    1816.17       8.81
erich44           55.98     121.19    1791.58       9.35

Schedbench 64:
                Elapsed  TotalUser   TotalSys    AvgUser
stock44          110.31     461.29    7060.60      26.02
hbaum44           58.30     255.90    3732.08      20.10
erich44           56.94     237.09    3644.95      21.26

The results seem fairly consistent with what we have been seeing all
along. Erich's scheduler tends to be about the same as stock on
kernbench, while mine is roughly 5% better.

On schedbench, Erich's does better on small loads, but as the load
increases to one task per CPU it becomes a dead heat between the two.

It is probably worth noting that my scheduler change is a bit smaller,
with 146 insertions and 27 deletions across 3 files, versus 432 insertions
and 127 deletions across 4 files. But that should be expected, as my goal
was to keep the changes as small as possible while still providing
measurable performance gains.
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486