2003-01-14 04:28:15

by Andrew Theurer

Subject: Re: [Lse-tech] Re: NUMA scheduler 2nd approach

> Erich,
>
> I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
> of the time working with the first three patches. What I found was
> that rebalancing was happening too much between nodes. I tried a
> few things to change this, but have not yet settled on the best
> approach. A key item to work with is the check in find_busiest_node
> to determine if the found node is busier enough to warrant stealing
> from it. Currently the check is that the node has 125% of the load
> of the current node. I think that, for my system at least, we need
> to add in a constant to this equation. I tried using 4 and that
> helped a little.

Michael,

in:

+static int find_busiest_node(int this_node)
+{
+ int i, node = this_node, load, this_load, maxload;
+
+ this_load = maxload = atomic_read(&node_nr_running[this_node]);
+ for (i = 0; i < numnodes; i++) {
+ if (i == this_node)
+ continue;
+ load = atomic_read(&node_nr_running[i]);
+ if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+ maxload = load;
+ node = i;
+ }
+ }
+ return node;
+}

You changed ((5*4*this_load)/4) to:
(5*4*(this_load+4)/4)
or
(4+(5*4*(this_load)/4)) ?

We def need some constant to avoid low load ping pong, right?

> Finally I added in the 04 patch, and that helped
> a lot. Still, there is too much process movement between nodes.

perhaps increase INTERNODE_LB?

-Andrew Theurer



2003-01-14 04:47:44

by Martin J. Bligh

Subject: Re: [Lse-tech] Re: NUMA scheduler 2nd approach

>> I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
>> of the time working with the first three patches. What I found was
>> that rebalancing was happening too much between nodes. I tried a
>> few things to change this, but have not yet settled on the best
>> approach. A key item to work with is the check in find_busiest_node
>> to determine if the found node is busier enough to warrant stealing
>> from it. Currently the check is that the node has 125% of the load
>> of the current node. I think that, for my system at least, we need
>> to add in a constant to this equation. I tried using 4 and that
>> helped a little.
>
> Michael,
>
> in:
>
> +static int find_busiest_node(int this_node)
> +{
> + int i, node = this_node, load, this_load, maxload;
> +
> + this_load = maxload = atomic_read(&node_nr_running[this_node]);
> + for (i = 0; i < numnodes; i++) {
> + if (i == this_node)
> + continue;
> + load = atomic_read(&node_nr_running[i]);
> + if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> + maxload = load;
> + node = i;
> + }
> + }
> + return node;
> +}
>
> You changed ((5*4*this_load)/4) to:
> (5*4*(this_load+4)/4)
> or
> (4+(5*4*(this_load)/4)) ?
>
> We def need some constant to avoid low load ping pong, right?
>
>> Finally I added in the 04 patch, and that helped
>> a lot. Still, there is too much process movement between nodes.
>
> perhaps increase INTERNODE_LB?

Before we tweak this too much, how about using the global load
average for this? I can envisage a situation where we have two
nodes with 8 tasks per node, one with 12 tasks, and one with four.
You really don't want the ones with 8 tasks pulling stuff from
the 12 ... only for the least loaded node to start pulling stuff
later.

What about if we take the global load average, and multiply by
num_cpus_on_this_node / num_cpus_globally ... that'll give us
roughly what we should have on this node. If we're significantly
underloaded compared to that, we start pulling stuff from
the busiest node? And you get the damping over time for free.
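
Roughly something like this (a sketch only - the helper for counting cpus
in a node is invented, and it uses the instantaneous nr_running() rather
than the smoothed load average just to keep it short):

static int node_under_fair_share(int this_node)
{
        int global_load = nr_running();         /* system-wide running tasks */
        int fair_share = global_load * nr_cpus_on_node(this_node)
                                        / num_online_cpus();

        /* only pull from the busiest node if we're clearly below our share */
        return atomic_read(&node_nr_running[this_node]) + 2 < fair_share;
}

The damping would come from feeding the averaged load in instead of the
raw counts, and the "+ 2" slack is just a placeholder for "significantly".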

I think it'd be best if we stay fairly conservative for this, all
we're trying to catch is the corner case where a bunch of stuff
has forked, but not execed, and we have a long term imbalance.
Aggressive rebalancing might win a few benchmarks, but you'll just
cause inter-node task bounce on others.

M.

2003-01-14 05:43:22

by Michael Hohnbaum

Subject: Re: [Lse-tech] Re: NUMA scheduler 2nd approach

On Mon, 2003-01-13 at 20:45, Andrew Theurer wrote:
> > Erich,
> >
> > I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
> > of the time working with the first three patches. What I found was
> > that rebalancing was happening too much between nodes. I tried a
> > few things to change this, but have not yet settled on the best
> > approach. A key item to work with is the check in find_busiest_node
> > to determine if the found node is busier enough to warrant stealing
> > from it. Currently the check is that the node has 125% of the load
> > of the current node. I think that, for my system at least, we need
> > to add in a constant to this equation. I tried using 4 and that
> > helped a little.
>
> Michael,
>
> in:
>
> +static int find_busiest_node(int this_node)
> +{
> + int i, node = this_node, load, this_load, maxload;
> +
> + this_load = maxload = atomic_read(&node_nr_running[this_node]);
> + for (i = 0; i < numnodes; i++) {
> + if (i == this_node)
> + continue;
> + load = atomic_read(&node_nr_running[i]);
> + if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> + maxload = load;
> + node = i;
> + }
> + }
> + return node;
> +}
>
> You changed ((5*4*this_load)/4) to:
> (5*4*(this_load+4)/4)
> or
> (4+(5*4*(this_load)/4)) ?

I suppose I should not have been so dang lazy and should have cut-n-pasted
the line I changed. The change was (((5*4*this_load)/4) + 4),
which should be the same as your second choice.
>
> We def need some constant to avoid low load ping pong, right?

Yep. Without the constant, one could have 6 processes on node
A and 4 on node B, and node B would end up stealing. While that makes
for a perfect balance, the expense of the off-node traffic does not
justify it, at least on the NUMAQ box. It might be justified
for a different NUMA architecture, which is why I propose putting
this check in a macro that can be defined in topology.h for each
architecture.
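
To make that concrete, the check with the constant pulled out into a
per-arch macro would look something like this (the macro name is invented
for illustration, it isn't in any of the posted patches):

/* hypothetical per-arch knob, e.g. in asm/topology.h */
#define NODE_STEAL_OFFSET	4	/* could become nr_cpus_in_node elsewhere */

        /* in find_busiest_node(): remote node must be >25% busier, plus
         * the fixed offset, before we consider stealing from it */
        if (load > maxload &&
            (4*load > ((5*4*this_load)/4) + NODE_STEAL_OFFSET)) {
                maxload = load;
                node = i;
        }

With this_load = 4 and load = 6 the test reads 24 > 24, which is false, so
the 6-vs-4 split above is tolerated instead of triggering a remote steal.
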
>
> > Finally I added in the 04 patch, and that helped
> > a lot. Still, there is too much process movement between nodes.
>
> perhaps increase INTERNODE_LB?

That is on the list to try. Martin was mumbling something about
using the system-wide load average to help make the inter-node
balance decision. I'd like to give that a try before tweaking
INTERNODE_LB.
>
> -Andrew Theurer
>
Michael Hohnbaum
[email protected]

2003-01-14 11:04:52

by Erich Focht

Subject: Re: [Lse-tech] Re: NUMA scheduler 2nd approach

Hi Martin,

> Before we tweak this too much, how about using the global load
> average for this? I can envisage a situation where we have two
> nodes with 8 tasks per node, one with 12 tasks, and one with four.
> You really don't want the ones with 8 tasks pulling stuff from
> the 12 ... only for the least loaded node to start pulling stuff
> later.

Hmmm, yet another idea from the old NUMA scheduler coming back,
therefore it has my full support ;-). Though we can't do it as I did
it there: in the old NUMA approach every cross-node steal was delayed,
only 1-2ms if the own node was underloaded, a lot more if the own
node's load was average or above average. We might need to finally
steal something even if we're having "above average" load, because the
cpu mask of the tasks on the overloaded node might only allow them to
migrate to us... But this is also a special case which we should solve
later.


> What about if we take the global load average, and multiply by
> num_cpus_on_this_node / num_cpus_globally ... that'll give us
> roughly what we should have on this node. If we're significantly
> underloaded compared to that, we start pulling stuff from
> the busiest node? And you get the damping over time for free.

Patch 05 is going towards this direction but the constants I chose
were very aggressive. I'll update the whole set of patches and try to
send out something today.

Best regards,
Erich

2003-01-14 15:46:10

by Erich Focht

Subject: [PATCH 2.5.58] new NUMA scheduler

Here's the new version of the NUMA scheduler built on top of the
miniature scheduler of Martin. I incorporated Michael's ideas and
Christoph's suggestions and rediffed for 2.5.58.

The whole patch is really tiny: 9.5k. This time I attached the numa
scheduler in form of two patches:

numa-sched-2.5.58.patch (7k) : components 01, 02, 03
numa-sched-add-2.5.58.patch (3k) : components 04, 05

The single components are also attached in a small tgz archive:

01-minisched-2.5.58.patch : the miniature scheduler from
Martin. Balances strictly within a node. Removed the
find_busiest_in_mask() function.

02-initial-lb-2.5.58.patch : Michael's initial load balancer at
exec(). Cosmetic corrections suggested by Christoph.

03-internode-lb-2.5.58.patch : internode load balancer core. Called
after NODE_BALANCE_RATE calls of the intra-node load balancer. Tunable
parameters:
NODE_BALANCE_RATE (default: 10)
NODE_THRESHOLD (default: 125) : consider only nodes with load
above NODE_THRESHOLD/100 * own_node_load
I added the constant factor of 4 suggested by Michael, but I'm not
really happy with it. This should be nr_cpus_in_node, but we don't
have that info in topology.h

04-smooth-node-load-2.5.58.patch : The node load measure is smoothed
by adding half of the previous node load (and 1/4 of the one before,
etc..., as discussed in the LSE call). This should improve the behavior
a bit in case of short-lived load peaks and avoid bouncing tasks
between nodes. (A sketch combining this with the 03 threshold check
appears after the component list.)

05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATE
interval (between cross-node balancer calls) by a variable
node-specific interval. Currently only two values used:
NODE_BALANCE_MIN : 10
NODE_BALANCE_MAX : 40
If the node load is less than avg_node_load/2, we switch to
NODE_BALANCE_MIN, otherwise we use the large interval.
I also added a function to reduce the number of tasks stolen from
remote nodes.
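
For those who don't want to open the attachments, 03 and 04 together make
find_busiest_node() look roughly like this (written from memory, so the
smoothing expression and the exact threshold test may differ slightly
from the attached patches):

static int find_busiest_node(int this_node)
{
        int i, node = this_node, load, this_load, maxload;

        this_load = maxload = atomic_read(&node_nr_running[this_node]);
        for (i = 0; i < numnodes; i++) {
                if (i == this_node)
                        continue;
                /* 04: decaying load: current + 1/2 previous + 1/4 before that... */
                load = (this_rq()->prev_node_load[i] >> 1)
                        + atomic_read(&node_nr_running[i]);
                this_rq()->prev_node_load[i] = load;
                /* 03: only consider nodes above NODE_THRESHOLD/100 of our load */
                if (load > maxload &&
                    100*load > NODE_THRESHOLD*this_load) {
                        maxload = load;
                        node = i;
                }
        }
        return node;
}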

Regards,
Erich


Attachments:
numa-sched-2.5.58.patch (7.59 kB)
numa-sched-add-2.5.58.patch (3.03 kB)
numa-sched-patches.tgz (3.78 kB)

2003-01-14 15:59:06

by Christoph Hellwig

Subject: Re: [Lse-tech] [PATCH 2.5.58] new NUMA scheduler

On Tue, Jan 14, 2003 at 04:55:06PM +0100, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.

This one looks a lot nicer. You might also want to take the different
nr_running stuff from the patch I posted, I think it looks a lot nicer.

The patch (not updated yet) is below again for reference.



--- 1.62/fs/exec.c Fri Jan 10 08:21:00 2003
+++ edited/fs/exec.c Mon Jan 13 15:33:32 2003
@@ -1031,6 +1031,8 @@
int retval;
int i;

+ sched_balance_exec();
+
file = open_exec(filename);

retval = PTR_ERR(file);
--- 1.119/include/linux/sched.h Sat Jan 11 07:44:15 2003
+++ edited/include/linux/sched.h Mon Jan 13 15:58:11 2003
@@ -444,6 +444,14 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif

+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+# define sched_balance_exec() do { } while (0)
+# define node_nr_running_init() do { } while (0)
+#endif
+
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
--- 1.91/init/main.c Mon Jan 6 04:08:49 2003
+++ edited/init/main.c Mon Jan 13 15:33:33 2003
@@ -495,6 +495,7 @@

migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}

--- 1.148/kernel/sched.c Sat Jan 11 07:44:22 2003
+++ edited/kernel/sched.c Mon Jan 13 16:17:34 2003
@@ -67,6 +67,7 @@
#define INTERACTIVE_DELTA 2
#define MAX_SLEEP_AVG (2*HZ)
#define STARVATION_LIMIT (2*HZ)
+#define NODE_BALANCE_RATIO 10

/*
* If a task is 'interactive' then we reinsert it in the active
@@ -154,6 +155,11 @@
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];

+#ifdef CONFIG_NUMA
+ atomic_t *node_nr_running;
+ int nr_balanced;
+#endif
+
task_t *migration_thread;
struct list_head migration_queue;

@@ -178,6 +184,38 @@
#endif

/*
+ * Keep track of running tasks.
+ */
+#if CONFIG_NUMA
+
+/* XXX(hch): this should go into a struct sched_node_data */
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+ {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+ rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(struct runqueue *rq)
+{
+ atomic_inc(rq->node_nr_running);
+ rq->nr_running++;
+}
+
+static inline void nr_running_dec(struct runqueue *rq)
+{
+ atomic_dec(rq->node_nr_running);
+ rq->nr_running--;
+}
+
+#else
+# define nr_running_init(rq) do { } while (0)
+# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
+/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
* explicitly disabling preemption.
@@ -294,7 +332,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}

/*
@@ -302,7 +340,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -624,9 +662,108 @@
spin_unlock(&rq2->lock);
}

-#if CONFIG_SMP
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ *
+ * Note: This isn't actually numa-specific, but just not used otherwise.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask = p->cpus_allowed;
+
+ if (old_mask & (1UL << dest_cpu)) {
+ unsigned long flags;
+ struct runqueue *rq;
+
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ rq = task_rq_lock(p, &flags);
+ p->cpus_allowed = old_mask;
+ task_rq_unlock(rq, &flags);
+ }
+}

/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+ unsigned long cpumask;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+
+ minload = 10000000;
+ cpumask = __node_to_cpu_mask(node);
+ for (i = 0; i < NR_CPUS; ++i) {
+ if (!(cpumask & (1UL << i)))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+
+static int find_busiest_node(int this_node)
+{
+ int i, node = this_node, load, this_load, maxload;
+
+ this_load = maxload = atomic_read(&node_nr_running[this_node]);
+ for (i = 0; i < numnodes; i++) {
+ if (i == this_node)
+ continue;
+ load = atomic_read(&node_nr_running[i]);
+ if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+ maxload = load;
+ node = i;
+ }
+ }
+
+ return node;
+}
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++)
+ cpu_rq(i)->node_nr_running = node_nr_running + __cpu_to_node(i);
+}
+#endif /* CONFIG_NUMA */
+
+#if CONFIG_SMP
+/*
* double_lock_balance - lock the busiest runqueue
*
* this_rq is locked already. Recalculate nr_running if we have to
@@ -652,9 +789,10 @@
}

/*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+ int idle, int *imbalance, unsigned long cpumask)
{
int nr_running, load, max_load, i;
runqueue_t *busiest, *rq_src;
@@ -689,7 +827,7 @@
busiest = NULL;
max_load = 1;
for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
+ if (!cpu_online(i) || !((1UL << i) & cpumask))
continue;

rq_src = cpu_rq(i);
@@ -736,9 +874,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -758,13 +896,27 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, idx, this_cpu, this_node;
+ unsigned long cpumask;
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;

- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ this_cpu = smp_processor_id();
+ this_node = __cpu_to_node(this_cpu);
+ cpumask = __node_to_cpu_mask(this_node);
+
+#if CONFIG_NUMA
+ /*
+ * Avoid rebalancing between nodes too often.
+ */
+ if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO))
+ cpumask |= __node_to_cpu_mask(find_busiest_node(this_node));
+#endif
+
+ busiest = find_busiest_queue(this_rq, this_cpu, idle,
+ &imbalance, cpumask);
if (!busiest)
goto out;

@@ -2231,6 +2383,7 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+ nr_running_init(rq);

for (j = 0; j < 2; j++) {
array = rq->arrays + j;

2003-01-14 16:14:27

by Erich Focht

Subject: [PATCH 2.5.58] new NUMA scheduler: fix

In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
and this was present in the numa-sched-2.5.58.patch and
numa-sched-add-2.5.58.patch, too. Please use the patches attached to
this email! Sorry for the silly mistake...

Christoph, I used your way of coding nr_running_inc/dec now.

Regards,
Erich


On Tuesday 14 January 2003 16:55, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.
>
> The whole patch is really tiny: 9.5k. This time I attached the numa
> scheduler in form of two patches:
>
> numa-sched-2.5.58.patch (7k) : components 01, 02, 03
> numa-sched-add-2.5.58.patch (3k) : components 04, 05
>
> The single components are also attached in a small tgz archive:
>
> 01-minisched-2.5.58.patch : the miniature scheduler from
> Martin. Balances strictly within a node. Removed the
> find_busiest_in_mask() function.
>
> 02-initial-lb-2.5.58.patch : Michael's initial load balancer at
> exec(). Cosmetic corrections suggested by Christoph.
>
> 03-internode-lb-2.5.58.patch : internode load balancer core. Called
> after NODE_BALANCE_RATE calls of the intra-node load balancer. Tunable
> parameters:
> NODE_BALANCE_RATE (default: 10)
> NODE_THRESHOLD (default: 125) : consider only nodes with load
> above NODE_THRESHOLD/100 * own_node_load
> I added the constant factor of 4 suggested by Michael, but I'm not
> really happy with it. This should be nr_cpus_in_node, but we don't
> have that info in topology.h
>
> 04-smooth-node-load-2.5.58.patch : The node load measure is smoothed
> by adding half of the previous node load (and 1/4 of the one before,
> etc..., as discussed in the LSE call). This should improve a bit the
> behavior in case of short timed load peaks and avoid bouncing tasks
> between nodes.
>
> 05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATE
> interval (between cross-node balancer calls) by a variable
> node-specific interval. Currently only two values used:
> NODE_BALANCE_MIN : 10
> NODE_BALANCE_MAX : 40
> If the node load is less than avg_node_load/2, we switch to
> NODE_BALANCE_MIN, otherwise we use the large interval.
> I also added a function to reduce the number of tasks stolen from
> remote nodes.
>
> Regards,
> Erich


Attachments:
numa-sched-2.5.58.patch (7.97 kB)
numa-sched-add-2.5.58.patch (3.04 kB)
numa-sched-patches.tgz (3.93 kB)

2003-01-14 16:34:13

by Erich Focht

Subject: [PATCH 2.5.58] new NUMA scheduler: fix

Aargh, I should have gone home earlier...
For those who really care about patch 05, it's attached. It's all
untested as I don't have an ia32 NUMA machine running 2.5.58...

Erich


On Tuesday 14 January 2003 17:23, Erich Focht wrote:
> In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
> and this was present in the numa-sched-2.5.58.patch and
> numa-sched-add-2.5.58.patch, too. Please use the patches attached to
> this email! Sorry for the silly mistake...
>
> Christoph, I used your way of coding nr_running_inc/dec now.
>
> Regards,
> Erich


Attachments:
05-var-intnode-lb-2.5.58.patch (2.88 kB)

2003-01-14 16:42:28

by Christoph Hellwig

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix


+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+#define sched_balance_exec() {}
+#endif

You accidentally (?) removed the stub for node_nr_running_init.
Also sched.h used # define inside ifdefs.
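
For reference, the hunk from the patch I posted earlier had both stubs:

#ifdef CONFIG_NUMA
extern void sched_balance_exec(void);
extern void node_nr_running_init(void);
#else
# define sched_balance_exec()		do { } while (0)
# define node_nr_running_init()		do { } while (0)
#endif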

+#ifdef CONFIG_NUMA
+ atomic_t * node_ptr;

The name is still a bit non-descriptive and the * placed wrong :)
What about atomic_t *nr_running_at_node?



+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+ {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+

I think my two comments here were pretty useful :)

+static inline void nr_running_dec(runqueue_t *rq)
+{
+ atomic_dec(rq->node_ptr);
+ rq->nr_running--;
+}
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++)
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+}
+#else
+# define nr_running_init(rq) do { } while (0)
+# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
/*
@@ -689,7 +811,7 @@ static inline runqueue_t *find_busiest_q
busiest = NULL;
max_load = 1;
for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
+ if (!cpu_online(i) || !((1UL << i) & cpumask) )

spurious whitespace before the closing brace.

prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
+ int this_node = __cpu_to_node(this_cpu);
+ unsigned long cpumask = __node_to_cpu_mask(this_node);

If that's not too much style nitpicking: put this_node on one line with all the
other local ints and initialize all three vars after the declarations (like
in my patch *duck*)


+#if CONFIG_NUMA
+ rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */

I had a nr_running_init() abstraction for this, but you only took it
partially. It would be nice to merge the last bit to get rid of this ifdef.

Else the patch looks really, really good and I'm looking forward to see
it in mainline real soon!

2003-01-14 18:51:54

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

On Tue, 2003-01-14 at 08:43, Erich Focht wrote:
> Aargh, I should have gone home earlier...
> For those who really care about patch 05, it's attached. It's all
> untested as I don't have an ia32 NUMA machine running 2.5.58...

One more minor problem - the first two patches are missing the
following defines, and result in compile issues:

#define MAX_INTERNODE_LB 40
#define MIN_INTERNODE_LB 4
#define NODE_BALANCE_RATIO 10

Looking through previous patches, and the 05 patch, I found
these defines and put them under the #if CONFIG_NUMA in sched.c
that defines node_nr_running and friends.
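
So that section of sched.c now looks roughly like this (just showing where
the defines landed; their exact position inside the block shouldn't matter):

#if CONFIG_NUMA
static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
        {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};

#define MAX_INTERNODE_LB 40
#define MIN_INTERNODE_LB 4
#define NODE_BALANCE_RATIO 10

/* ... nr_running_init()/inc()/dec() as before ... */
#endif /* CONFIG_NUMA */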

With these three lines added, I have a kernel built and booted
using the first numa-sched and numa-sched-add patches.

Test results will follow later in the day.

Michael


>
> Erich
>
>
> On Tuesday 14 January 2003 17:23, Erich Focht wrote:
> > In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
> > and this was present in the numa-sched-2.5.58.patch and
> > numa-sched-add-2.5.58.patch, too. Please use the patches attached to
> > this email! Sorry for the silly mistake...
> >
> > Christoph, I used your way of coding nr_running_inc/dec now.
> >
> > Regards,
> > Erich

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-14 21:45:58

by Michael Hohnbaum

Subject: Re: [Lse-tech] Re: [PATCH 2.5.58] new NUMA scheduler: fix

On Tue, 2003-01-14 at 11:02, Michael Hohnbaum wrote:
> On Tue, 2003-01-14 at 08:43, Erich Focht wrote:
> > Aargh, I should have gone home earlier...
> > For those who really care about patch 05, it's attached. It's all
> > untested as I don't have an ia32 NUMA machine running 2.5.58...
>
> One more minor problem - the first two patches are missing the
> following defines, and result in compile issues:
>
> #define MAX_INTERNODE_LB 40
> #define MIN_INTERNODE_LB 4
> #define NODE_BALANCE_RATIO 10
>
> Looking through previous patches, and the 05 patch, I found
> these defines and put them under the #if CONFIG_NUMA in sched.c
> that defines node_nr_running and friends.
>
> With these three lines added, I have a kernel built and booted
> using the first numa-sched and numa-sched-add patches.
>
> Test results will follow later in the day.

Trying to apply the 05 patch, I discovered that it was already
in there. Something is messed up with the combined patches, so
I went back to the tgz file you provided and started over. I'm
not sure what the kernel is that I built and tested earlier today,
but I suspect it was, for the most part, the complete patchset
(i.e., patches 1-5). Building a kernel with patches 1-4 from
the tgz file does not need the additional defines mentioned in
my previous email.

Testing is starting from scratch with a known patch base. The
plan is to test with patches 1-4, then add in 05. I should have
some numbers for you before the end of my day. btw, the numbers
looked real good for the runs on whatever kernel it was that I
built this morning.
>
> Michael
>
> > Erich
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-14 23:55:26

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

On Tuesday 14 January 2003 16:55, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.
>
> The whole patch is really tiny: 9.5k. This time I attached the numa
> scheduler in form of two patches:

Ran tests on three different kernels:

stock58 - linux 2.5.58 with the cputime_stats patch
sched1-4-58 - stock58 with the first 4 NUMA scheduler patches
sched1-5-58 - stock58 with all 5 NUMA scheduler patches

Kernbench:
Elapsed User System CPU
stock58 31.656s 305.85s 89.232s 1248.2%
sched1-4-58 29.886s 287.506s 82.94s 1239%
sched1-5-58 29.994s 288.796s 84.706s 1245%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
stock58 27.73 42.80 110.96 0.85
sched1-4-58 32.86 46.41 131.47 0.85
sched1-5-58 32.37 45.34 129.52 0.89

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
stock58 45.97 61.87 367.81 2.11
sched1-4-58 31.39 49.18 251.22 2.15
sched1-5-58 37.52 61.32 300.22 2.06

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
stock58 60.91 83.63 974.71 6.18
sched1-4-58 54.31 62.11 869.11 3.84
sched1-5-58 51.60 59.05 825.72 4.74

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
stock58 84.26 195.16 2696.65 16.53
sched1-4-58 61.49 140.51 1968.06 9.57
sched1-5-58 55.23 117.32 1767.71 7.78

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
stock58 123.27 511.77 7889.77 27.78
sched1-4-58 63.39 266.40 4057.92 20.55
sched1-5-58 59.57 250.25 3813.39 17.05

One anomaly noted was that the kernbench system time went up
about 5% with the 2.5.58 kernel from what it was on the last
version I tested with (2.5.55). This increase is in both stock
and with the NUMA scheduler, so is not caused by the NUMA scheduler.

Now that I've got baselines for these, I plan to next look at
tweaking various parameters within the scheduler and see what
happens. Also, I owe Erich numbers running hackbench. Overall, I am
pleased with these results.

And just for grins, here are the detailed results for running
numa_test 32

sched1-4-58:

Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 9.044
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 100.0 0.0 0.0 | 1 1 | 55.04
2 0.0 0.0 4.8 95.2 | 2 3 *| 70.38
3 0.0 0.0 3.2 96.8 | 2 3 *| 72.72
4 0.0 26.4 2.8 70.7 | 2 3 *| 72.99
5 100.0 0.0 0.0 0.0 | 0 0 | 57.03
6 100.0 0.0 0.0 0.0 | 0 0 | 55.06
7 100.0 0.0 0.0 0.0 | 0 0 | 57.18
8 0.0 100.0 0.0 0.0 | 1 1 | 55.38
9 100.0 0.0 0.0 0.0 | 0 0 | 54.37
10 0.0 100.0 0.0 0.0 | 1 1 | 56.06
11 0.0 13.2 0.0 86.8 | 3 3 | 64.33
12 0.0 0.0 0.0 100.0 | 3 3 | 62.35
13 1.7 0.0 98.3 0.0 | 2 2 | 67.47
14 100.0 0.0 0.0 0.0 | 0 0 | 55.94
15 0.0 29.4 61.9 8.6 | 3 2 *| 78.76
16 0.0 100.0 0.0 0.0 | 1 1 | 56.42
17 18.9 0.0 74.9 6.2 | 3 2 *| 70.57
18 0.0 0.0 100.0 0.0 | 2 2 | 63.01
19 0.0 100.0 0.0 0.0 | 1 1 | 55.97
20 0.0 0.0 92.7 7.3 | 3 2 *| 65.62
21 0.0 0.0 100.0 0.0 | 2 2 | 62.70
22 0.0 100.0 0.0 0.0 | 1 1 | 55.53
23 0.0 1.5 0.0 98.5 | 3 3 | 56.95
24 0.0 100.0 0.0 0.0 | 1 1 | 55.75
25 0.0 30.0 2.3 67.7 | 2 3 *| 77.78
26 0.0 0.0 0.0 100.0 | 3 3 | 57.71
27 13.6 0.0 86.4 0.0 | 0 2 *| 66.55
28 100.0 0.0 0.0 0.0 | 0 0 | 55.43
29 0.0 100.0 0.0 0.0 | 1 1 | 56.12
30 19.8 0.0 62.5 17.6 | 3 2 *| 66.92
31 100.0 0.0 0.0 0.0 | 0 0 | 54.90
32 100.0 0.0 0.0 0.0 | 0 0 | 54.70
AverageUserTime 61.49 seconds
ElapsedTime 140.51
TotalUserTime 1968.06
TotalSysTime 9.57

sched1-5-58:

Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 9.145
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 54.88
2 0.0 100.0 0.0 0.0 | 1 1 | 54.08
3 0.0 0.0 0.0 100.0 | 3 3 | 55.48
4 0.0 0.0 0.0 100.0 | 3 3 | 55.47
5 100.0 0.0 0.0 0.0 | 0 0 | 53.84
6 100.0 0.0 0.0 0.0 | 0 0 | 53.37
7 0.0 0.0 0.0 100.0 | 3 3 | 55.41
8 90.9 9.1 0.0 0.0 | 1 0 *| 55.58
9 0.0 100.0 0.0 0.0 | 1 1 | 55.61
10 0.0 100.0 0.0 0.0 | 1 1 | 54.56
11 0.0 0.0 98.1 1.9 | 2 2 | 56.25
12 0.0 0.0 0.0 100.0 | 3 3 | 55.07
13 0.0 0.0 0.0 100.0 | 3 3 | 54.92
14 0.0 100.0 0.0 0.0 | 1 1 | 54.59
15 100.0 0.0 0.0 0.0 | 0 0 | 55.10
16 5.0 0.0 95.0 0.0 | 2 2 | 56.97
17 0.0 0.0 100.0 0.0 | 2 2 | 55.51
18 100.0 0.0 0.0 0.0 | 0 0 | 53.97
19 0.0 4.7 95.3 0.0 | 2 2 | 57.21
20 0.0 0.0 100.0 0.0 | 2 2 | 55.53
21 0.0 0.0 100.0 0.0 | 2 2 | 56.46
22 0.0 0.0 100.0 0.0 | 2 2 | 55.48
23 0.0 0.0 0.0 100.0 | 3 3 | 55.99
24 0.0 100.0 0.0 0.0 | 1 1 | 55.32
25 0.0 6.2 93.8 0.0 | 2 2 | 57.66
26 0.0 0.0 0.0 100.0 | 3 3 | 55.60
27 0.0 100.0 0.0 0.0 | 1 1 | 54.65
28 0.0 0.0 0.0 100.0 | 3 3 | 56.39
29 0.0 100.0 0.0 0.0 | 1 1 | 54.91
30 100.0 0.0 0.0 0.0 | 0 0 | 53.58
31 100.0 0.0 0.0 0.0 | 0 0 | 53.53
32 31.5 68.4 0.0 0.0 | 0 1 *| 54.41
AverageUserTime 55.23 seconds
ElapsedTime 117.32
TotalUserTime 1767.71
TotalSysTime 7.78

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-15 07:39:32

by Martin J. Bligh

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

OK, ran some tests on incremental versions of the stack.
newsched3 = patches 1+2+3 ... etc.
oldsched = Erich's old code + Michael's ilb.
newsched4-tune = patches 1,2,3,4 + tuning patch below:

Tuning seems to help quite a bit ... need to stick this into arch topo code.
Sleep time now ;-)

Kernbench:
Elapsed User System CPU
2.5.58-mjb1 19.522s 186.566s 41.516s 1167.8%
2.5.58-mjb1-oldsched 19.488s 186.73s 42.382s 1175.6%
2.5.58-mjb1-newsched2 19.286s 186.418s 40.998s 1178.8%
2.5.58-mjb1-newsched3 19.58s 187.658s 43.694s 1181.2%
2.5.58-mjb1-newsched4 19.266s 187.772s 42.984s 1197.4%
2.5.58-mjb1-newsched4-tune 19.424s 186.664s 41.422s 1173.6%
2.5.58-mjb1-newsched5 19.462s 187.692s 43.02s 1185%

NUMA schedbench 4:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1
2.5.58-mjb1-oldsched 0.00 35.16 88.55 0.68
2.5.58-mjb1-newsched2 0.00 19.12 63.71 0.48
2.5.58-mjb1-newsched3 0.00 35.73 88.26 0.58
2.5.58-mjb1-newsched4 0.00 35.64 88.46 0.60
2.5.58-mjb1-newsched4-tune 0.00 37.10 91.99 0.58
2.5.58-mjb1-newsched5 0.00 35.34 88.60 0.64

NUMA schedbench 8:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 35.34 88.60 0.64
2.5.58-mjb1-oldsched 0.00 64.01 338.77 1.50
2.5.58-mjb1-newsched2 0.00 31.56 227.72 1.03
2.5.58-mjb1-newsched3 0.00 35.44 220.63 1.36
2.5.58-mjb1-newsched4 0.00 35.47 223.86 1.33
2.5.58-mjb1-newsched4-tune 0.00 37.04 232.92 1.14
2.5.58-mjb1-newsched5 0.00 36.11 223.14 1.39

NUMA schedbench 16:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 36.11 223.14 1.39
2.5.58-mjb1-oldsched 0.00 62.60 834.67 4.85
2.5.58-mjb1-newsched2 0.00 57.24 850.12 2.64
2.5.58-mjb1-newsched3 0.00 64.15 870.25 3.18
2.5.58-mjb1-newsched4 0.00 64.01 875.17 3.10
2.5.58-mjb1-newsched4-tune 0.00 57.84 841.48 2.96
2.5.58-mjb1-newsched5 0.00 61.87 828.37 3.47

NUMA schedbench 32:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 61.87 828.37 3.47
2.5.58-mjb1-oldsched 0.00 154.30 2031.93 9.35
2.5.58-mjb1-newsched2 0.00 117.75 1798.53 5.52
2.5.58-mjb1-newsched3 0.00 122.87 1771.71 8.33
2.5.58-mjb1-newsched4 0.00 134.86 1863.51 8.27
2.5.58-mjb1-newsched4-tune 0.00 118.18 1809.38 6.58
2.5.58-mjb1-newsched5 0.00 134.36 1853.94 8.33

NUMA schedbench 64:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 134.36 1853.94 8.33
2.5.58-mjb1-oldsched 0.00 318.68 4852.81 21.47
2.5.58-mjb1-newsched2 0.00 241.11 3603.29 12.70
2.5.58-mjb1-newsched3 0.00 258.72 3977.50 16.88
2.5.58-mjb1-newsched4 0.00 252.87 3850.55 18.51
2.5.58-mjb1-newsched4-tune 0.00 235.43 3627.28 15.90
2.5.58-mjb1-newsched5 0.00 265.09 3967.70 18.81

--- sched.c.premjb4 2003-01-14 22:12:36.000000000 -0800
+++ sched.c 2003-01-14 22:20:19.000000000 -0800
@@ -85,7 +85,7 @@
#define NODE_THRESHOLD 125
#define MAX_INTERNODE_LB 40
#define MIN_INTERNODE_LB 4
-#define NODE_BALANCE_RATIO 10
+#define NODE_BALANCE_RATIO 250

/*
* If a task is 'interactive' then we reinsert it in the active
@@ -763,7 +763,8 @@
+ atomic_read(&node_nr_running[i]);
this_rq()->prev_node_load[i] = load;
if (load > maxload &&
- (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
+ (100*load > (NODE_THRESHOLD*this_load))
+ && load > this_load + 4) {
maxload = load;
node = i;
}


2003-01-15 15:01:54

by Erich Focht

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

Thanks for the patience with the problems in yesterday's patches. I
resend the patches in the same form. Made following changes:

- moved NODE_BALANCE_{RATE,MIN,MAX} to topology.h
- removed the divide in the find_busiest_node() loop (thanks, Martin)
- removed the modulo (%) in the cross-node balancing trigger
- re-added node_nr_running_init() stub, nr_running_init() and comments
from Christoph
- removed the constant factor 4 in find_busiest_node. The
find_busiest_queue routine will take care of the case where the
busiest_node is running only a few processes (at most one per CPU) and
return busiest=NULL.

I hope we can start tuning the parameters now. In the basic NUMA
scheduler part these are:
NODE_THRESHOLD : minimum percentage of node overload to
trigger cross-node balancing
NODE_BALANCE_RATE : arch specific, cross-node balancing is called
after this many intra-node load balance calls

In the extended NUMA scheduler the fixed value of NODE_BALANCE_RATE is
replaced by a variable rate set to :
NODE_BALANCE_MIN : if own node load is less than avg_load/2
NODE_BALANCE_MAX : if load is larger than avg_load/2
Together with the reduced number of steals across nodes this might
help us in achieving equal load among nodes. I'm not aware of any
simple benchmark which can demonstrate this...
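
For completeness, the variable interval selection in patch 05 boils down
to something like this (again from memory; the runqueue field name may
differ from the attached patch):

static void set_node_balance_rate(runqueue_t *rq, int this_node)
{
        int i, avg_load = 0;

        for (i = 0; i < numnodes; i++)
                avg_load += atomic_read(&node_nr_running[i]);
        avg_load /= numnodes;

        if (atomic_read(&node_nr_running[this_node]) < avg_load / 2)
                rq->node_balance_rate = NODE_BALANCE_MIN;  /* underloaded: check sooner */
        else
                rq->node_balance_rate = NODE_BALANCE_MAX;  /* otherwise: back off */
}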

Regards,
Erich


Attachments:
numa-sched-2.5.58.patch (10.24 kB)
numa-sched-add-2.5.58.patch (5.18 kB)
numa-sched-patches.tgz (4.62 kB)

2003-01-16 00:04:00

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

On Wed, 2003-01-15 at 07:10, Erich Focht wrote:
> Thanks for the patience with the problems in yesterday's patches. I
> resend the patches in the same form. Made following changes:
>
> - moved NODE_BALANCE_{RATE,MIN,MAX} to topology.h
> - removed the divide in the find_busiest_node() loop (thanks, Martin)
> - removed the modulo (%) in the cross-node balancing trigger
> - re-added node_nr_running_init() stub, nr_running_init() and comments
> from Christoph
> - removed the constant factor 4 in find_busiest_node. The
> find_busiest_queue routine will take care of the case where the
> busiest_node is running only a few processes (at most one per CPU) and
> return busiest=NULL.

These applied clean and looked good. I ran numbers for kernels
with patches 1-4 and patches 1-5. Results below.
>
> I hope we can start tuning the parameters now. In the basic NUMA
> scheduler part these are:
> NODE_THRESHOLD : minimum percentage of node overload to
> trigger cross-node balancing
> NODE_BALANCE_RATE : arch specific, cross-node balancing is called
> after this many intra-node load balance calls

I need to spend some time staring at this code and putting together
a series of tests to try. I think the basic mechanism looks good, so
hopefully we are down to finding the best numbers for the architecture.
>
> In the extended NUMA scheduler the fixed value of NODE_BALANCE_RATE is
> replaced by a variable rate set to :
> NODE_BALANCE_MIN : if own node load is less than avg_load/2
> NODE_BALANCE_MAX : if load is larger than avg_load/2
> Together with the reduced number of steals across nodes this might
> help us in achieving equal load among nodes. I'm not aware of any
> simple benchmark which can demonstrate this...
>
> Regards,
> Erich
> ----
>
results:

stock58h: linux 2.5.58 with cputime stats patch
sched14r2-58: stock58h with latest NUMA sched patches 1 through 4
sched15r2-58: stock58h with latest NUMA sched patches 1 through 5
sched1-4-58: stock58h with previous NUMA sched patches 1 through 4

Kernbench:
Elapsed User System CPU
sched14r2-58 29.488s 284.02s 82.132s 1241.8%
sched15r2-58 29.778s 282.792s 83.478s 1229.8%
stock58h 31.656s 305.85s 89.232s 1248.2%
sched1-4-58 29.886s 287.506s 82.94s 1239%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 22.50 37.20 90.03 0.94
sched15r2-58 16.63 24.23 66.56 0.69
stock58h 27.73 42.80 110.96 0.85
sched1-4-58 32.86 46.41 131.47 0.85

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 30.27 43.22 242.23 1.75
sched15r2-58 30.90 42.46 247.28 1.48
stock58h 45.97 61.87 367.81 2.11
sched1-4-58 31.39 49.18 251.22 2.15

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 52.78 57.28 844.61 3.70
sched15r2-58 48.44 65.31 775.25 3.30
stock58h 60.91 83.63 974.71 6.18
sched1-4-58 54.31 62.11 869.11 3.84

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 56.60 116.99 1811.56 5.94
sched15r2-58 56.42 116.75 1805.82 6.45
stock58h 84.26 195.16 2696.65 16.53
sched1-4-58 61.49 140.51 1968.06 9.57

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 56.48 232.63 3615.55 16.02
sched15r2-58 56.38 236.25 3609.03 15.41
stock58h 123.27 511.77 7889.77 27.78
sched1-4-58 63.39 266.40 4057.92 20.55


--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-16 05:57:28

by Martin J. Bligh

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

I'm not keen on the way the minisched patch got reformatted. I changed
it into a separate function, which I think is much cleaner by the time
you've added the third patch - no #ifdef CONFIG_NUMA in load_balance.

Rejigged patches attached, no functional changes.

Anyway, I perf tested this, and it comes out more or less the same as
the tuned version I was poking at last night (ie best of the bunch).
Looks pretty good to me.

M.

PS. The fourth patch was so small, and touching the same stuff as 3
that I rolled it into the third one here. Seems like a universal
benefit ;-)


Attachments:
numasched1 (1.61 kB)
numasched2 (5.77 kB)
numasched3 (4.51 kB)

2003-01-16 16:38:11

by Erich Focht

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

Hi Martin and Michael,

thanks for testing again!

On Thursday 16 January 2003 07:05, Martin J. Bligh wrote:
> I'm not keen on the way the minisched patch got reformatted. I changed
> it into a separate function, which I think is much cleaner by the time
> you've added the third patch - no #ifdef CONFIG_NUMA in load_balance.

Fine. This form is also nearer to the codingstyle rule: "functions
should do only one thing" (I'm reading those more carefully now ;-)

> Anyway, I perf tested this, and it comes out more or less the same as
> the tuned version I was poking at last night (ie best of the bunch).
> Looks pretty good to me.

Great!

> PS. The fourth patch was so small, and touching the same stuff as 3
> that I rolled it into the third one here. Seems like a universal
> benefit ;-)

Yes, it's a much smaller step than patch #5. It would make sense to
have this included right from the start.

Regards,
Erich

2003-01-16 17:58:34

by Robert Love

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

On Thu, 2003-01-16 at 11:47, Erich Focht wrote:

> Fine. This form is also nearer to the codingstyle rule: "functions
> should do only one thing" (I'm reading those more carefully now ;-)

Good ;)

This is looking good. Thanks hch for going over it with your fine tooth
comb.

Erich and Martin, what more needs to be done prior to inclusion? Do you
still want an exec balancer in place?

Robert Love

2003-01-16 18:49:18

by Martin J. Bligh

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

> On Thu, 2003-01-16 at 11:47, Erich Focht wrote:
>
>> Fine. This form is also nearer to the codingstyle rule: "functions
>> should do only one thing" (I'm reading those more carefully now ;-)
>
> Good ;)
>
> This is looking good. Thanks hch for going over it with your fine tooth
> comb.
>
> Erich and Martin, what more needs to be done prior to inclusion? Do you
> still want an exec balancer in place?

Yup, that's in patch 2 already. I just pushed it ... will see what
happens.

M.

2003-01-16 18:59:03

by Martin J. Bligh

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

> well, it needs to settle down a bit more, we are technically in a
> codefreeze :-)

The current codeset is *completely* non-invasive to non-NUMA systems.
It can't break anything.

M.

2003-01-16 19:01:25

by Christoph Hellwig

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

On Thu, Jan 16, 2003 at 08:07:09PM +0100, Ingo Molnar wrote:
> well, it needs to settle down a bit more, we are technically in a
> codefreeze :-)

We're in feature freeze. Not sure whether fixing the scheduler for
one type of hardware supported by Linux is a feature 8)

Anyway, patch 1 should certainly be merged ASAP; for the others we can wait
a bit more for things to settle, but I don't think it's really worth the wait.

2003-01-16 18:54:04

by Ingo Molnar

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix


On 16 Jan 2003, Robert Love wrote:

> > Fine. This form is also nearer to the codingstyle rule: "functions
> > should do only one thing" (I'm reading those more carefully now ;-)
>
> Good ;)
>
> This is looking good. Thanks hch for going over it with your fine tooth
> comb.
>
> Erich and Martin, what more needs to be done prior to inclusion? Do you
> still want an exec balancer in place?

well, it needs to settle down a bit more, we are technically in a
codefreeze :-)

Ingo

2003-01-16 19:29:40

by Ingo Molnar

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix


On Thu, 16 Jan 2003, Christoph Hellwig wrote:

> > well, it needs to settle down a bit more, we are technically in a
> > codefreeze :-)
>
> We're in feature freeze. Not sure whether fixing the scheduler for one
> type of hardware supported by Linux is a feature 8)
>
> Anyway, patch 1 should certainly be merged ASAP; for the others we can wait
> a bit more for things to settle, but I don't think it's really worth the wait.

agreed, the patch is unintrusive, but by settling down i mean things like
this:

+/* XXX(hch): this should go into a struct sched_node_data */

should be decided one way or another.

i'm also not quite happy about the conceptual background of
rq->nr_balanced. This load-balancing rate-limit is arbitrary and not
neutral at all. The way this should be done is to move the inter-node
balancing conditional out of load_balance(), and only trigger it from the
timer interrupt, with a given rate. On basically all NUMA hardware i
suspect it's much better to do inter-node balancing only on a very slow
scale. Making it dependent on an arbitrary portion of the idle-CPU
rebalancing act makes the frequency of inter-node rebalancing almost
arbitrarily high.

ie. there are two basic types of rebalancing acts in multiprocessor
environments: 'synchronous balancing' and 'asynchronous balancing'.
Synchronous balancing is done whenever a CPU runs idle - this can happen
at a very high rate, so it needs to be low overhead and unintrusive. This
was already so when i did the SMP balancer. The asynchronous balancing
component (currently directly triggered from every CPU's own timer
interrupt), has a fixed frequency, and thus can be almost arbitrarily
complex. It's the one that is aware of the global scheduling picture. For
NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
and an inter-node frequency - configured by the architecture and roughly
in the same proportion to each other as cachemiss latencies.

(this all means that unless there's empirical data showing the opposite,
->nr_balanced can be removed completely, and fixed frequency balancing can
be done from the timer tick. This should further simplify the patch.)

Ingo

2003-01-16 19:36:22

by John Bradford

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

> > > well, it needs to settle down a bit more, we are technically in a
> > > codefreeze :-)
> >
> > We're in feature freeze. Not sure whether fixing the scheduler for one
> > type of hardware supported by Linux is a feature 8)

Yes, we are definitely _not_ in a code freeze yet, and I doubt that we
will be for at least a few months.

John.

2003-01-16 19:41:53

by Martin J. Bligh

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

> complex. It's the one that is aware of the global scheduling picture. For
> NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
> and an inter-node frequency - configured by the architecture and roughly
> in the same proportion to each other as cachemiss latencies.

That's exactly what's in the latest set of patches - admittedly it's a
multiplier of when we run load_balance, not the tick multiplier, but
that's very easy to fix. Can you check out the stuff I posted last night?
I think it's somewhat cleaner ...

M.

2003-01-16 20:06:19

by Ingo Molnar

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix


On Thu, 16 Jan 2003, Martin J. Bligh wrote:

> > complex. It's the one that is aware of the global scheduling picture. For
> > NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
> > and an inter-node frequency - configured by the architecture and roughly
> > in the same proportion to each other as cachemiss latencies.
>
> That's exactly what's in the latest set of patches - admittedly it's a
> multiplier of when we run load_balance, not the tick multiplier, but
> that's very easy to fix. Can you check out the stuff I posted last
> night? I think it's somewhat cleaner ...

yes, i saw it, it has the same tying between idle-CPU-rebalance and
inter-node rebalance, as Erich's patch. You've put it into
cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
coupled balancing act. There are two synchronous balancing acts currently:
the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
must involve any 'heavy' balancing, only local balancing. The inter-node
balancing (which is heavier than even the global SMP balancer), should
never be triggered from the high-frequency path. [whether it's high
frequency or not depends on the actual workload, but it can be potentially
_very_ high frequency, easily on the order of 1 million times a second -
then you'll call the inter-node balancer 100K times a second.]

I'd strongly suggest to decouple the heavy NUMA load-balancing code from
the fastpath and re-check the benchmark numbers.

Ingo

(*) whether sched_balance_exec() is a high-frequency path or not is up to
debate. Right now it's not possible to get much more than a couple of
thousand exec()'s per second on fast CPUs. Hopefully that will change in
the future though, so exec() events could become really fast. So i'd
suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
only do NUMA cross-node balancing with a fixed frequency, from the timer
tick. But exec()-time is really special, since the user task usually has
zero cached state at this point, so we _can_ do cheap cross-node balancing
as well. So it's a boundary thing - probably doing the full-blown
balancing is the right thing.

2003-01-16 20:21:53

by Rick Lindsley

Subject: Re: [Lse-tech] Re: [PATCH 2.5.58] new NUMA scheduler: fix

[whether it's high
frequency or not depends on the actual workload, but it can be potentially
_very_ high frequency, easily on the order of 1 million times a second -
then you'll call the inter-node balancer 100K times a second.]

If this is due to thread creation/death, though, you might want this level
of inter-node balancing (or at least checking). It could represent a lot
of fork/execs that are now overloading one or more nodes. Is it reasonable
to expect this sort of load on a relatively proc/thread-stable machine?

Rick

2003-01-16 23:40:15

by Michael Hohnbaum

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

On Thu, 2003-01-16 at 12:19, Ingo Molnar wrote:
>
> Ingo
>
> (*) whether sched_balance_exec() is a high-frequency path or not is up to
> debate. Right now it's not possible to get much more than a couple of
> thousand exec()'s per second on fast CPUs. Hopefully that will change in
> the future though, so exec() events could become really fast. So i'd
> suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
> only do NUMA cross-node balancing with a fixed frequency, from the timer
> tick. But exec()-time is really special, since the user task usually has
> zero cached state at this point, so we _can_ do cheap cross-node balancing
> as well. So it's a boundary thing - probably doing the full-blown
> balancing is the right thing.
>
The reason for doing load balancing on every exec is that, as you say,
it is cheap to do the balancing at this point - no cache state, minimal
memory allocations. If we did not balance at this point and relied on
balancing from the timer tick, there would be much more movement of
established processes between nodes, which is expensive. Ideally, the
exec balancing is good enough so that on a well functioning system there
is no inter-node balancing taking place at the timer tick. Our testing
has shown that the exec load balance code does a very good job of
spreading processes across nodes, and thus very little timer tick
balancing occurs. The timer tick internode balancing is there as a safety
valve for those cases where exec balancing is not adequate. Workloads
with long running processes, and workloads with processes that do lots
of forks but not execs, are examples.

An earlier version of Erich's initial load balancer provided for
the option of balancing at either fork or exec, and the capability
of selecting on a per-process basis which to use. That could be
used if workloads that do forks with no execs become a problem on
NUMA boxes.

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-16 23:43:55

by Martin J. Bligh

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing. The inter-node

If I understand that correctly (and I'm not sure I do), you're saying
you don't think the exec time balance should go global? That would
break most of the concept ... *something* has to distribute stuff around
nodes, and the exec point is the cheapest time to do that (least "weight"
to move). I'd like to base it off cached load-averages, rather than going
sniffing every runqueue, but if you mean we should only exec-time
balance inside a node, I disagree. Am I just misunderstanding you?

At the moment, the high-freq balancer is only inside a node. Exec balancing
is global, and the "low-frequency" balancer is global. WRT the idle-time
balancing, I agree with what I *think* you're saying ... this shouldn't
clock up the rq->nr_balanced counter ... this encourages too much cross-node
stealing. I'll hack that change out and see what it does to the numbers.

Would appreciate more feedback on the first paragraph. Thanks,

M.



2003-01-17 07:10:38

by Ingo Molnar

Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix


On Thu, 16 Jan 2003, Martin J. Bligh wrote:

> If I understand that correctly (and I'm not sure I do), you're saying
> you don't think the exec time balance should go global? That would break
> most of the concept ... *something* has to distribute stuff around
> nodes, and the exec point is the cheapest time to do that (least
> "weight" to move. [...]

the exec()-time balancing is special, since it only moves the task in
question - so the 'push' should indeed be a global decision. _But_, exec()
is also a natural balancing point for the local node (we potentially just
got rid of a task, which might create imbalance within the node), so it
might make sense to do a 'local' balancing run as well, if the exec()-ing
task was indeed pushed to another node.

> At the moment, the high-freq balancer is only inside a node. Exec
> balancing is global, and the "low-frequency" balancer is global. WRT the
> idle-time balancing, I agree with what I *think* you're saying ... this
> shouldn't clock up the rq->nr_balanced counter ... this encourages too
> much cross-node stealing. I'll hack that change out and see what it does
> to the numbers.

yes, this should also further unify the SMP and NUMA balancing code.

Ingo

2003-01-17 08:39:10

by Ingo Molnar

Subject: [patch] sched-2.5.59-A2


Martin, Erich,

could you give the attached patch a go, it implements my
cleanup/reorganization suggestions ontop of 2.5.59:

- decouple the 'slow' rebalancer from the 'fast' rebalancer and attach it
to the scheduler tick. Remove rq->nr_balanced.

- do idle-rebalancing every 1 msec, intra-node rebalancing every 200
msecs and inter-node rebalancing every 400 msecs.

- move the tick-based rebalancing logic into rebalance_tick(), it looks
more organized this way and we have all related code in one spot.

- clean up the topology.h include file structure. Since generic kernel
code uses all the defines already, there's no reason to keep them in
asm-generic.h. I've created a linux/topology.h file that includes
asm/topology.h and takes care of the default and generic definitions.
Moved over a generic topology definition from mmzone.h.

- renamed rq->prev_nr_running[] to rq->prev_cpu_load[] - this further
unifies the SMP and NUMA balancers and is more in line with the
prev_node_load name.

If performance drops due to this patch then the benchmarking goal would be
to tune the following frequencies:

#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
#define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)

In theory NODE_REBALANCE_TICK could be defined by the NUMA platform,
although in the past such per-platform scheduler tunings used to end
'orphaned' after some time. 400 msecs is pretty conservative at the
moment, it could be made more frequent if benchmark results support it.
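
(if we ever wanted the per-platform route, the override could be as
simple as the sketch below - just an illustration, not part of the
attached patch:)

#ifndef NODE_REBALANCE_TICK	/* an arch's asm/topology.h may pre-define it */
# define NODE_REBALANCE_TICK	(BUSY_REBALANCE_TICK * 2)
#endif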

the patch compiles and boots on UP and SMP, it compiles on x86-NUMA.

Ingo

--- linux/drivers/base/cpu.c.orig 2003-01-17 10:02:19.000000000 +0100
+++ linux/drivers/base/cpu.c 2003-01-17 10:02:25.000000000 +0100
@@ -6,8 +6,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int cpu_add_device(struct device * dev)
--- linux/drivers/base/node.c.orig 2003-01-17 10:02:50.000000000 +0100
+++ linux/drivers/base/node.c 2003-01-17 10:03:03.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int node_add_device(struct device * dev)
--- linux/drivers/base/memblk.c.orig 2003-01-17 10:02:33.000000000 +0100
+++ linux/drivers/base/memblk.c 2003-01-17 10:02:38.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/memblk.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int memblk_add_device(struct device * dev)
--- linux/include/asm-generic/topology.h.orig 2003-01-17 09:49:38.000000000 +0100
+++ linux/include/asm-generic/topology.h 2003-01-17 10:02:08.000000000 +0100
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <[email protected]>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
- in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu) (0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk) (0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node) (0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node) (0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node) (0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
--- linux/include/asm-ppc64/topology.h.orig 2003-01-17 09:54:46.000000000 +0100
+++ linux/include/asm-ppc64/topology.h 2003-01-17 09:55:18.000000000 +0100
@@ -46,18 +46,6 @@
return mask;
}

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu) (0)
-#define __memblk_to_node(memblk) (0)
-#define __parent_node(nid) (0)
-#define __node_to_first_cpu(node) (0)
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#define __node_to_memblk(node) (0)
-
#endif /* CONFIG_NUMA */

#endif /* _ASM_PPC64_TOPOLOGY_H */
--- linux/include/linux/topology.h.orig 2003-01-17 09:57:20.000000000 +0100
+++ linux/include/linux/topology.h 2003-01-17 10:09:38.000000000 +0100
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu) (0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk) (0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node) (0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node) (0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) (cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node) (cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node) (0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) __node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
--- linux/include/linux/mmzone.h.orig 2003-01-17 09:58:20.000000000 +0100
+++ linux/include/linux/mmzone.h 2003-01-17 10:01:17.000000000 +0100
@@ -255,9 +255,7 @@
#define MAX_NR_MEMBLKS 1
#endif /* CONFIG_NUMA */

-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>

#ifndef CONFIG_DISCONTIGMEM
extern struct pglist_data contig_page_data;
--- linux/include/asm-ia64/topology.h.orig 2003-01-17 09:54:33.000000000 +0100
+++ linux/include/asm-ia64/topology.h 2003-01-17 09:54:38.000000000 +0100
@@ -60,7 +60,4 @@
*/
#define __node_to_memblk(node) (node)

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
#endif /* _ASM_IA64_TOPOLOGY_H */
--- linux/include/asm-i386/topology.h.orig 2003-01-17 09:55:28.000000000 +0100
+++ linux/include/asm-i386/topology.h 2003-01-17 09:56:27.000000000 +0100
@@ -61,17 +61,6 @@
/* Returns the number of the first MemBlk on Node 'node' */
#define __node_to_memblk(node) (node)

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
#endif /* CONFIG_NUMA */

#endif /* _ASM_I386_TOPOLOGY_H */
--- linux/include/asm-i386/cpu.h.orig 2003-01-17 10:03:22.000000000 +0100
+++ linux/include/asm-i386/cpu.h 2003-01-17 10:03:31.000000000 +0100
@@ -3,8 +3,8 @@

#include <linux/device.h>
#include <linux/cpu.h>
+#include <linux/topology.h>

-#include <asm/topology.h>
#include <asm/node.h>

struct i386_cpu {
--- linux/include/asm-i386/node.h.orig 2003-01-17 10:04:02.000000000 +0100
+++ linux/include/asm-i386/node.h 2003-01-17 10:04:08.000000000 +0100
@@ -4,8 +4,7 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>

struct i386_node {
struct node node;
--- linux/include/asm-i386/memblk.h.orig 2003-01-17 10:03:51.000000000 +0100
+++ linux/include/asm-i386/memblk.h 2003-01-17 10:03:56.000000000 +0100
@@ -4,8 +4,8 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/memblk.h>
+#include <linux/topology.h>

-#include <asm/topology.h>
#include <asm/node.h>

struct i386_memblk {
--- linux/kernel/sched.c.orig 2003-01-17 09:22:24.000000000 +0100
+++ linux/kernel/sched.c 2003-01-17 10:29:53.000000000 +0100
@@ -153,10 +153,9 @@
nr_uninterruptible;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
- int prev_nr_running[NR_CPUS];
+ int prev_cpu_load[NR_CPUS];
#ifdef CONFIG_NUMA
atomic_t *node_nr_running;
- unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
task_t *migration_thread;
@@ -765,29 +764,6 @@
return node;
}

-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- int this_node = __cpu_to_node(this_cpu);
- /*
- * Avoid rebalancing between nodes too often.
- * We rebalance globally once every NODE_BALANCE_RATE load balances.
- */
- if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
- int node = find_busiest_node(this_node);
- this_rq->nr_balanced = 0;
- if (node >= 0)
- return (__node_to_cpu_mask(node) | (1UL << this_cpu));
- }
- return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- return cpu_online_map;
-}
-
#endif /* CONFIG_NUMA */

#if CONFIG_SMP
@@ -807,10 +783,10 @@
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
/* Need to recalculate nr_running */
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
} else
spin_lock(&busiest->lock);
}
@@ -847,10 +823,10 @@
* that case we are less picky about moving a task across CPUs and
* take what can be taken.
*/
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];

busiest = NULL;
max_load = 1;
@@ -859,11 +835,11 @@
continue;

rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
load = rq_src->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
+ load = this_rq->prev_cpu_load[i];
+ this_rq->prev_cpu_load[i] = rq_src->nr_running;

if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
@@ -922,7 +898,7 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
{
int imbalance, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
@@ -930,8 +906,7 @@
struct list_head *head, *curr;
task_t *tmp;

- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
- cpus_to_balance(this_cpu, this_rq));
+ busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
goto out;

@@ -1006,21 +981,61 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)

-static inline void idle_tick(runqueue_t *rq)
+static void rebalance_tick(runqueue_t *this_rq, int idle)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ int this_cpu = smp_processor_id();
+ unsigned long j = jiffies, cpumask, this_cpumask = 1UL << this_cpu;
+
+ if (idle) {
+ if (!(j % IDLE_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 1, this_cpumask);
+ spin_unlock(&this_rq->lock);
+ }
return;
- spin_lock(&rq->lock);
- load_balance(rq, 1);
- spin_unlock(&rq->lock);
+ }
+ /*
+ * First do inter-node rebalancing, then intra-node rebalancing,
+ * if both events happen in the same tick. The inter-node
+ * rebalancing does not necessarily have to create a perfect
+ * balance within the node, since we load-balance the most loaded
+ * node with the current CPU. (ie. other CPUs in the local node
+ * are not balanced.)
+ */
+#if CONFIG_NUMA
+ if (!(j % NODE_REBALANCE_TICK)) {
+ int node = find_busiest_node(__cpu_to_node(this_cpu));
+ if (node >= 0) {
+ cpumask = __node_to_cpu_mask(node) | this_cpumask;
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 1, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
+ }
+#endif
+ if (!(j % BUSY_REBALANCE_TICK)) {
+ cpumask = __cpu_to_node_mask(this_cpu);
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 1, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
}
-
#endif

DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1078,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ rebalance_tick(rq, 1);
return;
}
if (TASK_NICE(p) > 0)
@@ -1121,11 +1134,8 @@
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
spin_unlock(&rq->lock);
+ rebalance_tick(rq, 0);
}

void scheduling_functions_start_here(void) { }
@@ -1184,7 +1194,7 @@
pick_next_task:
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
if (rq->nr_running)
goto pick_next_task;
#endif
--- linux/mm/page_alloc.c.orig 2003-01-17 10:01:29.000000000 +0100
+++ linux/mm/page_alloc.c 2003-01-17 10:01:35.000000000 +0100
@@ -28,8 +28,7 @@
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>

DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
--- linux/mm/vmscan.c.orig 2003-01-17 10:01:44.000000000 +0100
+++ linux/mm/vmscan.c 2003-01-17 10:01:52.000000000 +0100
@@ -27,10 +27,10 @@
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/rmap-locking.h>
+#include <linux/topology.h>

#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
-#include <asm/topology.h>
#include <asm/div64.h>

#include <linux/swapops.h>

2003-01-17 11:00:50

by Erich Focht

[permalink] [raw]
Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix

Hi Ingo,

On Thursday 16 January 2003 21:19, Ingo Molnar wrote:
> On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> > > complex. It's the one that is aware of the global scheduling picture.
> > > For NUMA i'd suggest two asynchronous frequencies: one intra-node
> > > frequency, and an inter-node frequency - configured by the architecture
> > > and roughly in the same proportion to each other as cachemiss
> > > latencies.
> >
> > That's exactly what's in the latest set of patches - admittedly it's a
> > multiplier of when we run load_balance, not the tick multiplier, but
> > that's very easy to fix. Can you check out the stuff I posted last
> > night? I think it's somewhat cleaner ...
>
> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing.


I prefer a single point of entry called load_balance() to multiple
functionally different balancers. The reason is that the latter choice
might lead to balancers competing with or working against each other.
Not now, but the design could invite such developments.

But the other main reason for calling the cross-node balancer after
NODE_BALANCE_RATE calls to the intra-node balancer (what you call
synchronous balancing) is performance:

Davide Libenzi showed quite a while ago that one benefits a lot if
idle CPUs stay idle only for a rather short time. IIRC, his conclusion
for the multi-queue scheduler was that something on the order of 10ms
is long enough: below that you start feeling the balancing overhead,
above it you waste useful cycles.

On a NUMA system this is even more important: the longer you leave
fresh tasks on an overloaded node, the more probable it is that they
allocate their memory there. They will then run with poor performance
on the node that stayed idle for 200-400ms before stealing them. So one
wastes 200-400ms on each CPU of the idle node and in the end gets tasks
that perform poorly anyway. If the tasks are "old", at least we didn't
waste too much time being idle. The long-term target should be that
tasks remember where their memory is and return to that node.

> The inter-node
> balancing (which is heavier than even the global SMP balancer), should
> never be triggered from the high-frequency path.

Hmmm, we made it really slim. Actually the cross-node balancing might
even be cheaper than the global SMP balancer:
- it first loops over the nodes (loop length 4 on a 16-CPU NUMA-Q & Azusa)
- then it loops over the cpumask of the most loaded node plus the
current CPU (loop length 5 on a NUMA-Q & Azusa),
i.e. 4 + 5 = 9 iterations in total. This has to be compared with the
loop length of 16 when doing the global SMP rebalance. The additional
work done for averaging is minimal. The more nodes, the cheaper the
NUMA cross-node balancing compared to the global SMP balancing.

Besides: the CPU is idle anyway! So who cares whether it just
unsuccessfully scans its own empty node or looks at the other nodes
from time to time? It does this locklessly and doesn't modify any
variables in other runqueues, so it doesn't create cache misses for
other CPUs.

> [whether it's high
> frequency or not depends on the actual workload, but it can be potentially
> _very_ high frequency, easily on the order of 1 million times a second -
> then you'll call the inter-node balancer 100K times a second.]

You mean because cpu_idle() loops over schedule()? The code is:
	while (1) {
		void (*idle)(void) = pm_idle;
		if (!idle)
			idle = default_idle;
		irq_stat[smp_processor_id()].idle_timestamp = jiffies;
		while (!need_resched())
			idle();
		schedule();
	}
So if the CPU is idle, it won't go through schedule() except when we
get an interrupt from time to time... and then it doesn't really
matter. Or do you want to keep idle CPUs free for serving interrupts?
That could be legitimate, but it is not the typical load I had in
mind, and it is an issue not related to the NUMA scheduler. But maybe
you have something else in mind that I haven't considered yet.

Under normal conditions the rebalancing I have in mind would work the
following way:

Busy CPU:
- intra-node rebalance every 200ms (interval timer controlled)
- cross-node rebalance every NODE_BALANCE_RATE*200ms (2s)
- when about to go idle, rebalance internally or across nodes, 10
times more often within the node

Idle CPU:
- intra-node rebalance every 1ms
- cross-node rebalance every NODE_BALANCE_RATE * 1ms (10ms)
This doesn't appear too frequent to me... after all, the CPU
is idle and couldn't steal anything from its own node.
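
Expressed as scheduler-tick macros these rates would look roughly like
the sketch below (the names are only illustrative, not taken from any
of the posted patches):

#define IDLE_REBALANCE_TICK	(HZ/1000 ?: 1)			/* idle, intra-node: ~1ms */
#define IDLE_NODEBALANCE_TICK	(IDLE_REBALANCE_TICK * 10)	/* idle, cross-node: ~10ms */
#define BUSY_REBALANCE_TICK	(HZ/5 ?: 1)			/* busy, intra-node: ~200ms */
#define BUSY_NODEBALANCE_TICK	(BUSY_REBALANCE_TICK * 10)	/* busy, cross-node: ~2s */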

I don't insist too much on this design, but I can't see any serious
reasons against it. Of course, the performance should decide.

I'm about to test the two versions under discussion on an NEC Asama
(a small configuration with 4 nodes, a good memory latency ratio between
nodes (1.6), and no node-level cache).

Best regards,
Erich


On Thursday 16 January 2003 21:19, Ingo Molnar wrote:
> On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> > > complex. It's the one that is aware of the global scheduling picture.
> > > For NUMA i'd suggest two asynchronous frequencies: one intra-node
> > > frequency, and an inter-node frequency - configured by the architecture
> > > and roughly in the same proportion to each other as cachemiss
> > > latencies.
> >
> > That's exactly what's in the latest set of patches - admittedly it's a
> > multiplier of when we run load_balance, not the tick multiplier, but
> > that's very easy to fix. Can you check out the stuff I posted last
> > night? I think it's somewhat cleaner ...
>
> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing. The inter-node
> balancing (which is heavier than even the global SMP balancer), should
> never be triggered from the high-frequency path. [whether it's high
> frequency or not depends on the actual workload, but it can be potentially
> _very_ high frequency, easily on the order of 1 million times a second -
> then you'll call the inter-node balancer 100K times a second.]
>
> I'd strongly suggest to decouple the heavy NUMA load-balancing code from
> the fastpath and re-check the benchmark numbers.
>
> Ingo
>
> (*) whether sched_balance_exec() is a high-frequency path or not is up to
> debate. Right now it's not possible to get much more than a couple of
> thousand exec()'s per second on fast CPUs. Hopefully that will change in
> the future though, so exec() events could become really fast. So i'd
> suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
> only do NUMA cross-node balancing with a fixed frequency, from the timer
> tick. But exec()-time is really special, since the user task usually has
> zero cached state at this point, so we _can_ do cheap cross-node balancing
> as well. So it's a boundary thing - probably doing the full-blown
> balancing is the right thing.

2003-01-17 13:54:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 2.5.58] new NUMA scheduler: fix


On Fri, 17 Jan 2003, Erich Focht wrote:

> I prefer a single point of entry called load_balance() to multiple
> functionally different balancers. [...]

agreed - my cleanup patch keeps that property.

> [...] IIRC, his conclusion for the multi-queue scheduler was that
> something on the order of 10ms is long enough: below that you start
> feeling the balancing overhead, above it you waste useful cycles.

this is one reason why we do the idle rebalancing at 1 msec granularity
right now.

> On a NUMA system this is even more important: the longer you leave fresh
> tasks on an overloaded node, the more probable it is that they allocate
> their memory there. They will then run with poor performance on the node
> that stayed idle for 200-400ms before stealing them. So one wastes
> 200-400ms on each CPU of the idle node and in the end gets tasks that
> perform poorly anyway. If the tasks are "old", at least we didn't waste
> too much time being idle. The long-term target should be that tasks
> remember where their memory is and return to that node.

i'd much rather vote for fork() and exec() time 'pre-balancing' and then
making it quite hard to move a task across nodes.

> > The inter-node balancing (which is heavier than even the global SMP
> > balancer), should never be triggered from the high-frequency path.
>
> Hmmm, we made it really slim. [...]

this is a misunderstanding. I'm not worried about the algorithmic overhead
_at all_, i'm worried about the effect of too frequent balancing - tasks
being moved between runqueues too often. That has been shown to be a
problem on SMP. The prev_load type of statistic relies on a constant
sampling frequency - it can lead to over-balancing if it's run too often.
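
for reference, this is the estimator in question (quoted from
find_busiest_queue() in the patch above), with comments added here only
to illustrate the point:

	rq_src = cpu_rq(i);
	if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
		/* a drop in load (or an idle scan) is believed immediately */
		load = rq_src->nr_running;
	else
		/* a rise is only believed once it survived a full interval */
		load = this_rq->prev_cpu_load[i];
	/* remember this sample for the next balancing run */
	this_rq->prev_cpu_load[i] = rq_src->nr_running;

the effective smoothing window is thus exactly one balancing interval -
shrink the interval and transient spikes start looking like real load.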

> So if the CPU is idle, it won't go through schedule(), except we get an
> interrupt from time to time... [...]

(no, it's even better than that, we never leave the idle loop except when
we _know_ that there is scheduling work to be done. Hence the
need_resched() test. But i'm not worried about balancing overhead at all.)

Ingo

2003-01-17 14:25:59

by Erich Focht

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

Hi Ingo,

I tried your patch on the small NEC TX7 I can access easily (8
Itanium2 CPUs on 4 nodes). I actually used your patch on top of the
2.5.52-ia64 kernel, but that shouldn't matter.

I like the cleanup of the topology.h. Also the renaming to
prev_cpu_load. There was a mistake (I think) in the call to
load_balance() in the idle path, guess you wanted to have:
+ load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
instead of
+ load_balance(this_rq, 1, this_cpumask);
otherwise you won't load balance at all for idle cpus.

Here are the results:

kernbench (average of 5 kernel compiles) (standard error in brackets)
---------
Elapsed UserTime SysTime
orig 134.43(1.79) 944.79(0.43) 21.41(0.28)
ingo 136.74(1.58) 951.55(0.73) 21.16(0.32)
ingofix 135.22(0.59) 952.17(0.78) 21.16(0.19)


hackbench (similar to the chat benchmark) (elapsed time for N groups of 20
--------- senders & receivers, stats from 10 measurements)

N=10 N=25 N=50 N=100
orig 0.77(0.03) 1.91(0.06) 3.77(0.06) 7.78(0.21)
ingo 1.70(0.35) 3.11(0.47) 4.85(0.55) 8.80(0.98)
ingofix 1.16(0.14) 2.67(0.53) 5.05(0.26) 9.99(0.13)


numabench (N memory intensive tasks running in parallel, disturbed for
--------- a short time by a "hackbench 10" call)


numa_test N=4 ElapsedTime TotalUserTime TotalSysTime
orig: 26.13(2.54) 86.10(4.47) 0.09(0.01)
ingo: 27.60(2.16) 88.06(4.58) 0.11(0.01)
ingofix: 25.51(3.05) 83.55(2.78) 0.10(0.01)

numa_test N=8 ElapsedTime TotalUserTime TotalSysTime
orig: 24.81(2.71) 164.94(4.82) 0.17(0.01)
ingo: 27.38(3.01) 170.06(5.60) 0.30(0.03)
ingofix: 29.08(2.79) 172.10(4.48) 0.32(0.03)

numa_test N=16 ElapsedTime TotalUserTime TotalSysTime
orig: 45.19(3.42) 332.07(5.89) 0.32(0.01)
ingo: 50.18(0.38) 359.46(9.31) 0.46(0.04)
ingofix: 50.30(0.42) 357.38(9.12) 0.46(0.01)

numa_test N=32 ElapsedTime TotalUserTime TotalSysTime
orig: 86.84(1.83) 671.99(9.98) 0.65(0.02)
ingo: 93.44(2.13) 704.90(16.91) 0.82(0.06)
ingofix: 93.92(1.28) 727.58(9.26) 0.77(0.03)


From these results I would prefer to either leave the numa scheduler
as it is or to introduce an IDLE_NODEBALANCE_TICK and a
BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
which balances very rarely. In that case it would make sense to keep
the cpus_to_balance() function.

Best regards,
Erich


On Friday 17 January 2003 09:47, Ingo Molnar wrote:
> Martin, Erich,
>
> could you give the attached patch a go, it implements my
> cleanup/reorganization suggestions on top of 2.5.59:
>
> - decouple the 'slow' rebalancer from the 'fast' rebalancer and attach it
> to the scheduler tick. Remove rq->nr_balanced.
>
> - do idle-rebalancing every 1 msec, intra-node rebalancing every 200
> msecs and inter-node rebalancing every 400 msecs.
>
> - move the tick-based rebalancing logic into rebalance_tick(), it looks
> more organized this way and we have all related code in one spot.
>
> - clean up the topology.h include file structure. Since generic kernel
> code uses all the defines already, there's no reason to keep them in
> asm-generic.h. I've created a linux/topology.h file that includes
> asm/topology.h and takes care of the default and generic definitions.
> Moved over a generic topology definition from mmzone.h.
>
> - renamed rq->prev_nr_running[] to rq->prev_cpu_load[] - this further
> unifies the SMP and NUMA balancers and is more in line with the
> prev_node_load name.
>
> If performance drops due to this patch then the benchmarking goal would be
> to tune the following frequencies:
>
> #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
> #define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
> #define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
>
> In theory NODE_REBALANCE_TICK could be defined by the NUMA platform,
> although in the past such per-platform scheduler tunings used to end
> 'orphaned' after some time. 400 msecs is pretty conservative at the
> moment, it could be made more frequent if benchmark results support it.
>
> the patch compiles and boots on UP and SMP, it compiles on x86-NUMA.
>
> Ingo

2003-01-17 14:59:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2


On Fri, 17 Jan 2003, Erich Focht wrote:

> I like the cleanup of the topology.h. Also the renaming to
> prev_cpu_load. There was a mistake (I think) in the call to
> load_balance() in the idle path, guess you wanted to have:
> + load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
> instead of
> + load_balance(this_rq, 1, this_cpumask);
> otherwise you won't load balance at all for idle cpus.

indeed - there was another bug as well, the 'idle' parameter to
load_balance() was 1 even in the busy branch, causing too slow balancing.

> From these results I would prefer to either leave the numa scheduler as
> it is or to introduce an IDLE_NODEBALANCE_TICK and a
> BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
> which balances very rarely.

agreed, i've attached the -B0 patch that does this. The balancing rates
are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
busy-global).

Ingo

--- linux/drivers/base/cpu.c.orig 2003-01-17 10:02:19.000000000 +0100
+++ linux/drivers/base/cpu.c 2003-01-17 10:02:25.000000000 +0100
@@ -6,8 +6,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int cpu_add_device(struct device * dev)
--- linux/drivers/base/node.c.orig 2003-01-17 10:02:50.000000000 +0100
+++ linux/drivers/base/node.c 2003-01-17 10:03:03.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int node_add_device(struct device * dev)
--- linux/drivers/base/memblk.c.orig 2003-01-17 10:02:33.000000000 +0100
+++ linux/drivers/base/memblk.c 2003-01-17 10:02:38.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/memblk.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int memblk_add_device(struct device * dev)
--- linux/include/asm-generic/topology.h.orig 2003-01-17 09:49:38.000000000 +0100
+++ linux/include/asm-generic/topology.h 2003-01-17 10:02:08.000000000 +0100
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <[email protected]>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
- in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu) (0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk) (0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node) (0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node) (0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node) (0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
--- linux/include/asm-ppc64/topology.h.orig 2003-01-17 09:54:46.000000000 +0100
+++ linux/include/asm-ppc64/topology.h 2003-01-17 09:55:18.000000000 +0100
@@ -46,18 +46,6 @@
return mask;
}

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu) (0)
-#define __memblk_to_node(memblk) (0)
-#define __parent_node(nid) (0)
-#define __node_to_first_cpu(node) (0)
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#define __node_to_memblk(node) (0)
-
#endif /* CONFIG_NUMA */

#endif /* _ASM_PPC64_TOPOLOGY_H */
--- linux/include/linux/topology.h.orig 2003-01-17 09:57:20.000000000 +0100
+++ linux/include/linux/topology.h 2003-01-17 10:09:38.000000000 +0100
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu) (0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk) (0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node) (0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node) (0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) (cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node) (cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node) (0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) __node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
--- linux/include/linux/mmzone.h.orig 2003-01-17 09:58:20.000000000 +0100
+++ linux/include/linux/mmzone.h 2003-01-17 10:01:17.000000000 +0100
@@ -255,9 +255,7 @@
#define MAX_NR_MEMBLKS 1
#endif /* CONFIG_NUMA */

-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>

#ifndef CONFIG_DISCONTIGMEM
extern struct pglist_data contig_page_data;
--- linux/include/asm-ia64/topology.h.orig 2003-01-17 09:54:33.000000000 +0100
+++ linux/include/asm-ia64/topology.h 2003-01-17 09:54:38.000000000 +0100
@@ -60,7 +60,4 @@
*/
#define __node_to_memblk(node) (node)

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
#endif /* _ASM_IA64_TOPOLOGY_H */
--- linux/include/asm-i386/topology.h.orig 2003-01-17 09:55:28.000000000 +0100
+++ linux/include/asm-i386/topology.h 2003-01-17 09:56:27.000000000 +0100
@@ -61,17 +61,6 @@
/* Returns the number of the first MemBlk on Node 'node' */
#define __node_to_memblk(node) (node)

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
#endif /* CONFIG_NUMA */

#endif /* _ASM_I386_TOPOLOGY_H */
--- linux/include/asm-i386/cpu.h.orig 2003-01-17 10:03:22.000000000 +0100
+++ linux/include/asm-i386/cpu.h 2003-01-17 10:03:31.000000000 +0100
@@ -3,8 +3,8 @@

#include <linux/device.h>
#include <linux/cpu.h>
+#include <linux/topology.h>

-#include <asm/topology.h>
#include <asm/node.h>

struct i386_cpu {
--- linux/include/asm-i386/node.h.orig 2003-01-17 10:04:02.000000000 +0100
+++ linux/include/asm-i386/node.h 2003-01-17 10:04:08.000000000 +0100
@@ -4,8 +4,7 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>

struct i386_node {
struct node node;
--- linux/include/asm-i386/memblk.h.orig 2003-01-17 10:03:51.000000000 +0100
+++ linux/include/asm-i386/memblk.h 2003-01-17 10:03:56.000000000 +0100
@@ -4,8 +4,8 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/memblk.h>
+#include <linux/topology.h>

-#include <asm/topology.h>
#include <asm/node.h>

struct i386_memblk {
--- linux/kernel/sched.c.orig 2003-01-17 09:22:24.000000000 +0100
+++ linux/kernel/sched.c 2003-01-17 17:01:47.000000000 +0100
@@ -153,10 +153,9 @@
nr_uninterruptible;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
- int prev_nr_running[NR_CPUS];
+ int prev_cpu_load[NR_CPUS];
#ifdef CONFIG_NUMA
atomic_t *node_nr_running;
- unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
task_t *migration_thread;
@@ -765,29 +764,6 @@
return node;
}

-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- int this_node = __cpu_to_node(this_cpu);
- /*
- * Avoid rebalancing between nodes too often.
- * We rebalance globally once every NODE_BALANCE_RATE load balances.
- */
- if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
- int node = find_busiest_node(this_node);
- this_rq->nr_balanced = 0;
- if (node >= 0)
- return (__node_to_cpu_mask(node) | (1UL << this_cpu));
- }
- return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- return cpu_online_map;
-}
-
#endif /* CONFIG_NUMA */

#if CONFIG_SMP
@@ -807,10 +783,10 @@
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
/* Need to recalculate nr_running */
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
} else
spin_lock(&busiest->lock);
}
@@ -847,10 +823,10 @@
* that case we are less picky about moving a task across CPUs and
* take what can be taken.
*/
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];

busiest = NULL;
max_load = 1;
@@ -859,11 +835,11 @@
continue;

rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
load = rq_src->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
+ load = this_rq->prev_cpu_load[i];
+ this_rq->prev_cpu_load[i] = rq_src->nr_running;

if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
@@ -922,7 +898,7 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
{
int imbalance, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
@@ -930,8 +906,7 @@
struct list_head *head, *curr;
task_t *tmp;

- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
- cpus_to_balance(this_cpu, this_rq));
+ busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
goto out;

@@ -1006,21 +981,75 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)

-static inline void idle_tick(runqueue_t *rq)
+#if CONFIG_NUMA
+static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
{
- if (jiffies % IDLE_REBALANCE_TICK)
- return;
- spin_lock(&rq->lock);
- load_balance(rq, 1);
- spin_unlock(&rq->lock);
+ int node = find_busiest_node(__cpu_to_node(this_cpu));
+ unsigned long cpumask, this_cpumask = 1UL << this_cpu;
+
+ if (node >= 0) {
+ cpumask = __node_to_cpu_mask(node) | this_cpumask;
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
}
+#endif

+static void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+#if CONFIG_NUMA
+ int this_cpu = smp_processor_id();
+#endif
+ unsigned long j = jiffies;
+
+ /*
+ * First do inter-node rebalancing, then intra-node rebalancing,
+ * if both events happen in the same tick. The inter-node
+ * rebalancing does not necessarily have to create a perfect
+ * balance within the node, since we load-balance the most loaded
+ * node with the current CPU. (ie. other CPUs in the local node
+ * are not balanced.)
+ */
+ if (idle) {
+#if CONFIG_NUMA
+ if (!(j % IDLE_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % IDLE_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+ return;
+ }
+#if CONFIG_NUMA
+ if (!(j % BUSY_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % BUSY_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+}
#endif

DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1092,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ rebalance_tick(rq, 1);
return;
}
if (TASK_NICE(p) > 0)
@@ -1121,11 +1148,8 @@
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
spin_unlock(&rq->lock);
+ rebalance_tick(rq, 0);
}

void scheduling_functions_start_here(void) { }
@@ -1184,7 +1208,7 @@
pick_next_task:
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
if (rq->nr_running)
goto pick_next_task;
#endif
--- linux/mm/page_alloc.c.orig 2003-01-17 10:01:29.000000000 +0100
+++ linux/mm/page_alloc.c 2003-01-17 10:01:35.000000000 +0100
@@ -28,8 +28,7 @@
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>

DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
--- linux/mm/vmscan.c.orig 2003-01-17 10:01:44.000000000 +0100
+++ linux/mm/vmscan.c 2003-01-17 10:01:52.000000000 +0100
@@ -27,10 +27,10 @@
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/rmap-locking.h>
+#include <linux/topology.h>

#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
-#include <asm/topology.h>
#include <asm/div64.h>

#include <linux/swapops.h>

2003-01-17 15:21:10

by Erich Focht

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

On Friday 17 January 2003 16:11, Ingo Molnar wrote:
> On Fri, 17 Jan 2003, Erich Focht wrote:
> > I like the cleanup of the topology.h. Also the renaming to
> > prev_cpu_load. There was a mistake (I think) in the call to
> > load_balance() in the idle path, guess you wanted to have:
> > + load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
> > instead of
> > + load_balance(this_rq, 1, this_cpumask);
> > otherwise you won't load balance at all for idle cpus.
>
> indeed - there was another bug as well, the 'idle' parameter to
> load_balance() was 1 even in the busy branch, causing too slow balancing.

I didn't see that, but its impact is only that a busy CPU steals
at most one task from another node; otherwise the idle=1 leads to more
aggressive balancing.

> > From these results I would prefer to either leave the numa scheduler as
> > it is or to introduce an IDLE_NODEBALANCE_TICK and a
> > BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
> > which balances very rarely.
>
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).

This looks good! I'll see if I can rerun the tests today; anyway, I'm
more optimistic about this version.

Regards,
Erich

2003-01-17 16:50:10

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).

Hmmm... something is drastically wrong here, looks like we're thrashing
tasks between nodes?

Kernbench:
Elapsed User System CPU
2.5.59 20.032s 186.66s 47.73s 1170%
2.5.59-ingo2 23.578s 198.874s 90.648s 1227.4%

NUMA schedbench 4:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 36.38 90.70 0.62
2.5.59-ingo2 0.00 47.62 127.13 1.89

NUMA schedbench 8:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 42.78 249.77 1.85
2.5.59-ingo2 0.00 59.45 358.31 5.23

NUMA schedbench 16:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 56.84 848.00 2.78
2.5.59-ingo2 0.00 114.70 1430.95 21.21

diffprofile:

44770 total
27840 do_anonymous_page
3180 buffered_rmqueue
1814 default_idle
1534 __copy_from_user_ll
1465 free_hot_cold_page
1172 __copy_to_user_ll
919 page_remove_rmap
881 zap_pte_range
730 do_wp_page
687 do_no_page
601 __alloc_pages
527 vm_enough_memory
432 __set_page_dirty_buffers
426 page_add_rmap
322 release_pages
311 __pagevec_lru_add_active
233 prep_new_page
202 clear_page_tables
181 schedule
132 current_kernel_time
127 __block_prepare_write
103 kmap_atomic
100 __wake_up
97 pte_alloc_one
95 may_open
87 find_get_page
76 dget_locked
72 bad_range
69 pgd_alloc
62 __fput
50 copy_strings
...
-66 open_namei
-68 path_lookup
-115 .text.lock.file_table
-255 .text.lock.dec_and_lock
-469 .text.lock.namei

2003-01-17 17:15:10

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> I like the cleanup of the topology.h.

And the rest of Ingo's second version:

diff -urpN -X /home/fletch/.diff.exclude ingo-A/kernel/sched.c ingo-B/kernel/sched.c
--- ingo-A/kernel/sched.c Fri Jan 17 09:18:32 2003
+++ ingo-B/kernel/sched.c Fri Jan 17 09:19:42 2003
@@ -153,10 +153,9 @@ struct runqueue {
nr_uninterruptible;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
- int prev_nr_running[NR_CPUS];
+ int prev_cpu_load[NR_CPUS];
#ifdef CONFIG_NUMA
atomic_t *node_nr_running;
- unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
task_t *migration_thread;
@@ -765,29 +764,6 @@ static int find_busiest_node(int this_no
return node;
}

-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- int this_node = __cpu_to_node(this_cpu);
- /*
- * Avoid rebalancing between nodes too often.
- * We rebalance globally once every NODE_BALANCE_RATE load balances.
- */
- if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
- int node = find_busiest_node(this_node);
- this_rq->nr_balanced = 0;
- if (node >= 0)
- return (__node_to_cpu_mask(node) | (1UL << this_cpu));
- }
- return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- return cpu_online_map;
-}
-
#endif /* CONFIG_NUMA */

#if CONFIG_SMP
@@ -807,10 +783,10 @@ static inline unsigned int double_lock_b
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
/* Need to recalculate nr_running */
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
} else
spin_lock(&busiest->lock);
}
@@ -847,10 +823,10 @@ static inline runqueue_t *find_busiest_q
* that case we are less picky about moving a task across CPUs and
* take what can be taken.
*/
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];

busiest = NULL;
max_load = 1;
@@ -859,11 +835,11 @@ static inline runqueue_t *find_busiest_q
continue;

rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
load = rq_src->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
+ load = this_rq->prev_cpu_load[i];
+ this_rq->prev_cpu_load[i] = rq_src->nr_running;

if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
@@ -922,7 +898,7 @@ static inline void pull_task(runqueue_t
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
{
int imbalance, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
@@ -930,8 +906,7 @@ static void load_balance(runqueue_t *thi
struct list_head *head, *curr;
task_t *tmp;

- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
- cpus_to_balance(this_cpu, this_rq));
+ busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
goto out;

@@ -1006,21 +981,75 @@ out:
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)

-static inline void idle_tick(runqueue_t *rq)
+#if CONFIG_NUMA
+static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
{
- if (jiffies % IDLE_REBALANCE_TICK)
- return;
- spin_lock(&rq->lock);
- load_balance(rq, 1);
- spin_unlock(&rq->lock);
+ int node = find_busiest_node(__cpu_to_node(this_cpu));
+ unsigned long cpumask, this_cpumask = 1UL << this_cpu;
+
+ if (node >= 0) {
+ cpumask = __node_to_cpu_mask(node) | this_cpumask;
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
}
+#endif

+static void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+#if CONFIG_NUMA
+ int this_cpu = smp_processor_id();
+#endif
+ unsigned long j = jiffies;
+
+ /*
+ * First do inter-node rebalancing, then intra-node rebalancing,
+ * if both events happen in the same tick. The inter-node
+ * rebalancing does not necessarily have to create a perfect
+ * balance within the node, since we load-balance the most loaded
+ * node with the current CPU. (ie. other CPUs in the local node
+ * are not balanced.)
+ */
+ if (idle) {
+#if CONFIG_NUMA
+ if (!(j % IDLE_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % IDLE_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+ return;
+ }
+#if CONFIG_NUMA
+ if (!(j % BUSY_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % BUSY_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+}
#endif

DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1092,7 @@ void scheduler_tick(int user_ticks, int
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ rebalance_tick(rq, 1);
return;
}
if (TASK_NICE(p) > 0)
@@ -1121,11 +1148,8 @@ void scheduler_tick(int user_ticks, int
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
spin_unlock(&rq->lock);
+ rebalance_tick(rq, 0);
}

void scheduling_functions_start_here(void) { }
@@ -1184,7 +1208,7 @@ need_resched:
pick_next_task:
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
if (rq->nr_running)
goto pick_next_task;
#endif

2003-01-17 17:13:35

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> I like the cleanup of the topology.h.

Any chance we could keep that broken out as a separate patch?
Topology cleanups below:

diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/cpu.c ingo-A/drivers/base/cpu.c
--- virgin/drivers/base/cpu.c Sun Dec 1 09:59:47 2002
+++ ingo-A/drivers/base/cpu.c Fri Jan 17 09:19:23 2003
@@ -6,8 +6,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int cpu_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/memblk.c ingo-A/drivers/base/memblk.c
--- virgin/drivers/base/memblk.c Mon Dec 16 21:50:42 2002
+++ ingo-A/drivers/base/memblk.c Fri Jan 17 09:19:23 2003
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/memblk.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int memblk_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/node.c ingo-A/drivers/base/node.c
--- virgin/drivers/base/node.c Fri Jan 17 09:18:26 2003
+++ ingo-A/drivers/base/node.c Fri Jan 17 09:19:23 2003
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>


static int node_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-generic/topology.h ingo-A/include/asm-generic/topology.h
--- virgin/include/asm-generic/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-generic/topology.h Fri Jan 17 09:19:23 2003
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <[email protected]>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
- in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu) (0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk) (0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node) (0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node) (0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node) (0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/cpu.h ingo-A/include/asm-i386/cpu.h
--- virgin/include/asm-i386/cpu.h Sun Nov 17 20:29:50 2002
+++ ingo-A/include/asm-i386/cpu.h Fri Jan 17 09:19:23 2003
@@ -3,8 +3,8 @@

#include <linux/device.h>
#include <linux/cpu.h>
+#include <linux/topology.h>

-#include <asm/topology.h>
#include <asm/node.h>

struct i386_cpu {
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/memblk.h ingo-A/include/asm-i386/memblk.h
--- virgin/include/asm-i386/memblk.h Sun Nov 17 20:29:47 2002
+++ ingo-A/include/asm-i386/memblk.h Fri Jan 17 09:19:25 2003
@@ -4,8 +4,8 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/memblk.h>
+#include <linux/topology.h>

-#include <asm/topology.h>
#include <asm/node.h>

struct i386_memblk {
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/node.h ingo-A/include/asm-i386/node.h
--- virgin/include/asm-i386/node.h Sun Nov 17 20:29:29 2002
+++ ingo-A/include/asm-i386/node.h Fri Jan 17 09:19:24 2003
@@ -4,8 +4,7 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>

struct i386_node {
struct node node;
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/topology.h ingo-A/include/asm-i386/topology.h
--- virgin/include/asm-i386/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-i386/topology.h Fri Jan 17 09:19:23 2003
@@ -61,17 +61,6 @@ static inline int __node_to_first_cpu(in
/* Returns the number of the first MemBlk on Node 'node' */
#define __node_to_memblk(node) (node)

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
#endif /* CONFIG_NUMA */

#endif /* _ASM_I386_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-ia64/topology.h ingo-A/include/asm-ia64/topology.h
--- virgin/include/asm-ia64/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-ia64/topology.h Fri Jan 17 09:19:23 2003
@@ -60,7 +60,4 @@
*/
#define __node_to_memblk(node) (node)

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
#endif /* _ASM_IA64_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-ppc64/topology.h ingo-A/include/asm-ppc64/topology.h
--- virgin/include/asm-ppc64/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-ppc64/topology.h Fri Jan 17 09:19:23 2003
@@ -46,18 +46,6 @@ static inline unsigned long __node_to_cp
return mask;
}

-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu) (0)
-#define __memblk_to_node(memblk) (0)
-#define __parent_node(nid) (0)
-#define __node_to_first_cpu(node) (0)
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#define __node_to_memblk(node) (0)
-
#endif /* CONFIG_NUMA */

#endif /* _ASM_PPC64_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/linux/mmzone.h ingo-A/include/linux/mmzone.h
--- virgin/include/linux/mmzone.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/linux/mmzone.h Fri Jan 17 09:19:23 2003
@@ -255,9 +255,7 @@ static inline struct zone *next_zone(str
#define MAX_NR_MEMBLKS 1
#endif /* CONFIG_NUMA */

-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>

#ifndef CONFIG_DISCONTIGMEM
extern struct pglist_data contig_page_data;
diff -urpN -X /home/fletch/.diff.exclude virgin/include/linux/topology.h ingo-A/include/linux/topology.h
--- virgin/include/linux/topology.h Wed Dec 31 16:00:00 1969
+++ ingo-A/include/linux/topology.h Fri Jan 17 09:19:23 2003
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu) (0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk) (0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node) (0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node) (0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) (cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node) (cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node) (0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) __node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/mm/page_alloc.c ingo-A/mm/page_alloc.c
--- virgin/mm/page_alloc.c Fri Jan 17 09:18:32 2003
+++ ingo-A/mm/page_alloc.c Fri Jan 17 09:19:25 2003
@@ -28,8 +28,7 @@
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>

DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
diff -urpN -X /home/fletch/.diff.exclude virgin/mm/vmscan.c ingo-A/mm/vmscan.c
--- virgin/mm/vmscan.c Mon Dec 23 23:01:58 2002
+++ ingo-A/mm/vmscan.c Fri Jan 17 09:19:25 2003
@@ -27,10 +27,10 @@
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/rmap-locking.h>
+#include <linux/topology.h>

#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
-#include <asm/topology.h>
#include <asm/div64.h>

#include <linux/swapops.h>

2003-01-17 18:02:01

by Erich Focht

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

Ingo,

I repeated the tests with your B0 version and it's still not
satisfying. Maybe too aggressive NODE_REBALANCE_IDLE_TICK, maybe the
difference is that the other calls of load_balance() never have the
chance to balance across nodes.

Here are the results:

kernbench (average of 5 kernel compiles) (standard error in brackets)
---------
Elapsed UserTime SysTime
orig 134.43(1.79) 944.79(0.43) 21.41(0.28)
ingo 136.74(1.58) 951.55(0.73) 21.16(0.32)
ingofix 135.22(0.59) 952.17(0.78) 21.16(0.19)
ingoB0 134.69(0.51) 951.63(0.81) 21.12(0.15)


hackbench (chat benchmark alike) (elapsed time for N groups of 20
--------- senders & receivers, stats from 10 measurements)

N=10 N=25 N=50 N=100
orig 0.77(0.03) 1.91(0.06) 3.77(0.06) 7.78(0.21)
ingo 1.70(0.35) 3.11(0.47) 4.85(0.55) 8.80(0.98)
ingofix 1.16(0.14) 2.67(0.53) 5.05(0.26) 9.99(0.13)
ingoB0 0.84(0.03) 2.12(0.12) 4.20(0.22) 8.04(0.16)


numabench (N memory intensive tasks running in parallel, disturbed for
--------- a short time by a "hackbench 10" call)


numa_test N=4 ElapsedTime TotalUserTime TotalSysTime
orig: 26.13(2.54) 86.10(4.47) 0.09(0.01)
ingo: 27.60(2.16) 88.06(4.58) 0.11(0.01)
ingofix: 25.51(3.05) 83.55(2.78) 0.10(0.01)
ingoB0: 27.58(0.08) 90.86(4.42) 0.09(0.01)

numa_test N=8 ElapsedTime TotalUserTime TotalSysTime
orig: 24.81(2.71) 164.94(4.82) 0.17(0.01)
ingo: 27.38(3.01) 170.06(5.60) 0.30(0.03)
ingofix: 29.08(2.79) 172.10(4.48) 0.32(0.03)
ingoB0: 26.05(3.28) 171.61(7.76) 0.18(0.01)

numa_test N=16 ElapsedTime TotalUserTime TotalSysTime
orig: 45.19(3.42) 332.07(5.89) 0.32(0.01)
ingo: 50.18(0.38) 359.46(9.31) 0.46(0.04)
ingofix: 50.30(0.42) 357.38(9.12) 0.46(0.01)
ingoB0: 50.96(1.33) 371.72(18.58) 0.34(0.01)

numa_test N=32 ElapsedTime TotalUserTime TotalSysTime
orig: 86.84(1.83) 671.99(9.98) 0.65(0.02)
ingo: 93.44(2.13) 704.90(16.91) 0.82(0.06)
ingofix: 93.92(1.28) 727.58(9.26) 0.77(0.03)
ingoB0: 99.72(4.13) 759.03(29.41) 0.69(0.01)


The kernbench user time is still too large.
Hackbench improved a lot (understandable, as idle CPUs steal earlier
from remote nodes).
Numa_test didn't improve; on average we have the same results.

Hmmm, now I really tend towards leaving it the way it is in
2.5.59, except for the topology cleanup and renaming, of course. I have no
more time to test a more conservative setting of
IDLE_NODE_REBALANCE_TICK today, but that could help...

Regards,
Erich

2003-01-17 18:09:12

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

On Fri, 2003-01-17 at 07:11, Ingo Molnar wrote:

> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).
>
> Ingo

Ran this patch on a 4 node (16 CPU, 16 GB memory) NUMAQ. Results don't
look encouraging. I would suggest not applying this patch until the
degradation is worked out.

stock59 = linux 2.5.59
ingo59 = linux 2.5.59 with Ingo's B0 patch

Kernbench:
Elapsed User System CPU
stock59 29.668s 283.762s 82.286s 1233%
ingo59 37.736s 338.162s 153.486s 1302.6%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 24.44 68.07 0.78
ingo59 0.00 62.14 163.32 1.93

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 48.26 246.75 1.64
ingo59 0.00 68.17 376.85 6.42

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 56.51 845.26 2.98
ingo59 0.00 114.38 1337.65 18.89

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 116.95 1806.33 6.23
ingo59 0.00 243.46 3515.09 43.92

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 237.29 3634.59 15.71
ingo59 0.00 688.31 10605.40 102.71

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-17 19:04:21

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> I repeated the tests with your B0 version and it's still not
> satisfying. Maybe too aggressive NODE_REBALANCE_IDLE_TICK, maybe the
> difference is that the other calls of load_balance() never have the
> chance to balance across nodes.

Nope, I found the problem. The topo cleanups are broken - we end up
taking all mem accesses, etc to node 0.

Use the second half of the patch (the splitup I already posted),
and fix the obvious compile error. Works fine now ;-)

Matt, you know the topo stuff better than anyone. Can you take a look
at the cleanup Ingo did, and see if it's easily fixable?

M.

PS. Ingo - I love the restructuring of the scheduler bits.
I think we need a multiplier > 2 though ... I set it to 10 for now.
Tuning will tell ...

2003-01-17 19:26:39

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [patch] sched-2.5.59-A2

>> I repeated the tests with your B0 version and it's still not
>> satisfying. Maybe too aggressive NODE_REBALANCE_IDLE_TICK, maybe the
>> difference is that the other calls of load_balance() never have the
>> chance to balance across nodes.
>
> Nope, I found the problem. The topo cleanups are broken - we end up
> taking all mem accesses, etc to node 0.

Kernbench:
Elapsed User System CPU
2.5.59 20.032s 186.66s 47.73s 1170%
2.5.59-ingo-mjb 19.986s 187.044s 48.592s 1178.8%

NUMA schedbench 4:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 36.38 90.70 0.62
2.5.59-ingo-mjb 0.00 34.70 88.58 0.69

NUMA schedbench 8:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 42.78 249.77 1.85
2.5.59-ingo-mjb 0.00 49.33 256.59 1.69

NUMA schedbench 16:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 56.84 848.00 2.78
2.5.59-ingo-mjb 0.00 65.67 875.05 3.58

NUMA schedbench 32:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 116.36 1807.29 5.75
2.5.59-ingo-mjb 0.00 142.77 2039.47 8.42

NUMA schedbench 64:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 240.01 3634.20 14.57
2.5.59-ingo-mjb 0.00 293.48 4534.99 20.62

System times are a little higher (multipliers are set at busy = 10,
idle = 10) .... I'll try setting the idle multiplier to 100, but
the other thing to do would be to increase the cross-node migrate
resistance by setting some minimum imbalance offsets. That'll
probably have to be node-specific ... something like the number
of cpus per node ... but probably 0 for the simple HT systems.

M.

2003-01-17 23:08:47

by Matthew Dobson

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

Martin J. Bligh wrote:
>>I repeated the tests with your B0 version and it's still not
>>satisfying. Maybe too aggressive NODE_REBALANCE_IDLE_TICK, maybe the
>>difference is that the other calls of load_balance() never have the
>>chance to balance across nodes.
>
>
> Nope, I found the problem. The topo cleanups are broken - we end up
> taking all mem accesses, etc to node 0.
>
> Use the second half of the patch (the splitup I already posted),
> and fix the obvious compile error. Works fine now ;-)
>
> Matt, you know the topo stuff better than anyone. Can you take a look
> at the cleanup Ingo did, and see if it's easily fixable?

Umm... most of it looks clean. I'm not really sure what the
__cpu_to_node_mask(cpu) macro is supposed to do; it looks to be just an
alias for the __node_to_cpu_mask() macro, which makes little sense to
me. That's the only thing that immediately sticks out. I'm doubly
confused as to why it's defined twice in include/linux/topology.h.
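
With the earlier #ifndef block already defining __cpu_to_node_mask, the
second, derived definition can never take effect, so presumably only one
of the two was meant to survive. My guess (and it is only a guess) at the
intended single default would be just:

#ifndef __cpu_to_node_mask
#define __cpu_to_node_mask(cpu)	__node_to_cpu_mask(__cpu_to_node(cpu))
#endif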

-Matt

>
> M.
>
> PS. Ingo - I love the restructuring of the scheduler bits.
> I think we need a multiplier > 2 though ... I set it to 10 for now.
> Tuning will tell ...
>


2003-01-18 00:02:21

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [patch] sched-2.5.59-A2

On Fri, 2003-01-17 at 11:26, Martin J. Bligh wrote:
>
> System times are a little higher (multipliers are set at busy = 10,
> idle = 10) .... I'll try setting the idle multiplier to 100, but
> the other thing to do would be to increase the cross-node migrate
> resistance by setting some minimum imbalance offsets. That'll
> probably have to be node-specific ... something like the number
> of cpus per node ... but probably 0 for the simple HT systems.
>
> M.

I tried several values for the multipliers (IDLE_NODE_REBALANCE_TICK
and BUSY_NODE_REBALANCE_TICK) on a 4 node NUMAQ (16 700MHZ Pentium III
CPUs, 16 GB memory). The interesting results follow:

Kernels:

stock59 - linux 2.5.59 with cputime_stats patch
ingoI50B10 - stock59 with Ingo's B0 patch as modified by Martin
with a IDLE_NODE_REBALANCE_TICK multiplier of 50
and a BUSY_NODE_REBALANCE_TICK multiplier of 10
ingoI2B2 - stock59 with Ingos B0 patch with the original multipliers (2)

Kernbench:
Elapsed User System CPU
ingoI50B10 29.574s 284.922s 84.542s 1249%
stock59 29.498s 283.976s 83.05s 1243.8%
ingoI2B2 30.212s 289.718s 84.926s 1240%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 22.49 37.04 90.00 0.86
stock59 22.25 35.94 89.06 0.81
ingoI2B2 24.98 40.71 99.98 1.03

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 31.16 49.63 249.32 1.86
stock59 28.40 42.25 227.26 1.67
ingoI2B2 36.81 59.38 294.57 2.02

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 52.73 56.61 843.91 3.56
stock59 52.97 57.19 847.70 3.29
ingoI2B2 57.88 71.32 926.29 5.51

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 58.85 139.76 1883.59 8.07
stock59 56.57 118.05 1810.53 5.97
ingoI2B2 61.48 137.38 1967.52 10.92

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 70.74 297.25 4528.25 19.09
stock59 56.75 234.12 3632.72 15.70
ingoI2B2 70.56 298.45 4516.67 21.41

Martin already posted numbers with 10/10 for the multipliers.

It looks like on the NUMAQ the 50/10 values give the best results,
at least on these tests. I suspect that on other architectures
the optimum numbers will vary. Most likely on machines with lower
latency to off-node memory, lower multipliers will help. It will
be interesting to see numbers from Erich from the NEC IA64 boxes.
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-18 06:59:46

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

On Fri, Jan 17, 2003 at 04:11:59PM +0100, Ingo Molnar wrote:
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).

I suspect some of these results may be off on NUMA-Q (or any PAE box)
if CONFIG_MTRR was enabled. Michael, Martin, please double-check
/proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
your compile times aren't on the order of 5-10 minutes, you're unaffected.

The severity of the MTRR regression in 2.5.59 is apparent from:
$ cat /proc/mtrr
reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
$ time make -j bzImage > /dev/null
make -j bzImage > /dev/null 8338.52s user 245.73s system 1268% cpu 11:16.56 total

Fixing it up by hand (after dealing with various bits of pain) to:
$ cat /proc/mtrr
reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
reg02: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
reg03: base=0x200000000 (8192MB), size=4096MB: write-back, count=1
reg04: base=0x300000000 (12288MB), size=4096MB: write-back, count=1
reg05: base=0x400000000 (16384MB), size=16384MB: write-back, count=1
reg06: base=0x800000000 (32768MB), size=16384MB: write-back, count=1
reg07: base=0xc00000000 (49152MB), size=16384MB: write-back, count=1

make -j bzImage > /dev/null 361.72s user 546.28s system 2208% cpu 41.109 total
make -j bzImage > /dev/null 364.00s user 575.73s system 2005% cpu 46.858 total
make -j bzImage > /dev/null 366.77s user 568.44s system 2239% cpu 41.765 total
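
(The usual way to add such a region by hand is the /proc/mtrr write
interface, e.g. for the first of the extra regions:

$ echo "base=0x100000000 size=0x100000000 type=write-back" >| /proc/mtrr

though whether the stock 2.5.59 code takes >4GB bases unmodified is
presumably part of the pain mentioned above.)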

I'll do some bisection search to figure out which patch broke the world.

-- wli

2003-01-18 08:03:32

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

>> agreed, i've attached the -B0 patch that does this. The balancing rates
>> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
>> busy-global).
>
> I suspect some of these results may be off on NUMA-Q (or any PAE box)
> if CONFIG_MTRR was enabled. Michael, Martin, please double-check
> /proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
> your compile times aren't on the order of 5-10 minutes, you're unaffected.
>
> The severity of the MTRR regression in 2.5.59 is apparent from:
> $ cat /proc/mtrr
> reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
> reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1

Works for me, I have MTRR on.

larry:~# cat /proc/mtrr
reg00: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
reg01: base=0x00000000 ( 0MB), size=16384MB: write-back, count=1


2003-01-18 08:08:04

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

At some point in the past, I wrote:
>> I suspect some of these results may be off on NUMA-Q (or any PAE box)
>> if CONFIG_MTRR was enabled. Michael, Martin, please double-check
>> /proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
>> your compile times aren't on the order of 5-10 minutes, you're unaffected.
>> The severity of the MTRR regression in 2.5.59 is apparent from:
>> $ cat /proc/mtrr
>> reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
>> reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1

On Sat, Jan 18, 2003 at 12:12:31AM -0800, Martin J. Bligh wrote:
> Works for me, I have MTRR on.
> larry:~# cat /proc/mtrr
> reg00: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
> reg01: base=0x00000000 ( 0MB), size=16384MB: write-back, count=1

Okay, it sounds like the problem needs some extra RAM to trigger. We
can bounce quads back & forth if need be, but I'll at least take a shot
at finding where it happened before you probably need to look into it.


Thanks,
Bill

2003-01-18 13:22:08

by Erich Focht

[permalink] [raw]
Subject: [patch] tunable rebalance rates for sched-2.5.59-B0

Hi,

I'm currently scanning the parameter space of IDLE_NODE_REBALANCE_TICK
and BUSY_NODE_REBALANCE_TICK with the help of tunable rebalance
rates. The patch basically does:

-#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
-#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
+int idle_nodebalance_rate = 10;
+int busy_nodebalance_rate = 10;
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * idle_nodebalance_rate)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * busy_nodebalance_rate)

and makes the variables accessible in /proc/sys/kernel

We might want to leave these tunable in case it turns out that
different platforms need significantly different values. Right now
it's just a tool for tuning.
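
The /proc/sys/kernel hookup is just the standard ctl_table +
proc_dointvec dance, i.e. entries of roughly this shape added to
kern_table in kernel/sysctl.c (illustrative sketch only, ctl numbers
made up; the real thing is in the attachment):

	{100, "idle_nodebalance_rate", &idle_nodebalance_rate,
	 sizeof(int), 0644, NULL, &proc_dointvec},
	{101, "busy_nodebalance_rate", &busy_nodebalance_rate,
	 sizeof(int), 0644, NULL, &proc_dointvec},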

Regards,
Erich


Attachments:
tunable-balance-rate-2.5.59 (2.40 kB)

2003-01-18 20:46:15

by Martin J. Bligh

[permalink] [raw]
Subject: NUMA sched -> pooling scheduler (inc HT)

Andrew, hopefully this'll give you a cleaner integration point to do
the HT scheduler stuff ... I basically did a rename of "node" to "pool"
on sched.c (OK, it was a little more complex than that), and provided
some hooks for you to attach to. There's a really hacky version of
the HT stuff in there that I doubt works at all. (sched.h will need
something other than CONFIG_SCHED_NUMA, for starters).

It's not really finished, but I have to go out ... I thought you or
someone else might like to have a play with it in the meantime.
Goes on top of the second half of Ingo's stuff from yesterday
(also attached).

I think this should result in a much cleaner integration between the HT
aware stuff and the NUMA stuff. Pools is a concept Erich had in his
scheduler a while back, but it got set aside in the paring down for
integration. We should be able to add multiple levels to this fairly
easily, at some point (eg HT + NUMA), but let's get the basics working
first ;-)

M.


Attachments:
01-ingo (6.67 kB)
02-pools (12.61 kB)

2003-01-18 21:26:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA sched -> pooling scheduler (inc HT)

Mmm... seems I may have got the ordering of the cpus wrong.
Something like this might work better in sched_topo_ht.h
(yeah, it's ugly. I don't care).

static inline unsigned long pool_to_cpu_mask (int pool)
{
	/* the pool's own bit plus that of its HT sibling */
	return (1UL << pool) | (1UL << cpu_sibling_map[pool]);
}

static inline int cpu_to_pool (int cpu)
{
	/* the lower-numbered sibling acts as the pool id */
	return min(cpu, cpu_sibling_map[cpu]);
}

Thanks to Andi, Zwane, and Bill for the corrective baseball bat strike.
I changed the macros to inlines to avoid the risk of double eval.

M.

2003-01-18 23:00:25

by Erich Focht

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

The scan through a piece of the parameter space delivered quite
inconclusive results. I used the IDLE_NODE_REBALANCE_TICK multipliers
2, 5, 10, 20, 50 and the BUSY_NODE_REBALANCE_TICK multipliers 2, 5,
10, 20.

The benchmarks I tried were kernbench (average and error of 5 kernel
compiles) and hackbench (5 runs for each number of chatter groups:
10, 25, 50, 100). The 2.5.59 scheduler result is printed first, then a
matrix with all combinations of idle and busy rebalance
multipliers. Each value is followed by its standard error (coming from
the 5 measurements). I didn't measure numa_bench; those values depend
mostly on the initial load balancing and showed no clear
tendency/difference.

The machine is an NEC TX7 (small version, 8 Itanium2 CPUs in 4 nodes).

The results:
- kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
best value 627.33s for idle=20ms, busy=2000ms.
- hackbench: 2.5.59 scheduler is significantly better for all
measurements.

I suppose this comes from the fact that the 2.5.59 version has the
chance to load_balance across nodes when a cpu goes idle. No idea what
other reason it could be... Maybe anybody else?

Kernbench:
==========
2.5.59 : Elapsed = 86.29(1.24)
ingo B0 : Elapsed
idle 2 5 10 20 50
busy
2 86.25(0.45) 86.62(1.56) 86.29(0.99) 85.99(0.60) 86.91(1.09)
5 86.87(1.12) 86.38(0.82) 86.00(0.69) 86.14(0.39) 86.47(0.68)
10 86.06(0.18) 86.23(0.38) 86.63(0.57) 86.82(0.95) 86.06(0.15)
20 86.64(1.24) 86.43(0.74) 86.15(0.99) 86.76(1.34) 86.70(0.68)

2.5.59 : UserTime = 623.24(0.46)
ingo B0 : UserTime
idle 2 5 10 20 50
busy
2 629.05(0.32) 628.54(0.53) 628.51(0.32) 628.66(0.23) 628.72(0.20)
5 628.14(0.88) 628.10(0.76) 628.33(0.41) 628.45(0.48) 628.11(0.37)
10 627.97(0.30) 627.77(0.23) 627.75(0.21) 627.33(0.45) 627.63(0.52)
20 627.55(0.36) 627.67(0.58) 627.36(0.67) 627.84(0.28) 627.69(0.59)

2.5.59 : SysTime = 21.83(0.16)
ingo B0 : SysTime
idle 2 5 10 20 50
busy
2 21.99(0.26) 21.89(0.12) 22.12(0.16) 22.06(0.21) 22.44(0.51)
5 22.07(0.21) 22.29(0.54) 22.15(0.08) 22.09(0.26) 21.90(0.18)
10 22.01(0.20) 22.42(0.42) 22.28(0.23) 22.04(0.37) 22.41(0.26)
20 22.03(0.20) 22.08(0.30) 22.31(0.27) 22.03(0.19) 22.35(0.33)


Hackbench 10
=============
2.5.59 : 0.77(0.03)
ingo B0:
idle 2 5 10 20 50
busy
2 0.90(0.07) 0.88(0.05) 0.84(0.05) 0.82(0.04) 0.85(0.06)
5 0.87(0.05) 0.90(0.07) 0.88(0.08) 0.89(0.09) 0.86(0.07)
10 0.85(0.06) 0.83(0.05) 0.86(0.08) 0.84(0.06) 0.87(0.06)
20 0.85(0.05) 0.87(0.07) 0.83(0.05) 0.86(0.07) 0.87(0.05)

Hackbench 25
=============
2.5.59 : 1.96(0.05)
ingo B0:
idle 2 5 10 20 50
busy
2 2.20(0.13) 2.21(0.12) 2.23(0.10) 2.20(0.10) 2.16(0.07)
5 2.13(0.12) 2.17(0.13) 2.18(0.08) 2.10(0.11) 2.16(0.10)
10 2.19(0.08) 2.21(0.12) 2.22(0.09) 2.11(0.10) 2.15(0.10)
20 2.11(0.17) 2.13(0.08) 2.18(0.06) 2.13(0.11) 2.13(0.14)

Hackbench 50
=============
2.5.59 : 3.78(0.10)
ingo B0:
idle 2 5 10 20 50
busy
2 4.31(0.13) 4.30(0.29) 4.29(0.15) 4.23(0.20) 4.14(0.10)
5 4.35(0.16) 4.34(0.24) 4.24(0.24) 4.09(0.18) 4.12(0.14)
10 4.35(0.23) 4.21(0.14) 4.36(0.24) 4.18(0.12) 4.36(0.21)
20 4.34(0.14) 4.27(0.17) 4.18(0.18) 4.29(0.24) 4.08(0.09)

Hackbench 100
==============
2.5.59 : 7.85(0.37)
ingo B0:
idle 2 5 10 20 50
busy
2 8.21(0.42) 8.07(0.25) 8.32(0.30) 8.06(0.26) 8.10(0.13)
5 8.13(0.25) 8.06(0.33) 8.14(0.49) 8.24(0.24) 8.04(0.20)
10 8.05(0.17) 8.16(0.13) 8.13(0.16) 8.05(0.24) 8.01(0.30)
20 8.21(0.25) 8.23(0.24) 8.36(0.41) 8.30(0.37) 8.10(0.30)


Regards,
Erich

2003-01-19 04:14:47

by William Lee Irwin III

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

On Fri, Jan 17, 2003 at 11:08:08PM -0800, William Lee Irwin III wrote:
> The severity of the MTRR regression in 2.5.59 is apparent from:

This is not a result of userland initscripts botching the MTRR; not
only are printk's in MTRR-setting routines not visible but it's also
apparent from the fact that highmem mem_map initialization suffers a
similar degradation, adding almost a full 20 minutes to boot times.


-- wli

2003-01-20 09:15:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2


On Sun, 19 Jan 2003, Erich Focht wrote:

> The results:
> - kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
> best value 627.33s for idle=20ms, busy=2000ms.
> - hackbench: 2.5.59 scheduler is significantly better for all
> measurements.
>
> I suppose this comes from the fact that the 2.5.59 version has the
> chance to load_balance across nodes when a cpu goes idle. No idea what
> other reason it could be... Maybe anybody else?

this shows that aggressive idle-rebalancing is the most important factor. I
think this means that the unification of the balancing code should go into
the other direction: ie. applying the ->nr_balanced logic to the SMP
balancer as well.

kernelbench is the kind of benchmark that is most sensitive to over-eager
global balancing, and since the 2.5.59 ->nr_balanced logic produced the
best results, it clearly shows it's not over-eager. hackbench is one that
is quite sensitive to under-balancing. Ie. trying to maximize both will
lead us to a good balance.

Ingo

2003-01-20 11:58:40

by Erich Focht

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

On Monday 20 January 2003 10:28, Ingo Molnar wrote:
> On Sun, 19 Jan 2003, Erich Focht wrote:
> > The results:
> > - kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
> > best value 627.33s for idle=20ms, busy=2000ms.
> > - hackbench: 2.5.59 scheduler is significantly better for all
> > measurements.
> >
> > I suppose this comes from the fact that the 2.5.59 version has the
> > chance to load_balance across nodes when a cpu goes idle. No idea what
> > other reason it could be... Maybe anybody else?
>
> this shows that aggressive idle-rebalancing is the most important factor. I
> think this means that the unification of the balancing code should go into
> the other direction: ie. applying the ->nr_balanced logic to the SMP
> balancer as well.

Could you please explain your idea? As far as I understand, the SMP
balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
mean something different...

> kernelbench is the kind of benchmark that is most sensitive to over-eager
> global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> best results, it clearly shows it's not over-eager. hackbench is one that
> is quite sensitive to under-balancing. Ie. trying to maximize both will
> lead us to a good balance.

Yes! Actually the currently implemented nr_balanced logic is pretty
dumb: the counter reaches the cross-node balance threshold after a
certain number of calls to intra-node lb, no matter whether these were
successfull or not. I'd like to try incrementing the counter only on
unsuccessfull load balances, this would give a clear priority to
intra-node balancing and a clear and controllable delay for cross-node
balancing. A tiny patch for this (for 2.5.59) is attached. As the name
nr_balanced would be misleading for this kind of usage, I renamed it
to nr_lb_failed.
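
In essence (illustration only, the actual change is in the attached
patch):

/* Count only *failed* intra-node balance attempts, and allow a
 * cross-node balance once enough failures have accumulated. */
static int nr_lb_failed;

static int should_balance_across_nodes(int intranode_balance_moved_a_task,
					int failure_threshold)
{
	if (intranode_balance_moved_a_task) {
		nr_lb_failed = 0;	/* local balancing still finds work */
		return 0;
	}
	if (++nr_lb_failed >= failure_threshold) {
		nr_lb_failed = 0;	/* reset, allow one cross-node attempt */
		return 1;
	}
	return 0;
}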

Regards,
Erich


Attachments:
failed-lb-ctr-2.5.59 (1.21 kB)

2003-01-20 16:14:37

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> kernelbench is the kind of benchmark that is most sensitive to over-eager
> global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> best results, it clearly shows it's not over-eager.

Careful ... what shows well on one machine may not on another - this depends
heavily on the NUMA ratio - for our machine the nr_balanced logic in 59 is
still over-aggressive (20:1 NUMA ratio). For low-ratio machines it may work
fine. It actually works best for us when it's switched off altogether I
think (which is obviously not a good solution).

But there's more than one dimension to tune here - we can tune both the
frequency and the level of imbalance required. I had good results specifying
a minimum imbalance of > 4 between the current and busiest nodes before
balancing. Reason (2 nodes, 4 CPUs each): If I have 4 tasks on one node,
and 8 on another, that's still one or two per cpu in all cases whatever
I do (well, provided I'm not stupid enough to make anything idle). So
at that point, I just want lowest task thrash.
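
Something as simple as this check is what I have in mind (sketch only;
min_offset would be per-node, e.g. the number of cpus in the node, and
0 for plain HT):

static inline int node_worth_stealing_from(int this_load, int busiest_load,
					   int min_offset)
{
	/* require an absolute gap on top of any ratio check */
	return busiest_load > this_load + min_offset;
}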

Moving tasks between nodes is really expensive, and shouldn't be done
lightly - the only thing the background busy rebalancer should be fixing
is significant long-term imbalances. It would be nice if we also chose
the task with the smallest RSS to migrate, I think that's a worthwhile
optimisation (we'll need to make sure we use realistic benchmarks with
a mix of different task sizes). Working out which ones have the smallest
"on-node RSS - off-node RSS" is another step after that ...

> hackbench is one that is quite sensitive to under-balancing.
> Ie. trying to maximize both will lead us to a good balance.

I'll try to do some hackbench runs on NUMA-Q as well.

Just to add something else to the mix, there's another important factor
as well as the NUMA ratio - the size of the interconnect cache vs the
size of the task migrated. The interconnect cache on the NUMA-Q is 32Mb,
our newer machine has a much lower NUMA ratio, but effectively a much
smaller cache as well. NUMA ratios are often expressed in terms of
latency, but there's a bandwidth consideration too. Hyperthreading will
want something different again.

I think we definitely need to tune this on a per-arch basis. There's no
way that one-size-fits-all is going to fit a situation as complex as this
(though we can definitely learn from each other's analysis).

M.

2003-01-20 16:45:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2


On Mon, 20 Jan 2003, Martin J. Bligh wrote:

> I think we definitely need to tune this on a per-arch basis. There's no
> way that one-size-fits-all is going to fit a situation as complex as
> this (though we can definitely learn from each other's analysis).

agreed - although the tunable should be constant (if possible, or
boot-established but not /proc exported), and there should be as few
tunables as possible. We already tune some of our scheduler behavior to
the SMP cachesize. (the cache_decay_ticks SMP logic.)

Ingo

2003-01-20 16:42:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2


On Mon, 20 Jan 2003, Erich Focht wrote:

> Could you please explain your idea? As far as I understand, the SMP
> balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
> mean something different...

yes, but eg. in the idle-rebalance case we are more aggressive at moving
tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
to do this 'aggressive' balancing even if not triggered from the
CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
more aggressive.

ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
level. (lets hope we dont have to deal with a third caching level anytime
soon - although that could as well happen once SMT CPUs start doing NUMA.)
There's no real reason to do balancing in a different way on each level -
the weight might be different, but the core logic should be synced up.
(one thing that is indeed different for the NUMA step is locality of
uncached memory.)

> > kernelbench is the kind of benchmark that is most sensitive to over-eager
> > global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> > best results, it clearly shows it's not over-eager. hackbench is one that
> > is quite sensitive to under-balancing. Ie. trying to maximize both will
> > lead us to a good balance.
>
> Yes! Actually the currently implemented nr_balanced logic is pretty
> dumb: the counter reaches the cross-node balance threshold after a
> certain number of calls to intra-node lb, no matter whether these were
> successful or not. I'd like to try incrementing the counter only on
> unsuccessful load balances; this would give a clear priority to
> intra-node balancing and a clear and controllable delay for cross-node
> balancing. A tiny patch for this (for 2.5.59) is attached. As the name
> nr_balanced would be misleading for this kind of usage, I renamed it to
> nr_lb_failed.

indeed this approach makes much more sense than the simple ->nr_balanced
counter. A similar approach makes sense on the SMP level as well: if the
current 'busy' rebalancer fails to get a new task, we can try the current
'idle' rebalancer. Ie. a CPU going idle would do the less intrusive
rebalancing first.

have you experimented with making the counter limit == 1 actually? Ie.
immediately trying to do a global balancing once the less intrusive
balancing fails?

Ingo



2003-01-20 16:56:08

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> yes, but eg. in the idle-rebalance case we are more aggressive at moving
> tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
> to do this 'aggressive' balancing even if not triggered from the
> CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
> more aggressive.

Do you think it's worth looking at the initial load-balance code for
standard SMP?

> ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
> level. (lets hope we dont have to deal with a third caching level anytime
> soon - although that could as well happen once SMT CPUs start doing NUMA.)

We have those already (IBM x440) ;-) That's one of the reasons why I prefer
the pools concept I posted at the weekend over just "nodes". Also, there
are NUMA machines where nodes are not all equidistant ... that can be
thought of as multi-level too.

> There's no real reason to do balancing in a different way on each level -
> the weight might be different, but the core logic should be synced up.
> (one thing that is indeed different for the NUMA step is locality of
> uncached memory.)

Right, the current model should work fine, it just needs generalising out
a bit.

M

2003-01-20 16:49:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2


On Mon, 20 Jan 2003, Ingo Molnar wrote:

> ie. SMP is just the first level in the cache-hierarchy, NUMA is the
> second level. (lets hope we dont have to deal with a third caching level
> anytime soon - although that could as well happen once SMT CPUs start
> doing NUMA.)

or as Arjan points out, like the IBM x440 boxes ...

i think we want to handle SMT on a different level, ie. via the
shared-runqueue approach, so it's not a genuine new level of caching, it's
rather a new concept of multiple logical cores per physical core. (which
needs is own code in the scheduler.)

Ingo

2003-01-20 17:02:07

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> or as Arjan points out, like the IBM x440 boxes ...

;-)

> i think we want to handle SMT on a different level, ie. via the
> shared-runqueue approach, so it's not a genuine new level of caching, it's
> rather a new concept of multiple logical cores per physical core. (which
> needs its own code in the scheduler.)

Do you have that code working already (presumably needs locking changes)?
I seem to recall something like that existing already, but I don't recall
if it was ever fully working or not ...

I think the large PPC64 boxes have multilevel NUMA as well - two real
phys cores on one die, sharing some cache (L2 but not L1? Anton?).
And SGI have multilevel nodes too I think ... so we'll still need
multilevel NUMA at some point ... but maybe not right now.

M.


2003-01-20 17:11:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2


On Mon, 20 Jan 2003, Martin J. Bligh wrote:

> Do you have that code working already (presumably needs locking
> changes)? I seem to recall something like that existing already, but I
> don't recall if it was ever fully working or not ...

yes, i have a HT testbox and working code:

http://lwn.net/Articles/8553/

the patch is rather old, i'll update it to 2.5.59.

> I think the large PPC64 boxes have multilevel NUMA as well - two real
> phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
> SGI have multilevel nodes too I think ... so we'll still need multilevel
> NUMA at some point ... but maybe not right now.

Intel's HT is the cleanest case: pure logical cores, which clearly need
special handling. Whether the other SMT solutions want to be handled via
the logical-cores code or via another level of NUMA-balancing code,
depends on benchmarking results i suspect. It will be one more flexibility
that system maintainers will have, it's all set up via the
sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
runqueue into another CPU's runqueue. It's basically the 0th level of
balancing, which will be fundamentally different. The other levels of
balancing are (or should be) similar to each other - only differing in
weight of balancing, not differing in the actual algorithm.

Ingo

2003-01-20 19:08:55

by Andrew Theurer

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

> > I think the large PPC64 boxes have multilevel NUMA as well - two real
> > phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
> > SGI have multilevel nodes too I think ... so we'll still need multilevel
> > NUMA at some point ... but maybe not right now.
>
> Intel's HT is the cleanest case: pure logical cores, which clearly need
> special handling. Whether the other SMT solutions want to be handled via
> the logical-cores code or via another level of NUMA-balancing code,
> depends on benchmarking results i suspect. It will be one more flexibility
> that system maintainers will have, it's all set up via the
> sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
> runqueue into another CPU's runqueue. It's basically the 0th level of
> balancing, which will be fundamentally different. The other levels of
> balancing are (or should be) similar to each other - only differing in
> weight of balancing, not differing in the actual algorithm.

I have included a very rough patch to do ht-numa topology. It requires
manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses num_cpunodes
instead of numnodes and defines MAX_NUM_NODES to 8 if CONFIG_NUMA is defined.

I had to remove the first check in sched_best_cpu() to get decent low load
performance out of this. I am still sorting through some things, but I
thought it would be best if I just posted what I have now.

-Andrew Theurer


Attachments:
patch-htnuma-topology (5.30 kB)

2003-01-20 19:32:30

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

>> > I think the large PPC64 boxes have multilevel NUMA as well - two real
>> > phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
>> > SGI have multilevel nodes too I think ... so we'll still need multilevel
>> > NUMA at some point ... but maybe not right now.
>>
>> Intel's HT is the cleanest case: pure logical cores, which clearly need
>> special handling. Whether the other SMT solutions want to be handled via
>> the logical-cores code or via another level of NUMA-balancing code,
>> depends on benchmarking results i suspect. It will be one more flexibility
>> that system maintainers will have, it's all set up via the
>> sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
>> runqueue into another CPU's runqueue. It's basically the 0th level of
>> balancing, which will be fundamentally different. The other levels of
>> balancing are (or should be) similar to each other - only differing in
>> weight of balancing, not differing in the actual algorithm.
>
> I have included a very rough patch to do ht-numa topology. It requires
> manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses num_cpunodes
> instead of numnodes and defines MAX_NUM_NODES to 8 if CONFIG_NUMA is defined.

Whilst it's fine for benchmarking, I think this kind of overlap is a
very bad idea long-term - the confusion introduced is just asking for
trouble. And think what's going to happen when you mix HT and NUMA.
If we're going to use this for HT, it needs abstracting out.

M.

2003-01-20 19:52:27

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

>> > I have included a very rough patch to do ht-numa topology. It requires
>> > manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses
>> > num_cpunodes instead of numnodes and defines MAX_NUM_NODES to 8 if
>> > CONFIG_NUMA is defined.
>>
>> Whilst it's fine for benchmarking, I think this kind of overlap is a
>> very bad idea long-term - the confusion introduced is just asking for
>> trouble. And think what's going to happen when you mix HT and NUMA.
>> If we're going to use this for HT, it needs abstracting out.
>
> I have no issues with using HT specific bits instead of NUMA. Design wise it
> would be nice if it could all be happy together, but if not, then so be it.

That's not what I meant - we can share the code, we just need to abstract
it out so you don't have to turn on CONFIG_NUMA. That was the point of
the pooling patch I posted at the weekend. Anyway, let's decide on the
best approach first, we can clean up the code for merging later.

M.

2003-01-20 19:50:05

by Andrew Theurer

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2


> > I have included a very rough patch to do ht-numa topology. It requires
> > manually defining CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses
> > num_cpunodes instead of numnodes and defines MAX_NUM_NODES to 8 if
> > CONFIG_NUMA is defined.
>
> Whilst it's fine for benchmarking, I think this kind of overlap is a
> very bad idea long-term - the confusion introduced is just asking for
> trouble. And think what's going to happen when you mix HT and NUMA.
> If we're going to use this for HT, it needs abstracting out.

I have no issues with using HT specific bits instead of NUMA. Design wise it
would be nice if it could all be happy together, but if not, then so be it.

-Andrew Theurer

2003-01-20 21:04:33

by Ingo Molnar

[permalink] [raw]
Subject: [patch] HT scheduler, sched-2.5.59-D7


the attached patch (against 2.5.59) is my current scheduler tree, it
includes two main areas of changes:

- interactivity improvements, mostly reworked bits from Andrea's tree and
various tunings.

- HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
passive load balancing, active-balancing, HT-aware task pickup,
HT-aware affinity and HT-aware wakeup.

the sched-2.5.59-D7 patch can also be downloaded from:

http://redhat.com/~mingo/O(1)-scheduler/

the NUMA code compiles, but the boot code needs to be updated to make use
of sched_map_runqueue(). Usage is very simple, the following call:

sched_map_runqueue(0, 1);

merges CPU#1's runqueue into CPU#0's runqueue, so both CPU#0 and CPU#1
map to runqueue#0 - runqueue#1 is inactive from that point on. The NUMA
boot code needs to interpret the topology and merge CPUs that are on the
same physical core into one runqueue. The scheduler code will take care
of all the rest.

there's a 'test_ht' boot option available on x86 that can be used to
trigger a shared runqueue for the first two CPUs, for those who have no
genuine HT boxes but want to give it a go.

(it complicates things that the interactivity changes to the scheduler are
included here as well, but this was my base tree and i didn't want to go
back.)

Ingo

--- linux/arch/i386/kernel/cpu/proc.c.orig 2003-01-20 22:25:23.000000000 +0100
+++ linux/arch/i386/kernel/cpu/proc.c 2003-01-20 22:58:35.000000000 +0100
@@ -1,4 +1,5 @@
#include <linux/smp.h>
+#include <linux/sched.h>
#include <linux/timex.h>
#include <linux/string.h>
#include <asm/semaphore.h>
@@ -101,6 +102,13 @@
fpu_exception ? "yes" : "no",
c->cpuid_level,
c->wp_works_ok ? "yes" : "no");
+#if CONFIG_SHARE_RUNQUEUE
+{
+ extern long __rq_idx[NR_CPUS];
+
+ seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]);
+}
+#endif

for ( i = 0 ; i < 32*NCAPINTS ; i++ )
if ( test_bit(i, c->x86_capability) &&
--- linux/arch/i386/kernel/smpboot.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2003-01-20 22:58:35.000000000 +0100
@@ -38,6 +38,7 @@
#include <linux/kernel.h>

#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/kernel_stat.h>
#include <linux/smp_lock.h>
#include <linux/irq.h>
@@ -945,6 +946,16 @@

int cpu_sibling_map[NR_CPUS] __cacheline_aligned;

+static int test_ht;
+
+static int __init ht_setup(char *str)
+{
+ test_ht = 1;
+ return 1;
+}
+
+__setup("test_ht", ht_setup);
+
static void __init smp_boot_cpus(unsigned int max_cpus)
{
int apicid, cpu, bit;
@@ -1087,16 +1098,31 @@
Dprintk("Boot done.\n");

/*
- * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
+ * Here we can be sure that there is an IO-APIC in the system. Let's
+ * go and set it up:
+ */
+ smpboot_setup_io_apic();
+
+ setup_boot_APIC_clock();
+
+ /*
+ * Synchronize the TSC with the AP
+ */
+ if (cpu_has_tsc && cpucount)
+ synchronize_tsc_bp();
+ /*
+ * If Hyper-Threading is available, construct cpu_sibling_map[], so
* that we can tell the sibling CPU efficiently.
*/
+printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus());
if (cpu_has_ht && smp_num_siblings > 1) {
for (cpu = 0; cpu < NR_CPUS; cpu++)
cpu_sibling_map[cpu] = NO_PROC_ID;

for (cpu = 0; cpu < NR_CPUS; cpu++) {
int i;
- if (!test_bit(cpu, &cpu_callout_map)) continue;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;

for (i = 0; i < NR_CPUS; i++) {
if (i == cpu || !test_bit(i, &cpu_callout_map))
@@ -1112,17 +1138,41 @@
printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
}
}
- }
-
- smpboot_setup_io_apic();
-
- setup_boot_APIC_clock();
+#if CONFIG_SHARE_RUNQUEUE
+ /*
+ * At this point APs would have synchronised TSC and
+ * waiting for smp_commenced, with their APIC timer
+ * disabled. So BP can go ahead do some initialization
+ * for Hyper-Threading (if needed).
+ */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ int i;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (i <= cpu)
+ continue;
+ if (!test_bit(i, &cpu_callout_map))
+ continue;

- /*
- * Synchronize the TSC with the AP
- */
- if (cpu_has_tsc && cpucount)
- synchronize_tsc_bp();
+ if (phys_proc_id[cpu] != phys_proc_id[i])
+ continue;
+ /*
+ * merge runqueues, resulting in one
+ * runqueue per package:
+ */
+ sched_map_runqueue(cpu, i);
+ break;
+ }
+ }
+#endif
+ }
+#if CONFIG_SHARE_RUNQUEUE
+ if (smp_num_siblings == 1 && test_ht) {
+ printk("Simulating a 2-sibling 1-phys-CPU HT setup!\n");
+ sched_map_runqueue(0, 1);
+ }
+#endif
}

/* These are wrappers to interface to the new boot process. Someone
--- linux/arch/i386/Kconfig.orig 2003-01-20 20:19:23.000000000 +0100
+++ linux/arch/i386/Kconfig 2003-01-20 22:58:35.000000000 +0100
@@ -408,6 +408,24 @@

If you don't know what to do here, say N.

+choice
+
+ prompt "Hyperthreading Support"
+ depends on SMP
+ default NR_SIBLINGS_0
+
+config NR_SIBLINGS_0
+ bool "off"
+
+config NR_SIBLINGS_2
+ bool "2 siblings"
+
+config NR_SIBLINGS_4
+ bool "4 siblings"
+
+endchoice
+
+
config PREEMPT
bool "Preemptible Kernel"
help
--- linux/fs/pipe.c.orig 2003-01-20 19:28:43.000000000 +0100
+++ linux/fs/pipe.c 2003-01-20 22:58:35.000000000 +0100
@@ -117,7 +117,7 @@
up(PIPE_SEM(*inode));
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
if (ret > 0)
@@ -205,7 +205,7 @@
}
up(PIPE_SEM(*inode));
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
}
if (ret > 0) {
@@ -275,7 +275,7 @@
free_page((unsigned long) info->base);
kfree(info);
} else {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
--- linux/include/linux/sched.h.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/include/linux/sched.h 2003-01-20 22:58:35.000000000 +0100
@@ -147,6 +147,24 @@
extern void sched_init(void);
extern void init_idle(task_t *idle, int cpu);

+/*
+ * Is there a way to do this via Kconfig?
+ */
+#define CONFIG_NR_SIBLINGS 0
+
+#if CONFIG_NR_SIBLINGS_2
+# define CONFIG_NR_SIBLINGS 2
+#elif CONFIG_NR_SIBLINGS_4
+# define CONFIG_NR_SIBLINGS 4
+#endif
+
+#if CONFIG_NR_SIBLINGS
+# define CONFIG_SHARE_RUNQUEUE 1
+#else
+# define CONFIG_SHARE_RUNQUEUE 0
+#endif
+extern void sched_map_runqueue(int cpu1, int cpu2);
+
extern void show_state(void);
extern void show_trace(unsigned long *stack);
extern void show_stack(unsigned long *stack);
@@ -293,7 +311,7 @@
prio_array_t *array;

unsigned long sleep_avg;
- unsigned long sleep_timestamp;
+ unsigned long last_run;

unsigned long policy;
unsigned long cpus_allowed;
@@ -605,6 +623,8 @@
#define remove_parent(p) list_del_init(&(p)->sibling)
#define add_parent(p, parent) list_add_tail(&(p)->sibling,&(parent)->children)

+#if 1
+
#define REMOVE_LINKS(p) do { \
if (thread_group_leader(p)) \
list_del_init(&(p)->tasks); \
@@ -633,6 +653,31 @@
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)

+#else
+
+#define REMOVE_LINKS(p) do { \
+ list_del_init(&(p)->tasks); \
+ remove_parent(p); \
+ } while (0)
+
+#define SET_LINKS(p) do { \
+ list_add_tail(&(p)->tasks,&init_task.tasks); \
+ add_parent(p, (p)->parent); \
+ } while (0)
+
+#define next_task(p) list_entry((p)->tasks.next, struct task_struct, tasks)
+#define prev_task(p) list_entry((p)->tasks.prev, struct task_struct, tasks)
+
+#define for_each_process(p) \
+ for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+
+#define do_each_thread(g, t) \
+ for (t = &init_task ; (t = next_task(t)) != &init_task ; )
+
+#define while_each_thread(g, t)
+
+#endif
+
extern task_t * FASTCALL(next_thread(task_t *p));

#define thread_group_leader(p) (p->pid == p->tgid)
--- linux/include/asm-i386/apic.h.orig 2003-01-20 19:28:31.000000000 +0100
+++ linux/include/asm-i386/apic.h 2003-01-20 22:58:35.000000000 +0100
@@ -98,4 +98,6 @@

#endif /* CONFIG_X86_LOCAL_APIC */

+extern int phys_proc_id[NR_CPUS];
+
#endif /* __ASM_APIC_H */
--- linux/kernel/fork.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/kernel/fork.c 2003-01-20 22:58:35.000000000 +0100
@@ -876,7 +876,7 @@
*/
p->first_time_slice = 1;
current->time_slice >>= 1;
- p->sleep_timestamp = jiffies;
+ p->last_run = jiffies;
if (!current->time_slice) {
/*
* This case is rare, it happens when the parent has only
--- linux/kernel/sys.c.orig 2003-01-20 19:28:52.000000000 +0100
+++ linux/kernel/sys.c 2003-01-20 22:58:36.000000000 +0100
@@ -220,7 +220,7 @@

if (error == -ESRCH)
error = 0;
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+ if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))
error = -EACCES;
else
set_user_nice(p, niceval);
--- linux/kernel/sched.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/kernel/sched.c 2003-01-20 22:58:36.000000000 +0100
@@ -54,20 +54,21 @@
/*
* These are the 'tuning knobs' of the scheduler:
*
- * Minimum timeslice is 10 msecs, default timeslice is 150 msecs,
- * maximum timeslice is 300 msecs. Timeslices get refilled after
+ * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
+ * maximum timeslice is 200 msecs. Timeslices get refilled after
* they expire.
*/
#define MIN_TIMESLICE ( 10 * HZ / 1000)
-#define MAX_TIMESLICE (300 * HZ / 1000)
-#define CHILD_PENALTY 95
+#define MAX_TIMESLICE (200 * HZ / 1000)
+#define CHILD_PENALTY 50
#define PARENT_PENALTY 100
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (2*HZ)
-#define STARVATION_LIMIT (2*HZ)
-#define NODE_THRESHOLD 125
+#define MAX_SLEEP_AVG (10*HZ)
+#define STARVATION_LIMIT (30*HZ)
+#define SYNC_WAKEUPS 1
+#define SMART_WAKE_CHILD 1

/*
* If a task is 'interactive' then we reinsert it in the active
@@ -141,6 +142,48 @@
};

/*
+ * It's possible for two CPUs to share the same runqueue.
+ * This makes sense if they eg. share caches.
+ *
+ * We take the common 1:1 (SMP, UP) case and optimize it,
+ * the rest goes via remapping: rq_idx(cpu) gives the
+ * runqueue on which a particular cpu is on, cpu_idx(cpu)
+ * gives the rq-specific index of the cpu.
+ *
+ * (Note that the generic scheduler code does not impose any
+ * restrictions on the mappings - there can be 4 CPUs per
+ * runqueue or even assymetric mappings.)
+ */
+#if CONFIG_SHARE_RUNQUEUE
+# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS
+ long __rq_idx[NR_CPUS] __cacheline_aligned;
+ static long __cpu_idx[NR_CPUS] __cacheline_aligned;
+# define rq_idx(cpu) (__rq_idx[(cpu)])
+# define cpu_idx(cpu) (__cpu_idx[(cpu)])
+# define for_each_sibling(idx, rq) \
+ for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++)
+# define rq_nr_cpus(rq) ((rq)->nr_cpus)
+# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance)
+#else
+# define MAX_NR_SIBLINGS 1
+# define rq_idx(cpu) (cpu)
+# define cpu_idx(cpu) 0
+# define for_each_sibling(idx, rq) while (0)
+# define cpu_active_balance(c) 0
+# define do_active_balance(rq, cpu) do { } while (0)
+# define rq_nr_cpus(rq) 1
+ static inline void active_load_balance(runqueue_t *rq, int this_cpu) { }
+#endif
+
+typedef struct cpu_s {
+ task_t *curr, *idle;
+ task_t *migration_thread;
+ struct list_head migration_queue;
+ int active_balance;
+ int cpu;
+} cpu_t;
+
+/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
@@ -151,7 +194,7 @@
spinlock_t lock;
unsigned long nr_running, nr_switches, expired_timestamp,
nr_uninterruptible;
- task_t *curr, *idle;
+ struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
#ifdef CONFIG_NUMA
@@ -159,27 +202,39 @@
unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
- task_t *migration_thread;
- struct list_head migration_queue;
+ int nr_cpus;
+ cpu_t cpu[MAX_NR_SIBLINGS];

atomic_t nr_iowait;
} ____cacheline_aligned;

static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;

-#define cpu_rq(cpu) (runqueues + (cpu))
+#define cpu_rq(cpu) (runqueues + (rq_idx(cpu)))
+#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c))
+#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr)
+#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle)
+
#define this_rq() cpu_rq(smp_processor_id())
#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+#define migration_thread(cpu) (cpu_int(cpu)->migration_thread)
+#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue)
+
+#if NR_CPUS > 1
+# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu)))
+#else
+# define task_allowed(p, cpu) 1
+#endif
+
/*
* Default context-switch locking:
*/
#ifndef prepare_arch_switch
# define prepare_arch_switch(rq, next) do { } while(0)
# define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p) ((rq)->curr == (p))
+# define task_running(p) (cpu_curr_ptr(task_cpu(p)) == (p))
#endif

#ifdef CONFIG_NUMA
@@ -322,16 +377,21 @@
* Also update all the scheduling statistics stuff. (sleep average
* calculation, priority modifiers, etc.)
*/
+static inline void __activate_task(task_t *p, runqueue_t *rq)
+{
+ enqueue_task(p, rq->active);
+ nr_running_inc(rq);
+}
+
static inline void activate_task(task_t *p, runqueue_t *rq)
{
- unsigned long sleep_time = jiffies - p->sleep_timestamp;
- prio_array_t *array = rq->active;
+ unsigned long sleep_time = jiffies - p->last_run;

if (!rt_task(p) && sleep_time) {
/*
* This code gives a bonus to interactive tasks. We update
* an 'average sleep time' value here, based on
- * sleep_timestamp. The more time a task spends sleeping,
+ * ->last_run. The more time a task spends sleeping,
* the higher the average gets - and the higher the priority
* boost gets as well.
*/
@@ -340,8 +400,7 @@
p->sleep_avg = MAX_SLEEP_AVG;
p->prio = effective_prio(p);
}
- enqueue_task(p, array);
- nr_running_inc(rq);
+ __activate_task(p, rq);
}

/*
@@ -382,6 +441,11 @@
#endif
}

+static inline void resched_cpu(int cpu)
+{
+ resched_task(cpu_curr_ptr(cpu));
+}
+
#ifdef CONFIG_SMP

/*
@@ -398,7 +462,7 @@
repeat:
preempt_disable();
rq = task_rq(p);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
cpu_relax();
/*
* enable/disable preemption just to make this
@@ -409,7 +473,7 @@
goto repeat;
}
rq = task_rq_lock(p, &flags);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
task_rq_unlock(rq, &flags);
preempt_enable();
goto repeat;
@@ -431,10 +495,39 @@
*/
void kick_if_running(task_t * p)
{
- if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
+ if ((task_running(p)) && (task_cpu(p) != smp_processor_id()))
resched_task(p);
}

+static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p)
+{
+ cpu_t *curr_cpu;
+ task_t *curr;
+ int idx;
+
+ if (idle_cpu(cpu))
+ return resched_cpu(cpu);
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ if (curr_cpu->idle == curr_cpu->curr)
+ return resched_cpu(curr_cpu->cpu);
+ }
+
+ if (p->prio < cpu_curr_ptr(cpu)->prio)
+ return resched_task(cpu_curr_ptr(cpu));
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ curr = curr_cpu->curr;
+ if (p->prio < curr->prio)
+ return resched_task(curr);
+ }
+}
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -455,6 +548,7 @@
long old_state;
runqueue_t *rq;

+ sync &= SYNC_WAKEUPS;
repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
@@ -463,7 +557,7 @@
* Fast-migrate the task if it's not running or runnable
* currently. Do not violate hard affinity.
*/
- if (unlikely(sync && !task_running(rq, p) &&
+ if (unlikely(sync && !task_running(p) &&
(task_cpu(p) != smp_processor_id()) &&
(p->cpus_allowed & (1UL << smp_processor_id())))) {

@@ -473,10 +567,12 @@
}
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
- activate_task(p, rq);
-
- if (p->prio < rq->curr->prio)
- resched_task(rq->curr);
+ if (sync)
+ __activate_task(p, rq);
+ else {
+ activate_task(p, rq);
+ wake_up_cpu(rq, task_cpu(p), p);
+ }
success = 1;
}
p->state = TASK_RUNNING;
@@ -512,8 +608,19 @@
p->prio = effective_prio(p);
}
set_task_cpu(p, smp_processor_id());
- activate_task(p, rq);

+ if (SMART_WAKE_CHILD) {
+ if (unlikely(!current->array))
+ __activate_task(p, rq);
+ else {
+ p->prio = current->prio;
+ list_add_tail(&p->run_list, &current->run_list);
+ p->array = current->array;
+ p->array->nr_active++;
+ nr_running_inc(rq);
+ }
+ } else
+ activate_task(p, rq);
rq_unlock(rq);
}

@@ -561,7 +668,7 @@
* context_switch - switch to the new MM and the new
* thread's register state.
*/
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;
@@ -575,7 +682,7 @@

if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
- mmdrop(oldmm);
+ rq->prev_mm = oldmm;
}

/* Here we just switch the register state and the stack. */
@@ -596,8 +703,9 @@
unsigned long i, sum = 0;

for (i = 0; i < NR_CPUS; i++)
- sum += cpu_rq(i)->nr_running;
-
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_running;
return sum;
}

@@ -608,7 +716,9 @@
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
- sum += cpu_rq(i)->nr_uninterruptible;
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_uninterruptible;
}
return sum;
}
@@ -790,7 +900,23 @@

#endif /* CONFIG_NUMA */

-#if CONFIG_SMP
+/*
+ * One of the idle_cpu_tick() and busy_cpu_tick() functions will
+ * get called every timer tick, on every CPU. Our balancing action
+ * frequency and balancing agressivity depends on whether the CPU is
+ * idle or not.
+ *
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * systems with HZ=100, every 10 msecs.)
+ */
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+
+#if !CONFIG_SMP
+
+static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { }
+
+#else

/*
* double_lock_balance - lock the busiest runqueue
@@ -906,12 +1032,7 @@
set_task_cpu(p, this_cpu);
nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
- /*
- * Note that idle threads have a prio of MAX_PRIO, for this test
- * to be always true for them.
- */
- if (p->prio < this_rq->curr->prio)
- set_need_resched();
+ wake_up_cpu(this_rq, this_cpu, p);
}

/*
@@ -922,9 +1043,9 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int this_cpu, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, idx;
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
@@ -972,12 +1093,14 @@
* 1) running (obviously), or
* 2) cannot be migrated to this CPU due to cpus_allowed, or
* 3) are cache-hot on their current CPU.
+ *
+ * (except if we are in idle mode which is a more agressive
+ * form of rebalancing.)
*/

-#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
- ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
- !task_running(rq, p) && \
- ((p)->cpus_allowed & (1UL << (this_cpu))))
+#define CAN_MIGRATE_TASK(p,rq,cpu) \
+ ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
+ !task_running(p) && task_allowed(p, cpu))

curr = curr->prev;

@@ -1000,31 +1123,136 @@
;
}

+#if CONFIG_SHARE_RUNQUEUE
+static void active_load_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+ /*
+ * Any SMT-specific imbalance?
+ */
+ for_each_sibling(idx, rq)
+ if (rq->cpu[idx].idle == rq->cpu[idx].curr)
+ goto next_cpu;
+
+ /*
+ * At this point it's sure that we have a SMT
+ * imbalance: this (physical) CPU is idle but
+ * another CPU has two tasks running.
+ *
+ * We wake up one of the migration threads (it
+ * doesnt matter which one) and let it fix things up:
+ */
+ if (!cpu_active_balance(this_cpu)) {
+ cpu_active_balance(this_cpu) = 1;
+ spin_unlock(&this_rq->lock);
+ wake_up_process(rq->cpu[0].migration_thread);
+ spin_lock(&this_rq->lock);
+ }
+next_cpu:
+ }
+}
+
+static void do_active_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ spin_unlock(&this_rq->lock);
+
+ cpu_active_balance(this_cpu) = 0;
+
+ /*
+ * Is the imbalance still present?
+ */
+ for_each_sibling(idx, this_rq)
+ if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr)
+ goto out;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+
+ /* completely idle CPU? */
+ if (rq->nr_running)
+ continue;
+
+ /*
+ * At this point it's reasonably sure that we have an
+ * imbalance. Since we are the migration thread, try to
+ * balance a thread over to the target queue.
+ */
+ spin_lock(&rq->lock);
+ load_balance(rq, i, 1);
+ spin_unlock(&rq->lock);
+ goto out;
+ }
+out:
+ spin_lock(&this_rq->lock);
+}
+
/*
- * One of the idle_cpu_tick() and busy_cpu_tick() functions will
- * get called every timer tick, on every CPU. Our balancing action
- * frequency and balancing agressivity depends on whether the CPU is
- * idle or not.
+ * This routine is called to map a CPU into another CPU's runqueue.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
- * systems with HZ=100, every 10 msecs.)
+ * This must be called during bootup with the merged runqueue having
+ * no tasks.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
-#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+void sched_map_runqueue(int cpu1, int cpu2)
+{
+ runqueue_t *rq1 = cpu_rq(cpu1);
+ runqueue_t *rq2 = cpu_rq(cpu2);
+ int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx;
+
+ printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id());
+ if (rq1 == rq2)
+ BUG();
+ if (rq2->nr_running)
+ BUG();
+ /*
+ * At this point, we dont have anything in the runqueue yet. So,
+ * there is no need to move processes between the runqueues.
+ * Only, the idle processes should be combined and accessed
+ * properly.
+ */
+ cpu2_idx = rq1->nr_cpus++;

-static inline void idle_tick(runqueue_t *rq)
+ if (rq_idx(cpu1) != cpu1)
+ BUG();
+ rq_idx(cpu2) = cpu1;
+ cpu_idx(cpu2) = cpu2_idx;
+ rq1->cpu[cpu2_idx].cpu = cpu2;
+ rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle;
+ rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr;
+ INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue);
+
+ /* just to be safe: */
+ rq2->cpu[cpu2_idx_orig].idle = NULL;
+ rq2->cpu[cpu2_idx_orig].curr = NULL;
+}
+#endif
+#endif
+
+DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
+
+static inline void idle_tick(runqueue_t *rq, unsigned long j)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ if (j % IDLE_REBALANCE_TICK)
return;
spin_lock(&rq->lock);
- load_balance(rq, 1);
+ load_balance(rq, smp_processor_id(), 1);
spin_unlock(&rq->lock);
}

-#endif
-
-DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
-
/*
* We place interactive tasks back into the active array, if possible.
*
@@ -1035,9 +1263,9 @@
* increasing number of running tasks:
*/
#define EXPIRED_STARVING(rq) \
- ((rq)->expired_timestamp && \
+ (STARVATION_LIMIT && ((rq)->expired_timestamp && \
(jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1))
+ STARVATION_LIMIT * ((rq)->nr_running) + 1)))

/*
* This function gets called by the timer code, with HZ frequency.
@@ -1050,12 +1278,13 @@
{
int cpu = smp_processor_id();
runqueue_t *rq = this_rq();
+ unsigned long j = jiffies;
task_t *p = current;

if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);

- if (p == rq->idle) {
+ if (p == cpu_idle_ptr(cpu)) {
/* note: this timer irq context must be accounted for as well */
if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1063,9 +1292,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ idle_tick(rq, j);
return;
}
if (TASK_NICE(p) > 0)
@@ -1074,12 +1301,13 @@
kstat_cpu(cpu).cpustat.user += user_ticks;
kstat_cpu(cpu).cpustat.system += sys_ticks;

+ spin_lock(&rq->lock);
/* Task might have expired already, but not scheduled off yet */
if (p->array != rq->active) {
set_tsk_need_resched(p);
+ spin_unlock(&rq->lock);
return;
}
- spin_lock(&rq->lock);
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
@@ -1115,16 +1343,14 @@

if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
+ rq->expired_timestamp = j;
enqueue_task(p, rq->expired);
} else
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
+ if (!(j % BUSY_REBALANCE_TICK))
+ load_balance(rq, smp_processor_id(), 0);
spin_unlock(&rq->lock);
}

@@ -1135,11 +1361,11 @@
*/
asmlinkage void schedule(void)
{
+ int idx, this_cpu, retry = 0;
+ struct list_head *queue;
task_t *prev, *next;
- runqueue_t *rq;
prio_array_t *array;
- struct list_head *queue;
- int idx;
+ runqueue_t *rq;

/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -1152,15 +1378,15 @@
dump_stack();
}
}
-
- check_highmem_ptes();
need_resched:
+ check_highmem_ptes();
+ this_cpu = smp_processor_id();
preempt_disable();
prev = current;
rq = this_rq();

release_kernel_lock(prev);
- prev->sleep_timestamp = jiffies;
+ prev->last_run = jiffies;
spin_lock_irq(&rq->lock);

/*
@@ -1183,12 +1409,14 @@
}
pick_next_task:
if (unlikely(!rq->nr_running)) {
-#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, this_cpu, 1);
if (rq->nr_running)
goto pick_next_task;
-#endif
- next = rq->idle;
+ active_load_balance(rq, this_cpu);
+ if (rq->nr_running)
+ goto pick_next_task;
+pick_idle:
+ next = cpu_idle_ptr(this_cpu);
rq->expired_timestamp = 0;
goto switch_tasks;
}
@@ -1204,24 +1432,59 @@
rq->expired_timestamp = 0;
}

+new_array:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
+ if ((next != prev) && (rq_nr_cpus(rq) > 1)) {
+ struct list_head *tmp = queue->next;
+
+ while (task_running(next) || !task_allowed(next, this_cpu)) {
+ tmp = tmp->next;
+ if (tmp != queue) {
+ next = list_entry(tmp, task_t, run_list);
+ continue;
+ }
+ idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+ if (idx == MAX_PRIO) {
+ if (retry || !rq->expired->nr_active)
+ goto pick_idle;
+ /*
+ * To avoid infinite changing of arrays,
+ * when we have only tasks runnable by
+ * sibling.
+ */
+ retry = 1;
+
+ array = rq->expired;
+ goto new_array;
+ }
+ queue = array->queue + idx;
+ tmp = queue->next;
+ next = list_entry(tmp, task_t, run_list);
+ }
+ }

switch_tasks:
prefetch(next);
clear_tsk_need_resched(prev);
- RCU_qsctr(prev->thread_info->cpu)++;
+ RCU_qsctr(task_cpu(prev))++;

if (likely(prev != next)) {
+ struct mm_struct *prev_mm;
rq->nr_switches++;
- rq->curr = next;
+ cpu_curr_ptr(this_cpu) = next;
+ set_task_cpu(next, this_cpu);

prepare_arch_switch(rq, next);
- prev = context_switch(prev, next);
+ prev = context_switch(rq, prev, next);
barrier();
rq = this_rq();
+ prev_mm = rq->prev_mm;
+ rq->prev_mm = NULL;
finish_arch_switch(rq, prev);
+ if (prev_mm)
+ mmdrop(prev_mm);
} else
spin_unlock_irq(&rq->lock);

@@ -1481,9 +1744,8 @@
* If the task is running and lowered its priority,
* or increased its priority then reschedule its CPU:
*/
- if ((NICE_TO_PRIO(nice) < p->static_prio) ||
- task_running(rq, p))
- resched_task(rq->curr);
+ if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p))
+ resched_task(cpu_curr_ptr(task_cpu(p)));
}
out_unlock:
task_rq_unlock(rq, &flags);
@@ -1561,7 +1823,7 @@
*/
int task_curr(task_t *p)
{
- return cpu_curr(task_cpu(p)) == p;
+ return cpu_curr_ptr(task_cpu(p)) == p;
}

/**
@@ -1570,7 +1832,7 @@
*/
int idle_cpu(int cpu)
{
- return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+ return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu);
}

/**
@@ -1660,7 +1922,7 @@
else
p->prio = p->static_prio;
if (array)
- activate_task(p, task_rq(p));
+ __activate_task(p, task_rq(p));

out_unlock:
task_rq_unlock(rq, &flags);
@@ -2135,7 +2397,7 @@
local_irq_save(flags);
double_rq_lock(idle_rq, rq);

- idle_rq->curr = idle_rq->idle = idle;
+ cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle;
deactivate_task(idle, rq);
idle->array = NULL;
idle->prio = MAX_PRIO;
@@ -2190,6 +2452,7 @@
unsigned long flags;
migration_req_t req;
runqueue_t *rq;
+ int cpu;

#if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */
new_mask &= cpu_online_map;
@@ -2211,31 +2474,31 @@
* If the task is not on a runqueue (and not running), then
* it is sufficient to simply update the task's cpu field.
*/
- if (!p->array && !task_running(rq, p)) {
+ if (!p->array && !task_running(p)) {
set_task_cpu(p, __ffs(p->cpus_allowed));
task_rq_unlock(rq, &flags);
return;
}
init_completion(&req.done);
req.task = p;
- list_add(&req.list, &rq->migration_queue);
+ cpu = task_cpu(p);
+ list_add(&req.list, migration_queue(cpu));
task_rq_unlock(rq, &flags);
-
- wake_up_process(rq->migration_thread);
+ wake_up_process(migration_thread(cpu));

wait_for_completion(&req.done);
}

/*
- * migration_thread - this is a highprio system thread that performs
+ * migration_task - this is a highprio system thread that performs
* thread migration by 'pulling' threads into the target runqueue.
*/
-static int migration_thread(void * data)
+static int migration_task(void * data)
{
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
int cpu = (long) data;
runqueue_t *rq;
- int ret;
+ int ret, idx;

daemonize();
sigfillset(&current->blocked);
@@ -2250,7 +2513,8 @@
ret = setscheduler(0, SCHED_FIFO, &param);

rq = this_rq();
- rq->migration_thread = current;
+ migration_thread(cpu) = current;
+ idx = cpu_idx(cpu);

sprintf(current->comm, "migration/%d", smp_processor_id());

@@ -2263,7 +2527,9 @@
task_t *p;

spin_lock_irqsave(&rq->lock, flags);
- head = &rq->migration_queue;
+ if (cpu_active_balance(cpu))
+ do_active_balance(rq, cpu);
+ head = migration_queue(cpu);
current->state = TASK_INTERRUPTIBLE;
if (list_empty(head)) {
spin_unlock_irqrestore(&rq->lock, flags);
@@ -2292,9 +2558,8 @@
set_task_cpu(p, cpu_dest);
if (p->array) {
deactivate_task(p, rq_src);
- activate_task(p, rq_dest);
- if (p->prio < rq_dest->curr->prio)
- resched_task(rq_dest->curr);
+ __activate_task(p, rq_dest);
+ wake_up_cpu(rq_dest, cpu_dest, p);
}
}
double_rq_unlock(rq_src, rq_dest);
@@ -2312,12 +2577,13 @@
unsigned long action,
void *hcpu)
{
+ long cpu = (long) hcpu;
+
switch (action) {
case CPU_ONLINE:
- printk("Starting migration thread for cpu %li\n",
- (long)hcpu);
- kernel_thread(migration_thread, hcpu, CLONE_KERNEL);
- while (!cpu_rq((long)hcpu)->migration_thread)
+ printk("Starting migration thread for cpu %li\n", cpu);
+ kernel_thread(migration_task, hcpu, CLONE_KERNEL);
+ while (!migration_thread(cpu))
yield();
break;
}
@@ -2392,11 +2658,20 @@
for (i = 0; i < NR_CPUS; i++) {
prio_array_t *array;

+ /*
+ * Start with a 1:1 mapping between CPUs and runqueues:
+ */
+#if CONFIG_SHARE_RUNQUEUE
+ rq_idx(i) = i;
+ cpu_idx(i) = 0;
+#endif
rq = cpu_rq(i);
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- INIT_LIST_HEAD(&rq->migration_queue);
+ INIT_LIST_HEAD(migration_queue(i));
+ rq->nr_cpus = 1;
+ rq->cpu[cpu_idx(i)].cpu = i;
atomic_set(&rq->nr_iowait, 0);
nr_running_init(rq);

@@ -2414,9 +2689,13 @@
* We have to do a little magic to get the first
* thread right in SMP mode.
*/
- rq = this_rq();
- rq->curr = current;
- rq->idle = current;
+ cpu_curr_ptr(smp_processor_id()) = current;
+ cpu_idle_ptr(smp_processor_id()) = current;
+ printk("sched_init().\n");
+ printk("smp_processor_id(): %d.\n", smp_processor_id());
+ printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));
+ printk("this_rq(): %p.\n", this_rq());
+
set_task_cpu(current, smp_processor_id());
wake_up_process(current);

--- linux/init/main.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/init/main.c 2003-01-20 22:58:36.000000000 +0100
@@ -354,7 +354,14 @@

static void rest_init(void)
{
+ /*
+ * We count on the initial thread going ok
+ * Like idlers init is an unlocked kernel thread, which will
+ * make syscalls (and thus be locked).
+ */
+ init_idle(current, smp_processor_id());
kernel_thread(init, NULL, CLONE_KERNEL);
+
unlock_kernel();
cpu_idle();
}
@@ -438,13 +445,6 @@
check_bugs();
printk("POSIX conformance testing by UNIFIX\n");

- /*
- * We count on the initial thread going ok
- * Like idlers init is an unlocked kernel thread, which will
- * make syscalls (and thus be locked).
- */
- init_idle(current, smp_processor_id());
-
/* Do the rest non-__init'ed, we're now alive */
rest_init();
}



2003-01-20 21:50:39

by Manfred Spraul

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

Ingo wrote:

>--- linux/fs/pipe.c.orig 2003-01-20 19:28:43.000000000 +0100
>+++ linux/fs/pipe.c 2003-01-20 22:58:35.000000000 +0100
>@@ -117,7 +117,7 @@
> up(PIPE_SEM(*inode));
> /* Signal writers asynchronously that there is more room. */
> if (do_wakeup) {
>- wake_up_interruptible(PIPE_WAIT(*inode));
>+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
> kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
> }
> if (ret > 0)
>
What's the purpose of this change?
I thought that the _sync functions should be called if it's guaranteed
that schedule() will be called immediately, i.e. if the scheduler should
not rebalance.
You've added _sync() to the codepaths that lead to the end of the syscall.
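(For reference, the relevant branch in the patch's try_to_wake_up() is
excerpted below, with comments added here: a sync wakeup only queues the
task, skipping both wake_up_cpu()'s preemption/sibling logic and, since
it calls __activate_task() directly, the sleep-average boost.)

	if (sync)
		__activate_task(p, rq);			/* just queue it */
	else {
		activate_task(p, rq);			/* sleep-avg bonus */
		wake_up_cpu(rq, task_cpu(p), p);	/* may preempt / pick a sibling */
	}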

--
Manfred

2003-01-20 22:01:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7


On Mon, 20 Jan 2003, Manfred Spraul wrote:

> > if (do_wakeup) {
> >- wake_up_interruptible(PIPE_WAIT(*inode));
> >+ wake_up_interruptible_sync(PIPE_WAIT(*inode));

> What's the purpose of this change?

It was a quick experiment I forgot to remove from my sources, will fix it
in the next patch.

Ingo

2003-01-20 22:20:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

Ingo Molnar <[email protected]> wrote:
>
>
> the attached patch (against 2.5.59) is my current scheduler tree, it
> includes two main areas of changes:
>
> - interactivity improvements, mostly reworked bits from Andrea's tree and
> various tunings.
>

Thanks for doing this. Initial testing with one workload which is extremely
bad with 2.5.59: huge improvement.

The workload is:

workstation> ssh laptop
laptop> setenv DISPLAY workstation:0
laptop> make -j0 bzImage&
laptop> some-x-application &

For some reason, X-across-ethernet performs terribly when there's a kernel
compile on the client machine - lots of half-second lags.

All gone now.


wrt this:

if (SMART_WAKE_CHILD) {
if (unlikely(!current->array))
__activate_task(p, rq);
else {
p->prio = current->prio;
list_add_tail(&p->run_list, &current->run_list);
p->array = current->array;
p->array->nr_active++;
nr_running_inc(rq);
}

for some reason I decided that RT tasks need special handling here. I
forget why though ;) Please double-check that.
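One way that special case might look (an untested sketch layered on the
hunk quoted above, not from any posted patch) is to fall back to a plain
activation whenever parent or child is an RT task, since inheriting the
parent's dynamic prio and queue position only makes sense for non-RT
tasks; note that the rcf-simple patch below likewise has a separate
branch for rt_task() parents:

	if (SMART_WAKE_CHILD) {
		/* hypothetical rt_task() guard, added for illustration only */
		if (unlikely(!current->array || rt_task(current) || rt_task(p)))
			__activate_task(p, rq);
		else {
			p->prio = current->prio;
			list_add_tail(&p->run_list, &current->run_list);
			p->array = current->array;
			p->array->nr_active++;
			nr_running_inc(rq);
		}
	} else
		activate_task(p, rq);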


--- 25/kernel/sched.c~rcf-simple Thu Dec 26 02:34:11 2002
+++ 25-akpm/kernel/sched.c Thu Dec 26 02:34:40 2002
@@ -452,6 +452,8 @@ int wake_up_process(task_t * p)
void wake_up_forked_process(task_t * p)
{
runqueue_t *rq = this_rq_lock();
+ struct task_struct *this_task = current;
+ prio_array_t *array = this_task->array;

p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -467,6 +469,19 @@ void wake_up_forked_process(task_t * p)
set_task_cpu(p, smp_processor_id());
activate_task(p, rq);

+ /*
+ * Take caller off the head of the runqueue, so child will run first.
+ */
+ if (array) {
+ if (!rt_task(current)) {
+ dequeue_task(this_task, array);
+ enqueue_task(this_task, array);
+ } else {
+ list_del(&this_task->run_list);
+ list_add_tail(&this_task->run_list,
+ array->queue + this_task->prio);
+ }
+ }
rq_unlock(rq);
}



2003-01-21 01:00:43

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

On Mon, 2003-01-20 at 14:28, Andrew Morton wrote:
> Ingo Molnar <[email protected]> wrote:
> >
> >
> > the attached patch (against 2.5.59) is my current scheduler tree, it
> > includes two main areas of changes:
> >
> > - interactivity improvements, mostly reworked bits from Andrea's tree and
> > various tunings.
> >
>
> Thanks for doing this. Initial testing with one workload which is extremely
> bad with 2.5.59: huge improvement.
>
Initial testing on my NUMA box looks good. So far, I only have
kernbench numbers, but the system time shows a nice decrease,
and the CPU % busy time has gone up.

Kernbench:
Elapsed User System CPU
ingoD7-59 28.944s 285.008s 79.998s 1260.8%
stock59 29.668s 283.762s 82.286s 1233%

I'll try to get more testing in over the next few days.

There was one minor build error introduced by the patch:

#define NODE_THRESHOLD 125

was removed, but the NUMA code still uses it.
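(Until the patch is respun, the obvious band-aid is presumably just to
restore the constant, e.g.:)

	#define NODE_THRESHOLD	125	/* remote node needs >125% of local load */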

--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-21 17:34:32

by Erich Focht

[permalink] [raw]
Subject: Re: [patch] sched-2.5.59-A2

On Monday 20 January 2003 17:56, Ingo Molnar wrote:
> On Mon, 20 Jan 2003, Erich Focht wrote:
> > Could you please explain your idea? As far as I understand, the SMP
> > balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
> > mean something different...
>
> yes, but eg. in the idle-rebalance case we are more aggressive at moving
> tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
> to do this 'aggressive' balancing even if not triggered from the
> CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
> more aggressive.

Do you mean: make the SMP balancer more aggressive by lowering the
125% threshold?

> ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
> level. (lets hope we dont have to deal with a third caching level anytime
> soon - although that could as well happen once SMT CPUs start doing NUMA.)
> There's no real reason to do balancing in a different way on each level -
> the weight might be different, but the core logic should be synced up.
> (one thing that is indeed different for the NUMA step is locality of
> uncached memory.)

We have an IA64 2-level node hierarchy machine with 32 CPUs (NEC
TX7). In the "old" node-affine scheduler patch the multilevel feature
was implemented through different cross-node steal delays (longer if
the node is further away). In the current approach we could just add
another counter, such
that we call the cross-supernode balancer only if the intra-supernode
balancer failed a few times. No idea whether this helps...

> > Yes! Actually the currently implemented nr_balanced logic is pretty
> > dumb: the counter reaches the cross-node balance threshold after a
> > certain number of calls to intra-node lb, no matter whether these were
> > successfull or not. I'd like to try incrementing the counter only on
> > unsuccessfull load balances, this would give a clear priority to
> > intra-node balancing and a clear and controllable delay for cross-node
> > balancing. A tiny patch for this (for 2.5.59) is attached. As the name
> > nr_balanced would be misleading for this kind of usage, I renamed it to
> > nr_lb_failed.
>
> indeed this approach makes much more sense than the simple ->nr_balanced
> counter. A similar approach makes sense on the SMP level as well: if the
> current 'busy' rebalancer fails to get a new task, we can try the current
> 'idle' rebalancer. Ie. a CPU going idle would do the less intrusive
> rebalancing first.
>
> have you experimented with making the counter limit == 1 actually? Ie.
> immediately trying to do a global balancing once the less intrusive
> balancing fails?

Didn't have time to try and probably won't be able to check this
before the beginning of next week :-( .
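(For concreteness, the counter-on-failure idea could be shaped roughly
like the helper below; the function name and the NODE_BALANCE_RATE limit
are illustrative guesses here, not quotes from the attached patch:)

	/*
	 * Bump the counter only when an intra-node balance found nothing
	 * to steal; a successful local balance resets it.  Cross-node
	 * balancing then triggers after a controllable number of *failed*
	 * local attempts (a limit of 1 would mean: go global immediately).
	 */
	static inline int want_cross_node_balance(runqueue_t *this_rq, int stole)
	{
		if (stole) {
			this_rq->nr_lb_failed = 0;
			return 0;
		}
		if (++this_rq->nr_lb_failed < NODE_BALANCE_RATE)
			return 0;
		this_rq->nr_lb_failed = 0;
		return 1;
	}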

Regards,
Erich

2003-01-22 03:07:50

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
>
> the attached patch (against 2.5.59) is my current scheduler tree, it
> includes two main areas of changes:
>
> - interactivity improvements, mostly reworked bits from Andrea's tree and
> various tunings.
>
> - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> passive load balancing, active-balancing, HT-aware task pickup,
> HT-aware affinity and HT-aware wakeup.

I ran Erich's numatest on a system with this patch, plus the
cputime_stats patch (so that we would get meaningful numbers),
and found a problem. It appears that on the lightly loaded system
sched_best_cpu is now loading up one node before moving on to the
next. Once the system is loaded (i.e., a process per cpu) things
even out. Before applying the D7 patch, processes were distributed
evenly across nodes, even in low load situations.

The numbers below on schedbench show that on 4 and 8 process runs,
the D7 patch gets the same average user time as on the 16 and greater
runs. However, without the D7 patch, the 4 and 8 process runs tend
to have significantly decreased average user time.

Below I'm including the summarized output, and then the detailed
output for the relevant runs on both D7 patched systems and stock.

Overall performance is improved with the D7 patch, so I would like
to find out and fix what went wrong in the light-load cases and
encourage the adoption of the D7 patch (or at least the parts
that make the NUMA scheduler even faster). I'm not likely to have
time to chase this down for the next few days, so am posting
results to see if anyone else can find the cause.

kernels:
* stock59-stats = stock 2.5.59 with the cputime_stats patch
* ingoD7-59.stats = testD7-59 = stock59-stats + Ingo's D7 patch

Kernbench:
Elapsed User System CPU
testD7-59 28.96s 285.314s 79.616s 1260.6%
ingoD7-59.stats 28.824s 284.834s 79.164s 1263.6%
stock59-stats 29.498s 283.976s 83.05s 1243.8%

Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
testD7-59 53.19 53.43 212.81 0.59
ingoD7-59.stats 44.77 46.52 179.10 0.78
stock59-stats 22.25 35.94 89.06 0.81

Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
testD7-59 53.22 53.66 425.81 1.40
ingoD7-59.stats 39.44 47.15 315.62 1.62
stock59-stats 28.40 42.25 227.26 1.67

Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
testD7-59 52.84 58.26 845.49 2.78
ingoD7-59.stats 52.85 57.31 845.68 3.29
stock59-stats 52.97 57.19 847.70 3.29

Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
testD7-59 56.77 122.51 1816.80 7.58
ingoD7-59.stats 56.54 125.79 1809.67 6.97
stock59-stats 56.57 118.05 1810.53 5.97

Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
testD7-59 57.52 234.27 3681.86 18.18
ingoD7-59.stats 58.25 242.61 3728.46 17.40
stock59-stats 56.75 234.12 3632.72 15.70

Detailed stats from running numatest with 4 processes on the D7 patch.
Note how most of the load is put on node 0.

Executing 4 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 5.039
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 46.27
2 58.4 0.0 0.0 41.6 | 0 0 | 41.18
3 100.0 0.0 0.0 0.0 | 0 0 | 45.72
4 100.0 0.0 0.0 0.0 | 0 0 | 45.89
AverageUserTime 44.77 seconds
ElapsedTime 46.52
TotalUserTime 179.10
TotalSysTime 0.78

Detailed stats from running numatest with 8 processes on the D7 patch.
In this one it appears that node 0 was loaded, then node 1.

Executing 8 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 11.185
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 46.89
2 100.0 0.0 0.0 0.0 | 0 0 | 46.20
3 100.0 0.0 0.0 0.0 | 0 0 | 46.31
4 0.0 100.0 0.0 0.0 | 1 1 | 39.44
5 0.0 0.0 99.9 0.0 | 2 2 | 16.00
6 62.6 0.0 0.0 37.4 | 0 0 | 42.23
7 0.0 100.0 0.0 0.0 | 1 1 | 39.12
8 0.0 100.0 0.0 0.0 | 1 1 | 39.35
AverageUserTime 39.44 seconds
ElapsedTime 47.15
TotalUserTime 315.62
TotalSysTime 1.62

Control run - detailed stats running numatest with 4 processes
on a stock59 kernel.

Executing 4 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 8.297
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 99.8 0.0 0.0 | 1 1 | 16.63
2 100.0 0.0 0.0 0.0 | 0 0 | 27.83
3 0.0 0.0 99.9 0.0 | 2 2 | 16.27
4 100.0 0.0 0.0 0.0 | 0 0 | 28.29
AverageUserTime 22.25 seconds
ElapsedTime 35.94
TotalUserTime 89.06
TotalSysTime 0.81

Control run - detailed stats running numatest with 8 processes
on a stock59 kernel.

Executing 8 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 9.458
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 99.9 0.0 0.0 | 1 1 | 27.77
2 100.0 0.0 0.0 0.0 | 0 0 | 29.34
3 0.0 0.0 100.0 0.0 | 2 2 | 28.03
4 0.0 0.0 0.0 100.0 | 3 3 | 24.15
5 13.1 0.0 0.0 86.9 | 0 3 *| 33.36
6 0.0 100.0 0.0 0.0 | 1 1 | 27.94
7 0.0 0.0 100.0 0.0 | 2 2 | 28.02
8 100.0 0.0 0.0 0.0 | 0 0 | 28.58

--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-01-22 14:36:08

by Andrew Theurer

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

> On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
> >
> > the attached patch (against 2.5.59) is my current scheduler tree, it
> > includes two main areas of changes:
> >
> > - interactivity improvements, mostly reworked bits from Andrea's tree and
> > various tunings.
> >
> > - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> > passive load balancing, active-balancing, HT-aware task pickup,
> > HT-aware affinity and HT-aware wakeup.
>
> I ran Erich's numatest on a system with this patch, plus the
> cputime_stats patch (so that we would get meaningful numbers),
> and found a problem. It appears that on the lightly loaded system
> sched_best_cpu is now loading up one node before moving on to the
> next. Once the system is loaded (i.e., a process per cpu) things
> even out. Before applying the D7 patch, processes were distributed
> evenly across nodes, even in low load situations.

Michael, my experience has been that 2.5.59 loaded up the first node before
distributing out tasks (at least on kernbench). The first check in
sched_best_cpu would almost always place the new task on the same cpu, and
an intra-node balance on an idle cpu in the same node would almost always steal
it before an inter-node balance could steal it. Also, sched_best_cpu does
not appear to be changed in D7. Actually, I expected D7 to have the
opposite effect you describe (although I have not tried it yet), since
load_balance will now steal a running task if called by an idle cpu.

I'll try to get some of these tests on x440 asap to compare.

-Andrew Theurer

2003-01-22 16:09:02

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

> Michael, my experience has been that 2.5.59 loaded up the first node before
> distributing out tasks (at least on kernbench).

Could you throw a printk in map_cpu_to_node and check which cpus come out
on which nodes?
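Something like this just before the function's return should do
(variable names assumed here, adjust to the actual 2.5.59 code):

	printk("map_cpu_to_node: cpu %d -> node %d\n", cpu, node);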

> I'll try to get some of these tests on x440 asap to compare.

So what are you running on now? the HT stuff?

M.

2003-01-22 16:16:01

by Andrew Theurer

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

On Wednesday 22 January 2003 10:17, Martin J. Bligh wrote:
> > Michael, my experience has been that 2.5.59 loaded up the first node
> > before distributing out tasks (at least on kernbench).
>
> Could you throw a printk in map_cpu_to_node and check which cpus come out
> on which nodes?
>
> > I'll try to get some of these tests on x440 asap to compare.
>
> So what are you running on now? the HT stuff?

I am running nothing right now, mainly because I don't have access to that HT
system anymore. HT-numa may work better than it did with the new load_balance
idle policy, but I'm not sure it's worth pursuing with Ingo's HT patch. Let
me know if you think it's worth investigating.

-Andrew Theurer

2003-01-22 16:25:51

by Michael Hohnbaum

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-D7

On Wed, 2003-01-22 at 08:41, Andrew Theurer wrote:
> > On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
> > >
> > > the attached patch (against 2.5.59) is my current scheduler tree, it
> > > includes two main areas of changes:
> > >
> > > - interactivity improvements, mostly reworked bits from Andrea's tree and
> > > various tunings.
> > >
> > > - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> > > passive load balancing, active-balancing, HT-aware task pickup,
> > > HT-aware affinity and HT-aware wakeup.
> >
> > I ran Erich's numatest on a system with this patch, plus the
> > cputime_stats patch (so that we would get meaningful numbers),
> > and found a problem. It appears that on the lightly loaded system
> > sched_best_cpu is now loading up one node before moving on to the
> > next. Once the system is loaded (i.e., a process per cpu) things
> > even out. Before applying the D7 patch, processes were distributed
> > evenly across nodes, even in low load situations.
>
> Michael, my experience has been that 2.5.59 loaded up the first node before
> distributing out tasks (at least on kernbench).

Well, the data I posted doesn't support that conclusion - it showed at
most two processes on the first node before moving to the next node
for 2.5.59, but for the D7 patched system, the current node was fully
loaded before putting processes on other nodes. I've repeated this on
multiple runs and obtained similar results.

> The first check in
> sched_best_cpu would almost always place the new task on the same cpu, and
> an intra-node balance on an idle cpu in the same node would almost always steal
> it before an inter-node balance could steal it. Also, sched_best_cpu does
> not appear to be changed in D7.

That is true, and is the only thing I've had a chance to look at.
sched_best_cpu depends on data collected elsewhere, so my suspicion
is that it is working with bad data. I'll try to find time this week
to look further at it.

> Actually, I expected D7 to have the
> opposite effect you describe (although I have not tried it yet), since
> load_balance will now steal a running task if called by an idle cpu.
>
> I'll try to get some of these tests on x440 asap to compare.

I'm interested in seeing these results. Any chance of getting time on
a 4-node x440?
>
> -Andrew Theurer
>
>
--

Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486

2003-02-03 18:12:05

by Ingo Molnar

[permalink] [raw]
Subject: [patch] HT scheduler, sched-2.5.59-E2


the attached patch (against 2.5.59 or BK-curr) has two HT-scheduler fixes
over the -D7 patch:

- HT-aware task pickup did not do the right thing in an important
special-case: when there are 2 tasks running in a 2-CPU runqueue and
one of the tasks reschedules but still stays runnable. In this case,
instead of running the previous task again, the scheduler incorrectly
picked the idle task, which in turn reduced performance.

- active load-balancing only worked in a limited way, due to an indexing
bug, now it properly active-balances on all CPUs.
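Relative to -D7, the indexing fix visible in the hunks below boils down
to setting the active-balance flag on the busy runqueue that is being
asked to balance (indexed by i inside active_load_balance()) rather than
on the idle caller, so the woken migration thread actually finds its own
flag set:

	- if (!cpu_active_balance(this_cpu)) {
	- cpu_active_balance(this_cpu) = 1;
	+ if (!cpu_active_balance(i)) {
	+ cpu_active_balance(i) = 1;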

changes in -D7 over 2.5.59:

- interactivity improvements, mostly reworked bits from Andrea's tree
and various tunings.

- HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
passive load balancing, active-balancing, HT-aware task pickup,
HT-aware affinity and HT-aware wakeup.

the sched-2.5.59-E2 patch can also be downloaded from:

http://redhat.com/~mingo/O(1)-scheduler/

Ingo

--- linux/arch/i386/kernel/cpu/proc.c.orig
+++ linux/arch/i386/kernel/cpu/proc.c
@@ -1,4 +1,5 @@
#include <linux/smp.h>
+#include <linux/sched.h>
#include <linux/timex.h>
#include <linux/string.h>
#include <asm/semaphore.h>
@@ -101,6 +102,13 @@
fpu_exception ? "yes" : "no",
c->cpuid_level,
c->wp_works_ok ? "yes" : "no");
+#if CONFIG_SHARE_RUNQUEUE
+{
+ extern long __rq_idx[NR_CPUS];
+
+ seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]);
+}
+#endif

for ( i = 0 ; i < 32*NCAPINTS ; i++ )
if ( test_bit(i, c->x86_capability) &&
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -38,6 +38,7 @@
#include <linux/kernel.h>

#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/kernel_stat.h>
#include <linux/smp_lock.h>
#include <linux/irq.h>
@@ -945,6 +946,16 @@

int cpu_sibling_map[NR_CPUS] __cacheline_aligned;

+static int test_ht;
+
+static int __init ht_setup(char *str)
+{
+ test_ht = 1;
+ return 1;
+}
+
+__setup("test_ht", ht_setup);
+
static void __init smp_boot_cpus(unsigned int max_cpus)
{
int apicid, cpu, bit;
@@ -1087,16 +1098,31 @@
Dprintk("Boot done.\n");

/*
- * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
+ * Here we can be sure that there is an IO-APIC in the system. Let's
+ * go and set it up:
+ */
+ smpboot_setup_io_apic();
+
+ setup_boot_APIC_clock();
+
+ /*
+ * Synchronize the TSC with the AP
+ */
+ if (cpu_has_tsc && cpucount)
+ synchronize_tsc_bp();
+ /*
+ * If Hyper-Threading is available, construct cpu_sibling_map[], so
* that we can tell the sibling CPU efficiently.
*/
+printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus());
if (cpu_has_ht && smp_num_siblings > 1) {
for (cpu = 0; cpu < NR_CPUS; cpu++)
cpu_sibling_map[cpu] = NO_PROC_ID;

for (cpu = 0; cpu < NR_CPUS; cpu++) {
int i;
- if (!test_bit(cpu, &cpu_callout_map)) continue;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;

for (i = 0; i < NR_CPUS; i++) {
if (i == cpu || !test_bit(i, &cpu_callout_map))
@@ -1112,17 +1138,41 @@
printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
}
}
- }
-
- smpboot_setup_io_apic();
-
- setup_boot_APIC_clock();
+#if CONFIG_SHARE_RUNQUEUE
+ /*
+ * At this point APs would have synchronised TSC and
+ * waiting for smp_commenced, with their APIC timer
+ * disabled. So BP can go ahead do some initialization
+ * for Hyper-Threading (if needed).
+ */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ int i;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (i <= cpu)
+ continue;
+ if (!test_bit(i, &cpu_callout_map))
+ continue;

- /*
- * Synchronize the TSC with the AP
- */
- if (cpu_has_tsc && cpucount)
- synchronize_tsc_bp();
+ if (phys_proc_id[cpu] != phys_proc_id[i])
+ continue;
+ /*
+ * merge runqueues, resulting in one
+ * runqueue per package:
+ */
+ sched_map_runqueue(cpu, i);
+ break;
+ }
+ }
+#endif
+ }
+#if CONFIG_SHARE_RUNQUEUE
+ if (smp_num_siblings == 1 && test_ht) {
+ printk("Simulating a 2-sibling 1-phys-CPU HT setup!\n");
+ sched_map_runqueue(0, 1);
+ }
+#endif
}

/* These are wrappers to interface to the new boot process. Someone
--- linux/arch/i386/Kconfig.orig
+++ linux/arch/i386/Kconfig
@@ -408,6 +408,24 @@

If you don't know what to do here, say N.

+choice
+
+ prompt "Hyperthreading Support"
+ depends on SMP
+ default NR_SIBLINGS_0
+
+config NR_SIBLINGS_0
+ bool "off"
+
+config NR_SIBLINGS_2
+ bool "2 siblings"
+
+config NR_SIBLINGS_4
+ bool "4 siblings"
+
+endchoice
+
+
config PREEMPT
bool "Preemptible Kernel"
help
--- linux/fs/pipe.c.orig
+++ linux/fs/pipe.c
@@ -117,7 +117,7 @@
up(PIPE_SEM(*inode));
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
if (ret > 0)
@@ -205,7 +205,7 @@
}
up(PIPE_SEM(*inode));
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
}
if (ret > 0) {
@@ -275,7 +275,7 @@
free_page((unsigned long) info->base);
kfree(info);
} else {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -147,6 +147,24 @@
extern void sched_init(void);
extern void init_idle(task_t *idle, int cpu);

+/*
+ * Is there a way to do this via Kconfig?
+ */
+#define CONFIG_NR_SIBLINGS 0
+
+#if CONFIG_NR_SIBLINGS_2
+# define CONFIG_NR_SIBLINGS 2
+#elif CONFIG_NR_SIBLINGS_4
+# define CONFIG_NR_SIBLINGS 4
+#endif
+
+#if CONFIG_NR_SIBLINGS
+# define CONFIG_SHARE_RUNQUEUE 1
+#else
+# define CONFIG_SHARE_RUNQUEUE 0
+#endif
+extern void sched_map_runqueue(int cpu1, int cpu2);
+
extern void show_state(void);
extern void show_trace(unsigned long *stack);
extern void show_stack(unsigned long *stack);
@@ -293,7 +311,7 @@
prio_array_t *array;

unsigned long sleep_avg;
- unsigned long sleep_timestamp;
+ unsigned long last_run;

unsigned long policy;
unsigned long cpus_allowed;
@@ -605,6 +623,8 @@
#define remove_parent(p) list_del_init(&(p)->sibling)
#define add_parent(p, parent) list_add_tail(&(p)->sibling,&(parent)->children)

+#if 1
+
#define REMOVE_LINKS(p) do { \
if (thread_group_leader(p)) \
list_del_init(&(p)->tasks); \
@@ -633,6 +653,31 @@
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)

+#else
+
+#define REMOVE_LINKS(p) do { \
+ list_del_init(&(p)->tasks); \
+ remove_parent(p); \
+ } while (0)
+
+#define SET_LINKS(p) do { \
+ list_add_tail(&(p)->tasks,&init_task.tasks); \
+ add_parent(p, (p)->parent); \
+ } while (0)
+
+#define next_task(p) list_entry((p)->tasks.next, struct task_struct, tasks)
+#define prev_task(p) list_entry((p)->tasks.prev, struct task_struct, tasks)
+
+#define for_each_process(p) \
+ for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+
+#define do_each_thread(g, t) \
+ for (t = &init_task ; (t = next_task(t)) != &init_task ; )
+
+#define while_each_thread(g, t)
+
+#endif
+
extern task_t * FASTCALL(next_thread(task_t *p));

#define thread_group_leader(p) (p->pid == p->tgid)
--- linux/include/asm-i386/apic.h.orig
+++ linux/include/asm-i386/apic.h
@@ -98,4 +98,6 @@

#endif /* CONFIG_X86_LOCAL_APIC */

+extern int phys_proc_id[NR_CPUS];
+
#endif /* __ASM_APIC_H */
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -876,7 +876,7 @@
*/
p->first_time_slice = 1;
current->time_slice >>= 1;
- p->sleep_timestamp = jiffies;
+ p->last_run = jiffies;
if (!current->time_slice) {
/*
* This case is rare, it happens when the parent has only
--- linux/kernel/sys.c.orig
+++ linux/kernel/sys.c
@@ -220,7 +220,7 @@

if (error == -ESRCH)
error = 0;
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+ if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))
error = -EACCES;
else
set_user_nice(p, niceval);
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -60,13 +60,15 @@
*/
#define MIN_TIMESLICE ( 10 * HZ / 1000)
#define MAX_TIMESLICE (300 * HZ / 1000)
-#define CHILD_PENALTY 95
+#define CHILD_PENALTY 50
#define PARENT_PENALTY 100
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (2*HZ)
-#define STARVATION_LIMIT (2*HZ)
+#define MAX_SLEEP_AVG (10*HZ)
+#define STARVATION_LIMIT (30*HZ)
+#define SYNC_WAKEUPS 1
+#define SMART_WAKE_CHILD 1
#define NODE_THRESHOLD 125

/*
@@ -141,6 +143,48 @@
};

/*
+ * It's possible for two CPUs to share the same runqueue.
+ * This makes sense if they eg. share caches.
+ *
+ * We take the common 1:1 (SMP, UP) case and optimize it,
+ * the rest goes via remapping: rq_idx(cpu) gives the
+ * runqueue on which a particular cpu is on, cpu_idx(cpu)
+ * gives the rq-specific index of the cpu.
+ *
+ * (Note that the generic scheduler code does not impose any
+ * restrictions on the mappings - there can be 4 CPUs per
+ * runqueue or even assymetric mappings.)
+ */
+#if CONFIG_SHARE_RUNQUEUE
+# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS
+ long __rq_idx[NR_CPUS] __cacheline_aligned;
+ static long __cpu_idx[NR_CPUS] __cacheline_aligned;
+# define rq_idx(cpu) (__rq_idx[(cpu)])
+# define cpu_idx(cpu) (__cpu_idx[(cpu)])
+# define for_each_sibling(idx, rq) \
+ for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++)
+# define rq_nr_cpus(rq) ((rq)->nr_cpus)
+# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance)
+#else
+# define MAX_NR_SIBLINGS 1
+# define rq_idx(cpu) (cpu)
+# define cpu_idx(cpu) 0
+# define for_each_sibling(idx, rq) while (0)
+# define cpu_active_balance(c) 0
+# define do_active_balance(rq, cpu) do { } while (0)
+# define rq_nr_cpus(rq) 1
+ static inline void active_load_balance(runqueue_t *rq, int this_cpu) { }
+#endif
+
+typedef struct cpu_s {
+ task_t *curr, *idle;
+ task_t *migration_thread;
+ struct list_head migration_queue;
+ int active_balance;
+ int cpu;
+} cpu_t;
+
+/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
@@ -151,7 +195,7 @@
spinlock_t lock;
unsigned long nr_running, nr_switches, expired_timestamp,
nr_uninterruptible;
- task_t *curr, *idle;
+ struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
#ifdef CONFIG_NUMA
@@ -159,27 +203,39 @@
unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
- task_t *migration_thread;
- struct list_head migration_queue;
+ int nr_cpus;
+ cpu_t cpu[MAX_NR_SIBLINGS];

atomic_t nr_iowait;
} ____cacheline_aligned;

static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;

-#define cpu_rq(cpu) (runqueues + (cpu))
+#define cpu_rq(cpu) (runqueues + (rq_idx(cpu)))
+#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c))
+#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr)
+#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle)
+
#define this_rq() cpu_rq(smp_processor_id())
#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

+#define migration_thread(cpu) (cpu_int(cpu)->migration_thread)
+#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue)
+
+#if NR_CPUS > 1
+# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu)))
+#else
+# define task_allowed(p, cpu) 1
+#endif
+
/*
* Default context-switch locking:
*/
#ifndef prepare_arch_switch
# define prepare_arch_switch(rq, next) do { } while(0)
# define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p) ((rq)->curr == (p))
+# define task_running(p) (cpu_curr_ptr(task_cpu(p)) == (p))
#endif

#ifdef CONFIG_NUMA
@@ -322,16 +378,21 @@
* Also update all the scheduling statistics stuff. (sleep average
* calculation, priority modifiers, etc.)
*/
+static inline void __activate_task(task_t *p, runqueue_t *rq)
+{
+ enqueue_task(p, rq->active);
+ nr_running_inc(rq);
+}
+
static inline void activate_task(task_t *p, runqueue_t *rq)
{
- unsigned long sleep_time = jiffies - p->sleep_timestamp;
- prio_array_t *array = rq->active;
+ unsigned long sleep_time = jiffies - p->last_run;

if (!rt_task(p) && sleep_time) {
/*
* This code gives a bonus to interactive tasks. We update
* an 'average sleep time' value here, based on
- * sleep_timestamp. The more time a task spends sleeping,
+ * ->last_run. The more time a task spends sleeping,
* the higher the average gets - and the higher the priority
* boost gets as well.
*/
@@ -340,8 +401,7 @@
p->sleep_avg = MAX_SLEEP_AVG;
p->prio = effective_prio(p);
}
- enqueue_task(p, array);
- nr_running_inc(rq);
+ __activate_task(p, rq);
}

/*
@@ -382,6 +442,11 @@
#endif
}

+static inline void resched_cpu(int cpu)
+{
+ resched_task(cpu_curr_ptr(cpu));
+}
+
#ifdef CONFIG_SMP

/*
@@ -398,7 +463,7 @@
repeat:
preempt_disable();
rq = task_rq(p);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
cpu_relax();
/*
* enable/disable preemption just to make this
@@ -409,7 +474,7 @@
goto repeat;
}
rq = task_rq_lock(p, &flags);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
task_rq_unlock(rq, &flags);
preempt_enable();
goto repeat;
@@ -431,10 +496,39 @@
*/
void kick_if_running(task_t * p)
{
- if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
+ if ((task_running(p)) && (task_cpu(p) != smp_processor_id()))
resched_task(p);
}

+static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p)
+{
+ cpu_t *curr_cpu;
+ task_t *curr;
+ int idx;
+
+ if (idle_cpu(cpu))
+ return resched_cpu(cpu);
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ if (curr_cpu->idle == curr_cpu->curr)
+ return resched_cpu(curr_cpu->cpu);
+ }
+
+ if (p->prio < cpu_curr_ptr(cpu)->prio)
+ return resched_task(cpu_curr_ptr(cpu));
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ curr = curr_cpu->curr;
+ if (p->prio < curr->prio)
+ return resched_task(curr);
+ }
+}
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -455,6 +549,7 @@
long old_state;
runqueue_t *rq;

+ sync &= SYNC_WAKEUPS;
repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
@@ -463,7 +558,7 @@
* Fast-migrate the task if it's not running or runnable
* currently. Do not violate hard affinity.
*/
- if (unlikely(sync && !task_running(rq, p) &&
+ if (unlikely(sync && !task_running(p) &&
(task_cpu(p) != smp_processor_id()) &&
(p->cpus_allowed & (1UL << smp_processor_id())))) {

@@ -473,10 +568,12 @@
}
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
- activate_task(p, rq);
-
- if (p->prio < rq->curr->prio)
- resched_task(rq->curr);
+ if (sync)
+ __activate_task(p, rq);
+ else {
+ activate_task(p, rq);
+ wake_up_cpu(rq, task_cpu(p), p);
+ }
success = 1;
}
p->state = TASK_RUNNING;
@@ -512,8 +609,19 @@
p->prio = effective_prio(p);
}
set_task_cpu(p, smp_processor_id());
- activate_task(p, rq);

+ if (SMART_WAKE_CHILD) {
+ if (unlikely(!current->array))
+ __activate_task(p, rq);
+ else {
+ p->prio = current->prio;
+ list_add_tail(&p->run_list, &current->run_list);
+ p->array = current->array;
+ p->array->nr_active++;
+ nr_running_inc(rq);
+ }
+ } else
+ activate_task(p, rq);
rq_unlock(rq);
}

@@ -561,7 +669,7 @@
* context_switch - switch to the new MM and the new
* thread's register state.
*/
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;
@@ -575,7 +683,7 @@

if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
- mmdrop(oldmm);
+ rq->prev_mm = oldmm;
}

/* Here we just switch the register state and the stack. */
@@ -596,8 +704,9 @@
unsigned long i, sum = 0;

for (i = 0; i < NR_CPUS; i++)
- sum += cpu_rq(i)->nr_running;
-
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_running;
return sum;
}

@@ -608,7 +717,9 @@
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
- sum += cpu_rq(i)->nr_uninterruptible;
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_uninterruptible;
}
return sum;
}
@@ -790,7 +901,23 @@

#endif /* CONFIG_NUMA */

-#if CONFIG_SMP
+/*
+ * One of the idle_cpu_tick() and busy_cpu_tick() functions will
+ * get called every timer tick, on every CPU. Our balancing action
+ * frequency and balancing agressivity depends on whether the CPU is
+ * idle or not.
+ *
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * systems with HZ=100, every 10 msecs.)
+ */
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+
+#if !CONFIG_SMP
+
+static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { }
+
+#else

/*
* double_lock_balance - lock the busiest runqueue
@@ -906,12 +1033,7 @@
set_task_cpu(p, this_cpu);
nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
- /*
- * Note that idle threads have a prio of MAX_PRIO, for this test
- * to be always true for them.
- */
- if (p->prio < this_rq->curr->prio)
- set_need_resched();
+ wake_up_cpu(this_rq, this_cpu, p);
}

/*
@@ -922,9 +1044,9 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int this_cpu, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, idx;
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
@@ -972,12 +1094,14 @@
* 1) running (obviously), or
* 2) cannot be migrated to this CPU due to cpus_allowed, or
* 3) are cache-hot on their current CPU.
+ *
+ * (except if we are in idle mode which is a more agressive
+ * form of rebalancing.)
*/

-#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
- ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
- !task_running(rq, p) && \
- ((p)->cpus_allowed & (1UL << (this_cpu))))
+#define CAN_MIGRATE_TASK(p,rq,cpu) \
+ ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
+ !task_running(p) && task_allowed(p, cpu))

curr = curr->prev;

@@ -1000,31 +1124,136 @@
;
}

+#if CONFIG_SHARE_RUNQUEUE
+static void active_load_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+ /*
+ * Any SMT-specific imbalance?
+ */
+ for_each_sibling(idx, rq)
+ if (rq->cpu[idx].idle == rq->cpu[idx].curr)
+ goto next_cpu;
+
+ /*
+ * At this point it's sure that we have a SMT
+ * imbalance: this (physical) CPU is idle but
+ * another CPU has two (or more) tasks running.
+ *
+ * We wake up one of the migration threads (it
+ * doesnt matter which one) and let it fix things up:
+ */
+ if (!cpu_active_balance(i)) {
+ cpu_active_balance(i) = 1;
+ spin_unlock(&this_rq->lock);
+ wake_up_process(rq->cpu[0].migration_thread);
+ spin_lock(&this_rq->lock);
+ }
+next_cpu:
+ }
+}
+
+static void do_active_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ spin_unlock(&this_rq->lock);
+
+ cpu_active_balance(this_cpu) = 0;
+
+ /*
+ * Is the imbalance still present?
+ */
+ for_each_sibling(idx, this_rq)
+ if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr)
+ goto out;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+
+ /* completely idle CPU? */
+ if (rq->nr_running)
+ continue;
+
+ /*
+ * At this point it's reasonably sure that we have an
+ * imbalance. Since we are the migration thread, try to
+ * balance a thread over to the target queue.
+ */
+ spin_lock(&rq->lock);
+ load_balance(rq, i, 1);
+ spin_unlock(&rq->lock);
+ goto out;
+ }
+out:
+ spin_lock(&this_rq->lock);
+}
+
/*
- * One of the idle_cpu_tick() and busy_cpu_tick() functions will
- * get called every timer tick, on every CPU. Our balancing action
- * frequency and balancing agressivity depends on whether the CPU is
- * idle or not.
+ * This routine is called to map a CPU into another CPU's runqueue.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
- * systems with HZ=100, every 10 msecs.)
+ * This must be called during bootup with the merged runqueue having
+ * no tasks.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
-#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+void sched_map_runqueue(int cpu1, int cpu2)
+{
+ runqueue_t *rq1 = cpu_rq(cpu1);
+ runqueue_t *rq2 = cpu_rq(cpu2);
+ int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx;
+
+ printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id());
+ if (rq1 == rq2)
+ BUG();
+ if (rq2->nr_running)
+ BUG();
+ /*
+ * At this point, we don't have anything in the runqueue yet. So,
+ * there is no need to move processes between the runqueues.
+ * Only the idle processes need to be combined and accessed
+ * properly.
+ */
+ cpu2_idx = rq1->nr_cpus++;

-static inline void idle_tick(runqueue_t *rq)
+ if (rq_idx(cpu1) != cpu1)
+ BUG();
+ rq_idx(cpu2) = cpu1;
+ cpu_idx(cpu2) = cpu2_idx;
+ rq1->cpu[cpu2_idx].cpu = cpu2;
+ rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle;
+ rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr;
+ INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue);
+
+ /* just to be safe: */
+ rq2->cpu[cpu2_idx_orig].idle = NULL;
+ rq2->cpu[cpu2_idx_orig].curr = NULL;
+}
+#endif
+#endif
+
+DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
+
+static inline void idle_tick(runqueue_t *rq, unsigned long j)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ if (j % IDLE_REBALANCE_TICK)
return;
spin_lock(&rq->lock);
- load_balance(rq, 1);
+ load_balance(rq, smp_processor_id(), 1);
spin_unlock(&rq->lock);
}

-#endif
-
-DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
-
/*
* We place interactive tasks back into the active array, if possible.
*
@@ -1035,9 +1264,9 @@
* increasing number of running tasks:
*/
#define EXPIRED_STARVING(rq) \
- ((rq)->expired_timestamp && \
+ (STARVATION_LIMIT && ((rq)->expired_timestamp && \
(jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1))
+ STARVATION_LIMIT * ((rq)->nr_running) + 1)))

/*
* This function gets called by the timer code, with HZ frequency.
@@ -1050,12 +1279,13 @@
{
int cpu = smp_processor_id();
runqueue_t *rq = this_rq();
+ unsigned long j = jiffies;
task_t *p = current;

if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);

- if (p == rq->idle) {
+ if (p == cpu_idle_ptr(cpu)) {
/* note: this timer irq context must be accounted for as well */
if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1063,9 +1293,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ idle_tick(rq, j);
return;
}
if (TASK_NICE(p) > 0)
@@ -1074,12 +1302,13 @@
kstat_cpu(cpu).cpustat.user += user_ticks;
kstat_cpu(cpu).cpustat.system += sys_ticks;

+ spin_lock(&rq->lock);
/* Task might have expired already, but not scheduled off yet */
if (p->array != rq->active) {
set_tsk_need_resched(p);
+ spin_unlock(&rq->lock);
return;
}
- spin_lock(&rq->lock);
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
@@ -1115,16 +1344,14 @@

if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
+ rq->expired_timestamp = j;
enqueue_task(p, rq->expired);
} else
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
+ if (!(j % BUSY_REBALANCE_TICK))
+ load_balance(rq, smp_processor_id(), 0);
spin_unlock(&rq->lock);
}

@@ -1135,11 +1362,11 @@
*/
asmlinkage void schedule(void)
{
+ int idx, this_cpu, retry = 0;
+ struct list_head *queue;
task_t *prev, *next;
- runqueue_t *rq;
prio_array_t *array;
- struct list_head *queue;
- int idx;
+ runqueue_t *rq;

/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -1152,15 +1379,15 @@
dump_stack();
}
}
-
- check_highmem_ptes();
need_resched:
+ check_highmem_ptes();
+ this_cpu = smp_processor_id();
preempt_disable();
prev = current;
rq = this_rq();

release_kernel_lock(prev);
- prev->sleep_timestamp = jiffies;
+ prev->last_run = jiffies;
spin_lock_irq(&rq->lock);

/*
@@ -1183,12 +1410,14 @@
}
pick_next_task:
if (unlikely(!rq->nr_running)) {
-#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, this_cpu, 1);
if (rq->nr_running)
goto pick_next_task;
-#endif
- next = rq->idle;
+ active_load_balance(rq, this_cpu);
+ if (rq->nr_running)
+ goto pick_next_task;
+pick_idle:
+ next = cpu_idle_ptr(this_cpu);
rq->expired_timestamp = 0;
goto switch_tasks;
}
@@ -1204,24 +1433,60 @@
rq->expired_timestamp = 0;
}

+new_array:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
+ if ((next != prev) && (rq_nr_cpus(rq) > 1)) {
+ struct list_head *tmp = queue->next;
+
+ while ((task_running(next) && (next != prev)) || !task_allowed(next, this_cpu)) {
+ tmp = tmp->next;
+ if (tmp != queue) {
+ next = list_entry(tmp, task_t, run_list);
+ continue;
+ }
+ idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+ if (idx == MAX_PRIO) {
+ if (retry || !rq->expired->nr_active) {
+ goto pick_idle;
+ }
+ /*
+ * To avoid infinite changing of arrays,
+ * when we have only tasks runnable by
+ * sibling.
+ */
+ retry = 1;
+
+ array = rq->expired;
+ goto new_array;
+ }
+ queue = array->queue + idx;
+ tmp = queue->next;
+ next = list_entry(tmp, task_t, run_list);
+ }
+ }

switch_tasks:
prefetch(next);
clear_tsk_need_resched(prev);
- RCU_qsctr(prev->thread_info->cpu)++;
+ RCU_qsctr(task_cpu(prev))++;

if (likely(prev != next)) {
+ struct mm_struct *prev_mm;
rq->nr_switches++;
- rq->curr = next;
+ cpu_curr_ptr(this_cpu) = next;
+ set_task_cpu(next, this_cpu);

prepare_arch_switch(rq, next);
- prev = context_switch(prev, next);
+ prev = context_switch(rq, prev, next);
barrier();
rq = this_rq();
+ prev_mm = rq->prev_mm;
+ rq->prev_mm = NULL;
finish_arch_switch(rq, prev);
+ if (prev_mm)
+ mmdrop(prev_mm);
} else
spin_unlock_irq(&rq->lock);

@@ -1481,9 +1746,8 @@
* If the task is running and lowered its priority,
* or increased its priority then reschedule its CPU:
*/
- if ((NICE_TO_PRIO(nice) < p->static_prio) ||
- task_running(rq, p))
- resched_task(rq->curr);
+ if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p))
+ resched_task(cpu_curr_ptr(task_cpu(p)));
}
out_unlock:
task_rq_unlock(rq, &flags);
@@ -1561,7 +1825,7 @@
*/
int task_curr(task_t *p)
{
- return cpu_curr(task_cpu(p)) == p;
+ return cpu_curr_ptr(task_cpu(p)) == p;
}

/**
@@ -1570,7 +1834,7 @@
*/
int idle_cpu(int cpu)
{
- return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+ return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu);
}

/**
@@ -1660,7 +1924,7 @@
else
p->prio = p->static_prio;
if (array)
- activate_task(p, task_rq(p));
+ __activate_task(p, task_rq(p));

out_unlock:
task_rq_unlock(rq, &flags);
@@ -2135,7 +2399,7 @@
local_irq_save(flags);
double_rq_lock(idle_rq, rq);

- idle_rq->curr = idle_rq->idle = idle;
+ cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle;
deactivate_task(idle, rq);
idle->array = NULL;
idle->prio = MAX_PRIO;
@@ -2190,6 +2454,7 @@
unsigned long flags;
migration_req_t req;
runqueue_t *rq;
+ int cpu;

#if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */
new_mask &= cpu_online_map;
@@ -2211,31 +2476,31 @@
* If the task is not on a runqueue (and not running), then
* it is sufficient to simply update the task's cpu field.
*/
- if (!p->array && !task_running(rq, p)) {
+ if (!p->array && !task_running(p)) {
set_task_cpu(p, __ffs(p->cpus_allowed));
task_rq_unlock(rq, &flags);
return;
}
init_completion(&req.done);
req.task = p;
- list_add(&req.list, &rq->migration_queue);
+ cpu = task_cpu(p);
+ list_add(&req.list, migration_queue(cpu));
task_rq_unlock(rq, &flags);
-
- wake_up_process(rq->migration_thread);
+ wake_up_process(migration_thread(cpu));

wait_for_completion(&req.done);
}

/*
- * migration_thread - this is a highprio system thread that performs
+ * migration_task - this is a highprio system thread that performs
* thread migration by 'pulling' threads into the target runqueue.
*/
-static int migration_thread(void * data)
+static int migration_task(void * data)
{
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
int cpu = (long) data;
runqueue_t *rq;
- int ret;
+ int ret, idx;

daemonize();
sigfillset(&current->blocked);
@@ -2250,7 +2515,8 @@
ret = setscheduler(0, SCHED_FIFO, &param);

rq = this_rq();
- rq->migration_thread = current;
+ migration_thread(cpu) = current;
+ idx = cpu_idx(cpu);

sprintf(current->comm, "migration/%d", smp_processor_id());

@@ -2263,7 +2529,9 @@
task_t *p;

spin_lock_irqsave(&rq->lock, flags);
- head = &rq->migration_queue;
+ if (cpu_active_balance(cpu))
+ do_active_balance(rq, cpu);
+ head = migration_queue(cpu);
current->state = TASK_INTERRUPTIBLE;
if (list_empty(head)) {
spin_unlock_irqrestore(&rq->lock, flags);
@@ -2292,9 +2560,8 @@
set_task_cpu(p, cpu_dest);
if (p->array) {
deactivate_task(p, rq_src);
- activate_task(p, rq_dest);
- if (p->prio < rq_dest->curr->prio)
- resched_task(rq_dest->curr);
+ __activate_task(p, rq_dest);
+ wake_up_cpu(rq_dest, cpu_dest, p);
}
}
double_rq_unlock(rq_src, rq_dest);
@@ -2312,12 +2579,13 @@
unsigned long action,
void *hcpu)
{
+ long cpu = (long) hcpu;
+
switch (action) {
case CPU_ONLINE:
- printk("Starting migration thread for cpu %li\n",
- (long)hcpu);
- kernel_thread(migration_thread, hcpu, CLONE_KERNEL);
- while (!cpu_rq((long)hcpu)->migration_thread)
+ printk("Starting migration thread for cpu %li\n", cpu);
+ kernel_thread(migration_task, hcpu, CLONE_KERNEL);
+ while (!migration_thread(cpu))
yield();
break;
}
@@ -2392,11 +2660,20 @@
for (i = 0; i < NR_CPUS; i++) {
prio_array_t *array;

+ /*
+ * Start with a 1:1 mapping between CPUs and runqueues:
+ */
+#if CONFIG_SHARE_RUNQUEUE
+ rq_idx(i) = i;
+ cpu_idx(i) = 0;
+#endif
rq = cpu_rq(i);
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- INIT_LIST_HEAD(&rq->migration_queue);
+ INIT_LIST_HEAD(migration_queue(i));
+ rq->nr_cpus = 1;
+ rq->cpu[cpu_idx(i)].cpu = i;
atomic_set(&rq->nr_iowait, 0);
nr_running_init(rq);

@@ -2414,9 +2691,13 @@
* We have to do a little magic to get the first
* thread right in SMP mode.
*/
- rq = this_rq();
- rq->curr = current;
- rq->idle = current;
+ cpu_curr_ptr(smp_processor_id()) = current;
+ cpu_idle_ptr(smp_processor_id()) = current;
+ printk("sched_init().\n");
+ printk("smp_processor_id(): %d.\n", smp_processor_id());
+ printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));
+ printk("this_rq(): %p.\n", this_rq());
+
set_task_cpu(current, smp_processor_id());
wake_up_process(current);

--- linux/init/main.c.orig
+++ linux/init/main.c
@@ -354,7 +354,14 @@

static void rest_init(void)
{
+ /*
+ * We count on the initial thread going ok
+ * Like idlers init is an unlocked kernel thread, which will
+ * make syscalls (and thus be locked).
+ */
+ init_idle(current, smp_processor_id());
kernel_thread(init, NULL, CLONE_KERNEL);
+
unlock_kernel();
cpu_idle();
}
@@ -438,13 +445,6 @@
check_bugs();
printk("POSIX conformance testing by UNIFIX\n");

- /*
- * We count on the initial thread going ok
- * Like idlers init is an unlocked kernel thread, which will
- * make syscalls (and thus be locked).
- */
- init_idle(current, smp_processor_id());
-
/* Do the rest non-__init'ed, we're now alive */
rest_init();
}
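
An aside on the two rebalance-interval macros near the top of the sched.c
hunk above: "x ?: y" is the GCC extension that evaluates to x unless x is
zero, which is what keeps IDLE_REBALANCE_TICK from collapsing to 0 on
HZ=100 kernels. A throwaway userspace sketch of the resulting intervals
(assuming only the macro definitions quoted above; not part of the patch):

#include <stdio.h>

/* Mirrors the patch's macros: "x ?: y" yields x unless x evaluates
 * to 0, in which case it yields y.  The fallback matters for
 * IDLE_REBALANCE_TICK when HZ=100, because 100/1000 == 0. */
static void show(int hz)
{
	int busy = (hz / 4)    ?: 1;	/* BUSY_REBALANCE_TICK */
	int idle = (hz / 1000) ?: 1;	/* IDLE_REBALANCE_TICK */

	printf("HZ=%4d: busy-rebalance every %3d ticks (%3d ms), "
	       "idle-rebalance every %d tick(s) (%2d ms)\n",
	       hz, busy, busy * 1000 / hz, idle, idle * 1000 / hz);
}

int main(void)
{
	show(100);	/* busy: 25 ticks = 250 ms, idle: 1 tick = 10 ms */
	show(1000);	/* busy: 250 ticks = 250 ms, idle: 1 tick = 1 ms */
	return 0;
}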

2003-02-03 20:38:14

by Robert Love

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-E2

On Mon, 2003-02-03 at 13:23, Ingo Molnar wrote:

> the attached patch (against 2.5.59 or BK-curr) has two HT-scheduler fixes
> over the -D7 patch:

Hi, Ingo. I am running this now, with good results. Unfortunately I do
not have an HT-capable processor, so I am only testing the interactivity
improvements. They are looking very good - a step in the right
direction. Very nice.

A couple bits:

> - wake_up_interruptible(PIPE_WAIT(*inode));
> + wake_up_interruptible_sync(PIPE_WAIT(*inode));
> ...
> - wake_up_interruptible(PIPE_WAIT(*inode));
> + wake_up_interruptible_sync(PIPE_WAIT(*inode));
> ...
> - wake_up_interruptible(PIPE_WAIT(*inode));
> + wake_up_interruptible_sync(PIPE_WAIT(*inode));

These are not correct, right? I believe we want normal behavior here,
no?
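
For reference, the distinction being questioned here (a rough sketch with a
made-up helper, not the real fs/pipe.c code): the _sync variants hint to the
wakeup path that the waker is about to block, so the woken task should not
immediately preempt it or be bounced to another CPU. Whether that hint is
appropriate for the pipe paths is exactly the open question.

/* Hypothetical illustration only -- not the 2.5.59 pipe code. */
static void writer_wakes_reader(wait_queue_head_t *wq, int writer_will_sleep)
{
	if (writer_will_sleep)
		/* "sync" hint: the writer is about to sleep (e.g. the
		 * pipe is full), so let it finish instead of preempting
		 * it or migrating the reader right away. */
		wake_up_interruptible_sync(wq);
	else
		/* normal wakeup: the reader may preempt the writer
		 * immediately if it has higher priority. */
		wake_up_interruptible(wq);
}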

> --- linux/kernel/sys.c.orig
> +++ linux/kernel/sys.c
> @@ -220,7 +220,7 @@
>
> if (error == -ESRCH)
> error = 0;
> - if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
> + if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))

What is the point of this? Left in for debugging?

> -#define MAX_SLEEP_AVG (2*HZ)
> -#define STARVATION_LIMIT (2*HZ)
> +#define MAX_SLEEP_AVG (10*HZ)
> +#define STARVATION_LIMIT (30*HZ)

Do you really want a 30 second starvation limit? Ouch.
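
For scale: with the EXPIRED_STARVING() check in this patch, tasks in the
expired array are only rescued after STARVATION_LIMIT * nr_running ticks,
so the window grows with load. Rough numbers (back-of-the-envelope,
assuming HZ=1000):

#include <stdio.h>

/* Starvation window, old vs. new STARVATION_LIMIT. */
int main(void)
{
	const long HZ = 1000;
	const long OLD_LIMIT = 2 * HZ;	/* 2.5.59 STARVATION_LIMIT */
	const long NEW_LIMIT = 30 * HZ;	/* value in the -E2 patch   */

	for (long nr_running = 1; nr_running <= 4; nr_running++)
		printf("nr_running=%ld: expired tasks may wait %lds "
		       "(was %lds)\n", nr_running,
		       NEW_LIMIT * nr_running / HZ,
		       OLD_LIMIT * nr_running / HZ);
	return 0;
}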

> + printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));

This gives a compiler warning on UP:

kernel/sched.c: In function `sched_init':
kernel/sched.c:2722: warning: long int format, int arg (arg 2)

Since rq_idx(), on UP, merely returns its parameter, which is an int.
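
One way to silence it without touching the UP definition would be a cast at
the call site (untested, but the usual fix for this class of warning):

	/* Keeps the %ld format correct whether rq_idx() yields an int
	 * (UP) or a long (the SMP/NUMA runqueue-index case). */
	printk("rq_idx(smp_processor_id()): %ld.\n",
	       (long) rq_idx(smp_processor_id()));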

Otherwise, looking nice :)

Robert Love

2003-02-04 09:21:35

by Erich Focht

[permalink] [raw]
Subject: Re: [patch] HT scheduler, sched-2.5.59-E2

Hi Ingo,

On Monday 03 February 2003 19:23, Ingo Molnar wrote:
> -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
> - ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
> - !task_running(rq, p) && \
> - ((p)->cpus_allowed & (1UL << (this_cpu))))
> +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> + ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
> + !task_running(p) && task_allowed(p, cpu))

At least for NUMA systems, this is too aggressive (though I believe
normal SMP systems could be hurt, too).

The problem: freshly forked tasks are stolen by idle CPUs on the same
node before they exec. This effectively disables the sched_balance_exec()
mechanism, as the tasks to be balanced already run alone on other
CPUs. Which means the whole benefit of having balanced nodes
(maximizing memory bandwidth) is gone.

The change below is less aggressive but follows the same philosophy. Could
you please take it instead?

> +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> + ((jiffies - (p)->last_run > (cache_decay_ticks >> idle)) && \
> + !task_running(p) && task_allowed(p, cpu))
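
Spelled out with made-up numbers (cache_decay_ticks = 8 here; the
!task_running()/task_allowed() parts are identical in both versions and
omitted), the difference between the two idle-mode gates looks like this:

#include <stdio.h>

/* age = jiffies - p->last_run; decay = cache_decay_ticks (here: 8). */
static int gate_e2(unsigned long age, int idle, unsigned long decay)
{
	return idle || (age > decay);	/* -E2: idle skips the decay check */
}

static int gate_alt(unsigned long age, int idle, unsigned long decay)
{
	return age > (decay >> idle);	/* proposal: idle halves the window */
}

int main(void)
{
	const unsigned long decay = 8;

	/* At age 0 an idle CPU already steals the task under -E2,
	 * but has to wait for age > 4 under the proposed gate. */
	for (unsigned long age = 0; age <= 10; age += 2)
		printf("age=%2lu  idle cpu: E2=%d alt=%d   busy cpu: both=%d\n",
		       age, gate_e2(age, 1, decay), gate_alt(age, 1, decay),
		       gate_e2(age, 0, decay));
	return 0;
}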

Regards,
Erich