2002-02-22 18:57:29

by Mike Kravetz

Subject: NUMA scheduling

Below is a preliminary patch to implement some form of NUMA scheduling
on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early
code and brings up some issues that need to be discussed/explored in
more detail. This patch was created to form a basis for discussion,
rather than as a solution. The patch was created for the i386 based
NUMA system I have access to. It will not work on other architectures.
However, the only architecture specific code is a call to initialize
some of the NUMA specific scheduling data structures. Therefore, it
should be trivial to port.

Here is what the patch does:
- Creates 'cpu sets', which are sets of cpus that tasks can be scheduled
on. The cpus in these sets should all have something in common
such as local memory, a shared cache, etc.

- In general, once a task begins execution on a cpu set, it remains
on that cpu set. The load balancing code that exists in the K3
scheduler is used to balance the load among cpus within a set.

- Load balancing also takes place across cpu set boundaries. Load
balancing across set boundaries can happen:
- at exec time. This is an obvious choice since we have just thrown
away a task's address space and are about to create a new one. The
exec'ing task is migrated to the least loaded cpu set.
- when a cpu goes idle. At this time we first look for another
task to run within the cpu set. If there is no task available
within the set, then we attempt to get a task from the most
loaded cpu set.
- at specified intervals when a cpu is idle. In the K3 scheduler,
the timer code notices a cpu is idle and looks for more work
to do. First it checks other runqueues within the cpu set. If
nothing is found, it attempts to get a task from the most loaded
cpu set.
- at specified intervals when a cpu is busy. This is also kicked
off via the timer interrupt code. It simply attempts to move
tasks from the most loaded cpu sets to the least loaded cpu sets.
For the most part, cpu set load balancing is invoked in the same places
as cpu to cpu load balancing in the K3 scheduler. The only exception
is exec time task placement. In the patch, the various forms of
cpu set load balancing are behind #defines which can be turned on or
off for experimentation.
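
To make the above concrete, here is a simplified, illustrative sketch of
the cpu set selection logic. It mirrors balance_cpu_sets() in the patch
below but is not the patch itself: an idle cpu will consider any set with
runnable tasks, while a busy cpu only steals from a set that is more
loaded than its own.

/*
 * Sketch only: pick the most loaded cpu set to steal work from,
 * or return -1 if no set is worth stealing from.
 */
struct cpu_set_load {
	int id;
	int load_avg;		/* average runqueue length in the set */
};

static int pick_source_set(struct cpu_set_load *sets, int nsets,
			   int this_set, int idle)
{
	int this_load = sets[this_set].load_avg;
	int max_load = this_load > 1 ? this_load : (idle ? 0 : 1);
	int target = -1;
	int i;

	for (i = 0; i < nsets; i++) {
		if (i == this_set)
			continue;
		if (sets[i].load_avg > max_load) {
			max_load = sets[i].load_avg;
			target = i;
		}
	}
	if (target < 0 || max_load <= this_load)
		return -1;	/* nothing worth stealing from */
	return target;
}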

Things not addressed in the patch:
- Topology discovery. To create cpu sets, a routine numa_sched_init()
is called after we know how many cpus there are in the system.
Inside numa_sched_init(), there are hard-coded values used to indicate
the number of cpus per set, and the 'distance' between sets.
Obviously, creation of cpu sets should instead be based on topology
information that is not available at this time (see the sketch after
this list for one possible shape of that). In addition, there should
be support for (or at least an understanding of) non-symmetric and
multi-level topologies.
- Tasks should have a notion of a 'home' cpu set. Short term execution
of a task on a remote cpu set should only be done to take advantage
of idle cpus. Tasks should only execute for short periods of time on
remote cpu sets. Most task execution should occur on the home cpu set.
Long term execution on a remote cpu set would involve a full task
migration which includes moving/copying the memory associated with
the task.
- Partitioning of scheduling data. On a NUMA architecture, data
structures should be backed by memory local to the cpus which will be
accessing them most frequently. Therefore, things like runqueues
and cpu_sets should not be arrays.
- Better determination of what tasks to execute remotely/migrate.
The patch uses the existing code in load_balance(). Most likely this
will be sufficient.
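
As mentioned in the topology discovery item above, here is a purely
hypothetical sketch of what a topology-driven numa_sched_init() might
look like. cpus_per_node() and node_distance() are made-up placeholders
for whatever discovery interface eventually exists; they are not part of
this patch.

void numa_sched_init(void)
{
	int i, set;

	/* Hypothetical topology interfaces, not in this patch */
	numa_cpus_per_local_set = cpus_per_node();
	numa_cpu_set_distance = node_distance();

	numa_num_cpu_sets = 0;
	for (i = 0; i < smp_num_cpus; i++) {
		set = cpu_to_set_id(i);
		if (set >= numa_num_cpu_sets) {
			/* first cpu seen in this set: initialize it */
			cpu_sets[set].cs_lock = SPIN_LOCK_UNLOCKED;
			cpu_sets[set].cs_local_cpus = 0UL;
			cpu_sets[set].cs_load_avg = 0;
			numa_num_cpu_sets = set + 1;
		}
		cpu_sets[set].cs_local_cpus |= 1UL << i;
		cpu_rq(i)->cpu_set = &cpu_sets[set];
	}
}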

--
Mike

diff -Naur linux-2.4.17-sched.orig/arch/i386/kernel/smpboot.c linux-2.4.17-ns/arch/i386/kernel/smpboot.c
--- linux-2.4.17-sched.orig/arch/i386/kernel/smpboot.c Thu Feb 7 17:43:14 2002
+++ linux-2.4.17-ns/arch/i386/kernel/smpboot.c Mon Feb 11 18:54:33 2002
@@ -1198,6 +1198,11 @@
}
}
}
+
+ /*
+ * Hack to get cpu sets initialized on NUMA architectures
+ */
+ numa_sched_init();

#ifndef CONFIG_VISWS
/*
diff -Naur linux-2.4.17-sched.orig/fs/exec.c linux-2.4.17-ns/fs/exec.c
--- linux-2.4.17-sched.orig/fs/exec.c Fri Dec 21 17:41:55 2001
+++ linux-2.4.17-ns/fs/exec.c Mon Feb 11 19:39:59 2002
@@ -860,6 +860,10 @@
int retval;
int i;

+#if CONFIG_SMP
+ sched_exec_migrate();
+#endif
+
file = open_exec(filename);

retval = PTR_ERR(file);
diff -Naur linux-2.4.17-sched.orig/include/linux/sched.h linux-2.4.17-ns/include/linux/sched.h
--- linux-2.4.17-sched.orig/include/linux/sched.h Thu Feb 7 20:25:29 2002
+++ linux-2.4.17-ns/include/linux/sched.h Tue Feb 12 22:45:59 2002
@@ -152,6 +152,7 @@
extern void sched_task_migrated(task_t *p);
extern void smp_migrate_task(int cpu, task_t *task);
extern unsigned long cache_decay_ticks;
+extern void sched_exec_migrate(void);

#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -Naur linux-2.4.17-sched.orig/kernel/sched.c linux-2.4.17-ns/kernel/sched.c
--- linux-2.4.17-sched.orig/kernel/sched.c Thu Feb 7 17:43:13 2002
+++ linux-2.4.17-ns/kernel/sched.c Thu Feb 14 16:59:57 2002
@@ -20,6 +20,53 @@
#include <linux/interrupt.h>
#include <asm/mmu_context.h>

+/*
+ * Definitions that depend on/should be part of NUMA topology discovery
+ */
+/*
+ * Periodically poll for new (off cpu set) work as part of the idle loop
+#define IDLE_LOOP_REBALANCE
+ */
+/*
+ * Look for off cpu set work before going idle
+ */
+#define GOING_IDLE_REBALANCE
+/*
+ * Attempt to balance load between cpu sets (even when busy)
+#define BUSY_REBALANCE
+ */
+/*
+ * Send tasks to least loaded CPU at exec time
+ */
+#define EXEC_BALANCE
+
+#define NR_CPU_SETS NR_CPUS
+
+union cpu_set {
+ struct cpu_set_data {
+ spinlock_t lock;
+ unsigned long local_cpus;
+ unsigned long load_avg;
+ } csd;
+ char __pad [SMP_CACHE_BYTES];
+};
+typedef union cpu_set cpu_set_t;
+#define cs_lock csd.lock
+#define cs_local_cpus csd.local_cpus
+#define cs_load_avg csd.load_avg
+
+static int numa_num_cpu_sets = 0;
+static int numa_cpus_per_local_set = NR_CPUS;
+static int numa_cpu_set_distance = 1;
+static cpu_set_t cpu_sets[NR_CPU_SETS] __cacheline_aligned;
+
+#define cpu_to_set_id(c) ((c) / numa_cpus_per_local_set)
+#define next_cs_id(c) ((c) + 1 == numa_num_cpu_sets ? 0 : (c) + 1)
+#define cpu_to_set(c) (cpu_sets + ((c) / numa_cpus_per_local_set))
+/*
+ * End of NUMA definitions
+ */
+
#define BITMAP_SIZE ((((MAX_PRIO+7)/8)+sizeof(long)-1)/sizeof(long))

typedef struct runqueue runqueue_t;
@@ -41,6 +88,7 @@
*/
struct runqueue {
spinlock_t lock;
+ cpu_set_t *cpu_set;
unsigned long nr_running, nr_switches, expired_timestamp;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
@@ -395,11 +443,13 @@
static void load_balance(runqueue_t *this_rq, int idle)
{
int imbalance, nr_running, load, max_load,
- idx, i, this_cpu = smp_processor_id();
+ idx, i, this_cpu = smp_processor_id(),
+ total_set_load = 0, cpus_in_set = 0;
task_t *next = this_rq->idle, *tmp;
runqueue_t *busiest, *rq_src;
prio_array_t *array;
list_t *head, *curr;
+ unsigned long target_cpu_set;

/*
* We search all runqueues to find the most busy one.
@@ -430,7 +480,11 @@

busiest = NULL;
max_load = 1;
+ target_cpu_set = this_rq->cpu_set->cs_local_cpus;
for (i = 0; i < smp_num_cpus; i++) {
+ /* Skip CPUs not in the target set */
+ if (!(target_cpu_set & (1UL << cpu_logical_map(i))))
+ continue;
rq_src = cpu_rq(cpu_logical_map(i));
if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
load = rq_src->nr_running;
@@ -438,6 +492,9 @@
load = this_rq->prev_nr_running[i];
this_rq->prev_nr_running[i] = rq_src->nr_running;

+ total_set_load += load;
+ cpus_in_set++;
+
if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
max_load = load;
@@ -458,10 +515,20 @@
* Make sure nothing changed since we checked the
* runqueue length.
*/
- if (busiest->nr_running <= this_rq->nr_running + 1)
+ if (busiest->nr_running <= nr_running + 1)
goto out_unlock;

/*
+ * Update cpu set load average
+ */
+ total_set_load = total_set_load / cpus_in_set;
+ if (total_set_load != this_rq->cpu_set->cs_load_avg) {
+ spin_lock(&this_rq->cpu_set->cs_lock);
+ this_rq->cpu_set->cs_load_avg = total_set_load;
+ spin_unlock(&this_rq->cpu_set->cs_lock);
+ }
+
+ /*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
* be cache-cold, thus switching CPUs has the least effect
@@ -534,6 +601,49 @@
spin_unlock(&busiest->lock);
}

+#if defined(IDLE_LOOP_REBALANCE) || defined(GOING_IDLE_REBALANCE) || \
+ defined(BUSY_REBALANCE)
+/*
+ * Load balance CPU sets
+ */
+static void balance_cpu_sets(runqueue_t *this_rq, int idle)
+{
+ int i, this_load, max_load;
+ cpu_set_t *this_set, *target_set = NULL;
+
+ this_set = cpu_to_set(smp_processor_id());
+ this_load = this_set->cs_load_avg;
+
+ if (this_load > 1)
+ max_load = this_load;
+ else {
+ if (idle)
+ max_load = 0; /* Looking for anything! */
+ else
+ max_load = 1;
+ }
+
+ for(i = 0; i < numa_num_cpu_sets; i++) {
+ if (cpu_sets[i].cs_load_avg > max_load) {
+ max_load = cpu_sets[i].cs_load_avg;
+ target_set = &cpu_sets[i];
+ }
+ }
+
+ if (!target_set || (max_load <= this_load))
+ return;
+
+ /*
+ * We point current CPU at target cpu_set. This is safe
+ * because the caller ensures the current CPUs runqueue
+ * is locked.
+ */
+ this_rq()->cpu_set = target_set;
+ load_balance(this_rq, idle);
+ this_rq()->cpu_set = this_set;
+}
+#endif
+
/*
* One of the idle_cpu_tick() or the busy_cpu_tick() function will
* gets called every timer tick, on every CPU. Our balancing action
@@ -545,16 +655,104 @@
*/
#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define CS_IDLE_REBALANCE_TICK (IDLE_REBALANCE_TICK * numa_cpu_set_distance)
+#define CS_BUSY_REBALANCE_TICK (BUSY_REBALANCE_TICK * numa_cpu_set_distance)

static inline void idle_tick(void)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ unsigned long now = jiffies;
+
+ if (now % IDLE_REBALANCE_TICK)
return;
spin_lock(&this_rq()->lock);
load_balance(this_rq(), 1);
+
+#ifdef IDLE_LOOP_REBALANCE
+ /*
+ * If there are no waiting tasks on the local cpu set, then
+ * at least take a look at other sets.
+ */
+ if (!this_rq()->nr_running && !(now % CS_IDLE_REBALANCE_TICK))
+ balance_cpu_sets(this_rq(), 1);
+#endif
+
spin_unlock(&this_rq()->lock);
}

+/*
+ * Code to be executed as part of the exec path on NUMA architectures.
+ * Since exec throws away the old process image, there is little
+ * advantage to keeping the task on the same cpu set. Therefore, we
+ * consider moving the task to the least loaded cpu set.
+ */
+void sched_exec_migrate(void)
+{
+#ifdef EXEC_BALANCE
+ int this_cpu, this_cs_id, this_load_avg, min_load_avg,
+ i, target_cs_id, target_cpu;
+ unsigned long target_cs_mask,
+ cpus_allowed = current->cpus_allowed;
+ runqueue_t *rq;
+
+ /*
+ * Only consider migration if other tasks are waiting for
+ * this CPU, and the load average for this cpu set is
+ * greater than 1. Also, make sure the task is allowed
+ * to run on another cpu set.
+ */
+ this_cpu = smp_processor_id();
+ if (cpu_rq(this_cpu)->nr_running < 2)
+ return;
+
+ this_cs_id = cpu_to_set_id(this_cpu);
+ this_load_avg = cpu_sets[this_cs_id].cs_load_avg;
+ if (this_load_avg < 1)
+ return;
+
+ if (!(cpus_allowed & ~(cpu_sets[this_cs_id].cs_local_cpus)))
+ return;
+
+ /*
+ * Look for another cpu set with a lower load average.
+ */
+ min_load_avg = this_load_avg;
+ target_cs_id = this_cs_id;
+ for (i = next_cs_id(this_cs_id); i != this_cs_id; i = next_cs_id(i)) {
+ if (!(cpus_allowed & cpu_sets[i].cs_local_cpus))
+ continue;
+ if (cpu_sets[i].cs_load_avg < min_load_avg) {
+ min_load_avg = cpu_sets[i].cs_load_avg;
+ target_cs_id = i;
+ }
+ }
+ if (!(min_load_avg < this_load_avg))
+ return;
+
+ /*
+ * Find shortest runqueue on target cpu set.
+ */
+ min_load_avg = INT_MAX;
+ target_cs_mask = cpu_sets[target_cs_id].cs_local_cpus;
+ target_cpu = this_cpu;
+ for (i=0; i < smp_num_cpus; i++) {
+ if (!(cpus_allowed & target_cs_mask & (1UL << i)))
+ continue;
+ rq = cpu_rq(cpu_logical_map(i));
+ if (rq->nr_running < min_load_avg) {
+ min_load_avg = rq->nr_running;
+ target_cpu = cpu_logical_map(i);
+ }
+ }
+
+ /*
+ * Migrate task
+ */
+ current->state = TASK_UNINTERRUPTIBLE;
+ smp_migrate_task(target_cpu, current);
+ schedule();
+#endif
+}
+
#endif

/*
@@ -618,6 +816,11 @@
#if CONFIG_SMP
if (!(now % BUSY_REBALANCE_TICK))
load_balance(rq, 0);
+
+#ifdef BUSY_REBALANCE
+ if (!(now % CS_BUSY_REBALANCE_TICK))
+ balance_cpu_sets(rq, 0);
+#endif
#endif
spin_unlock(&rq->lock);
}
@@ -659,6 +862,10 @@
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
load_balance(rq, 1);
+#ifdef GOING_IDLE_REBALANCE
+ if (!rq->nr_running)
+ balance_cpu_sets(rq, 1);
+#endif
if (rq->nr_running)
goto pick_next_task;
#endif
@@ -1332,10 +1539,20 @@
runqueue_t *rq;
int i, j, k;

+ /*
+ * Default is to have a single cpu set containing all CPUs
+ */
+ numa_num_cpu_sets = 0;
+ cpu_sets[0].cs_lock = SPIN_LOCK_UNLOCKED;
+ cpu_sets[0].cs_local_cpus = -1;
+ cpu_sets[0].cs_load_avg = 0;
+
for (i = 0; i < NR_CPUS; i++) {
runqueue_t *rq = cpu_rq(i);
prio_array_t *array;

+ rq->cpu_set = &cpu_sets[0];
+
rq->active = rq->arrays + 0;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
@@ -1372,3 +1589,42 @@
atomic_inc(&init_mm.mm_count);
enter_lazy_tlb(&init_mm, current, smp_processor_id());
}
+
+void numa_sched_init(void)
+{
+ int i, cpu_set_num;
+ unsigned long numa_local_cpu_set_mask;
+
+#define NUMA_CPUS_PER_LOCAL_SET 4
+#define NUMA_CPU_SET_DISTANCE 2
+
+ numa_cpus_per_local_set = NUMA_CPUS_PER_LOCAL_SET;
+ numa_cpu_set_distance = NUMA_CPU_SET_DISTANCE;
+
+ numa_local_cpu_set_mask = 0UL;
+ for (i=0; i<numa_cpus_per_local_set; i++) {
+ numa_local_cpu_set_mask = numa_local_cpu_set_mask << 1;
+ numa_local_cpu_set_mask |= 0x1UL;
+ }
+
+ /*
+ * Make sure all runqueues point to the correct cpu_set
+ */
+ cpu_set_num = -1;
+ numa_num_cpu_sets = 0;
+ for (i=0; i<smp_num_cpus; i++) {
+ if (cpu_to_set_id(i) != cpu_set_num) {
+ cpu_set_num = cpu_to_set_id(i);
+
+ cpu_sets[cpu_set_num].cs_lock = SPIN_LOCK_UNLOCKED;
+ cpu_sets[cpu_set_num].cs_local_cpus =
+ numa_local_cpu_set_mask <<
+ (cpu_set_num * numa_cpus_per_local_set);
+ cpu_sets[cpu_set_num].cs_load_avg = 0;
+
+ numa_num_cpu_sets++;
+ }
+
+ runqueues[i].cpu_set = &cpu_sets[cpu_set_num];
+ }
+}


2002-02-22 19:14:32

by Jesse Barnes

Subject: Re: [Lse-tech] NUMA scheduling

On Fri, Feb 22, 2002 at 10:56:06AM -0800, Mike Kravetz wrote:
> Below is preliminary patch to implement some form of NUMA scheduling
> on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early
> code and brings up some issues that need to be discussed/explored in
> more detail. This patch was created to form a basis for discussion,
> rather than as a solution. The patch was created for the i386 based
> NUMA system I have access to. It will not work on other architectures.
> However, the only architecture specific code is a call to initialize
> some of the NUMA specific scheduling data structures. Therefore, it
> should be trivial to port.

Ah, you beat me to it; I was working on code very similar to this when
I got your e-mail. I think this sort of thing will address the
problem we've been seeing on machines with more than 16 cpus or so
(IDLE_REBALANCE_TICK is too small, flooding CPUs with load_balance
requests), as well as making numa scheduling a little more efficient.
I'll see if I can make it work on our platform and let you know how it
goes.

Jesse

2002-02-22 19:30:23

by Peter Rival

Subject: Re: [Lse-tech] NUMA scheduling

Ditto. Maybe I'll even get a chance to check it out on an EV7. ;)

- Pete

Jesse Barnes wrote:

> On Fri, Feb 22, 2002 at 10:56:06AM -0800, Mike Kravetz wrote:
> > Below is preliminary patch to implement some form of NUMA scheduling
> > on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early
> > code and brings up some issues that need to be discussed/explored in
> > more detail. This patch was created to form a basis for discussion,
> > rather than as a solution. The patch was created for the i386 based
> > NUMA system I have access to. It will not work on other architectures.
> > However, the only architecture specific code is a call to initialize
> > some of the NUMA specific scheduling data structures. Therefore, it
> > should be trivial to port.
>
> Ah, you beat me to it; I was working on code very similar to this when
> I got your e-mail. I think this sort of thing will address the
> problem we've been seeing on machines with more than 16 cpus or so
> (IDLE_REBALANCE_TICK is too small, flooding CPUs with load_balance
> requests), as well as making numa scheduling a little more efficient.
> I'll see if I can make it work on our platform and let you know how it
> goes.
>
> Jesse
>

2002-02-23 00:00:41

by Mike Kravetz

Subject: Re: [Lse-tech] NUMA scheduling

On Fri, Feb 22, 2002 at 10:56:06AM -0800, Mike Kravetz wrote:
> Below is preliminary patch to implement some form of NUMA scheduling
> on top of Ingo's K3 scheduler patch for 2.4.17.

My apologies, the previously included patch was created on top of
Ingo's J9 scheduler patch. The patch below is built on K3.

--
Mike

diff -Naur linux-2.4.17-K3/arch/i386/kernel/smpboot.c linux-2.4.17-K3-ns/arch/i386/kernel/smpboot.c
--- linux-2.4.17-K3/arch/i386/kernel/smpboot.c Thu Feb 14 18:42:53 2002
+++ linux-2.4.17-K3-ns/arch/i386/kernel/smpboot.c Thu Feb 14 17:38:51 2002
@@ -1198,6 +1198,11 @@
}
}
}
+
+ /*
+ * Hack to get cpu sets initialized on NUMA architectures
+ */
+ numa_sched_init();

#ifndef CONFIG_VISWS
/*
diff -Naur linux-2.4.17-K3/fs/exec.c linux-2.4.17-K3-ns/fs/exec.c
--- linux-2.4.17-K3/fs/exec.c Fri Dec 21 17:41:55 2001
+++ linux-2.4.17-K3-ns/fs/exec.c Thu Feb 14 17:38:51 2002
@@ -860,6 +860,10 @@
int retval;
int i;

+#if CONFIG_SMP
+ sched_exec_migrate();
+#endif
+
file = open_exec(filename);

retval = PTR_ERR(file);
diff -Naur linux-2.4.17-K3/include/linux/sched.h linux-2.4.17-K3-ns/include/linux/sched.h
--- linux-2.4.17-K3/include/linux/sched.h Thu Feb 14 18:46:54 2002
+++ linux-2.4.17-K3-ns/include/linux/sched.h Thu Feb 14 17:52:18 2002
@@ -152,6 +152,7 @@
extern void sched_task_migrated(task_t *p);
extern void smp_migrate_task(int cpu, task_t *task);
extern unsigned long cache_decay_ticks;
+extern void sched_exec_migrate(void);

#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
diff -Naur linux-2.4.17-K3/kernel/sched.c linux-2.4.17-K3-ns/kernel/sched.c
--- linux-2.4.17-K3/kernel/sched.c Thu Feb 14 18:42:52 2002
+++ linux-2.4.17-K3-ns/kernel/sched.c Thu Feb 14 17:40:47 2002
@@ -22,6 +22,53 @@
#include <linux/kernel_stat.h>

/*
+ * Definitions that depend on/should be part of NUMA topology discovery
+ */
+/*
+ * Periodically poll for new (off cpu set) work as part of the idle loop
+#define IDLE_LOOP_REBALANCE
+ */
+/*
+ * Look for off cpu set work before going idle
+ */
+#define GOING_IDLE_REBALANCE
+/*
+ * Attempt to balance load between cpu sets (even when busy)
+#define BUSY_REBALANCE
+ */
+/*
+ * Send tasks to least loaded CPU at exec time
+ */
+#define EXEC_BALANCE
+
+#define NR_CPU_SETS NR_CPUS
+
+union cpu_set {
+ struct cpu_set_data {
+ spinlock_t lock;
+ unsigned long local_cpus;
+ unsigned long load_avg;
+ } csd;
+ char __pad [SMP_CACHE_BYTES];
+};
+typedef union cpu_set cpu_set_t;
+#define cs_lock csd.lock
+#define cs_local_cpus csd.local_cpus
+#define cs_load_avg csd.load_avg
+
+static int numa_num_cpu_sets = 0;
+static int numa_cpus_per_local_set = NR_CPUS;
+static int numa_cpu_set_distance = 1;
+static cpu_set_t cpu_sets[NR_CPU_SETS] __cacheline_aligned;
+
+#define cpu_to_set_id(c) ((c) / numa_cpus_per_local_set)
+#define next_cs_id(c) ((c) + 1 == numa_num_cpu_sets ? 0 : (c) + 1)
+#define cpu_to_set(c) (cpu_sets + ((c) / numa_cpus_per_local_set))
+/*
+ * End of NUMA definitions
+ */
+
+/*
* Priority of a process goes from 0 to 139. The 0-99
* priority range is allocated to RT tasks, the 100-139
* range is for SCHED_OTHER tasks. Priority values are
@@ -140,6 +187,7 @@
*/
struct runqueue {
spinlock_t lock;
+ cpu_set_t *cpu_set;
unsigned long nr_running, nr_switches, expired_timestamp;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
@@ -484,11 +532,13 @@
static void load_balance(runqueue_t *this_rq, int idle)
{
int imbalance, nr_running, load, max_load,
- idx, i, this_cpu = smp_processor_id();
+ idx, i, this_cpu = smp_processor_id(),
+ total_set_load = 0, cpus_in_set = 0;
task_t *next = this_rq->idle, *tmp;
runqueue_t *busiest, *rq_src;
prio_array_t *array;
list_t *head, *curr;
+ unsigned long target_cpu_set;

/*
* We search all runqueues to find the most busy one.
@@ -519,7 +569,11 @@

busiest = NULL;
max_load = 1;
+ target_cpu_set = this_rq->cpu_set->cs_local_cpus;
for (i = 0; i < smp_num_cpus; i++) {
+ /* Skip CPUs not in the target set */
+ if (!(target_cpu_set & (1UL << cpu_logical_map(i))))
+ continue;
rq_src = cpu_rq(cpu_logical_map(i));
if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
load = rq_src->nr_running;
@@ -527,6 +581,9 @@
load = this_rq->prev_nr_running[i];
this_rq->prev_nr_running[i] = rq_src->nr_running;

+ total_set_load += load;
+ cpus_in_set++;
+
if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
max_load = load;
@@ -547,10 +604,20 @@
* Make sure nothing changed since we checked the
* runqueue length.
*/
- if (busiest->nr_running <= this_rq->nr_running + 1)
+ if (busiest->nr_running <= nr_running + 1)
goto out_unlock;

/*
+ * Update cpu set load average
+ */
+ total_set_load = total_set_load / cpus_in_set;
+ if (total_set_load != this_rq->cpu_set->cs_load_avg) {
+ spin_lock(&this_rq->cpu_set->cs_lock);
+ this_rq->cpu_set->cs_load_avg = total_set_load;
+ spin_unlock(&this_rq->cpu_set->cs_lock);
+ }
+
+ /*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
* be cache-cold, thus switching CPUs has the least effect
@@ -623,6 +690,49 @@
spin_unlock(&busiest->lock);
}

+#if defined(IDLE_LOOP_REBALANCE) || defined(GOING_IDLE_REBALANCE) || \
+ defined(BUSY_REBALANCE)
+/*
+ * Load balance CPU sets
+ */
+static void balance_cpu_sets(runqueue_t *this_rq, int idle)
+{
+ int i, this_load, max_load;
+ cpu_set_t *this_set, *target_set = NULL;
+
+ this_set = cpu_to_set(smp_processor_id());
+ this_load = this_set->cs_load_avg;
+
+ if (this_load > 1)
+ max_load = this_load;
+ else {
+ if (idle)
+ max_load = 0; /* Looking for anything! */
+ else
+ max_load = 1;
+ }
+
+ for(i = 0; i < numa_num_cpu_sets; i++) {
+ if (cpu_sets[i].cs_load_avg > max_load) {
+ max_load = cpu_sets[i].cs_load_avg;
+ target_set = &cpu_sets[i];
+ }
+ }
+
+ if (!target_set || (max_load <= this_load))
+ return;
+
+ /*
+ * We point current CPU at target cpu_set. This is safe
+ * because the caller ensures the current CPUs runqueue
+ * is locked.
+ */
+ this_rq()->cpu_set = target_set;
+ load_balance(this_rq, idle);
+ this_rq()->cpu_set = this_set;
+}
+#endif
+
/*
* One of the idle_cpu_tick() or the busy_cpu_tick() function will
* gets called every timer tick, on every CPU. Our balancing action
@@ -634,16 +744,104 @@
*/
#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define CS_IDLE_REBALANCE_TICK (IDLE_REBALANCE_TICK * numa_cpu_set_distance)
+#define CS_BUSY_REBALANCE_TICK (BUSY_REBALANCE_TICK * numa_cpu_set_distance)

static inline void idle_tick(void)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ unsigned long now = jiffies;
+
+ if (now % IDLE_REBALANCE_TICK)
return;
spin_lock(&this_rq()->lock);
load_balance(this_rq(), 1);
+
+#ifdef IDLE_LOOP_REBALANCE
+ /*
+ * If there are no waiting tasks on the local cpu set, then
+ * at least take a look at other sets.
+ */
+ if (!this_rq()->nr_running && !(now % CS_IDLE_REBALANCE_TICK))
+ balance_cpu_sets(this_rq(), 1);
+#endif
+
spin_unlock(&this_rq()->lock);
}

+/*
+ * Code to be executed as part of the exec path on NUMA architectures.
+ * Since exec throws away the old process image, there is little
+ * advantage to keeping the task on the same cpu set. Therefore, we
+ * consider moving the task to the least loaded cpu set.
+ */
+void sched_exec_migrate(void)
+{
+#ifdef EXEC_BALANCE
+ int this_cpu, this_cs_id, this_load_avg, min_load_avg,
+ i, target_cs_id, target_cpu;
+ unsigned long target_cs_mask,
+ cpus_allowed = current->cpus_allowed;
+ runqueue_t *rq;
+
+ /*
+ * Only consider migration if other tasks are waiting for
+ * this CPU, and the load average for this cpu set is
+ * greater than 1. Also, make sure the task is allowed
+ * to run on another cpu set.
+ */
+ this_cpu = smp_processor_id();
+ if (cpu_rq(this_cpu)->nr_running < 2)
+ return;
+
+ this_cs_id = cpu_to_set_id(this_cpu);
+ this_load_avg = cpu_sets[this_cs_id].cs_load_avg;
+ if (this_load_avg < 1)
+ return;
+
+ if (!(cpus_allowed & ~(cpu_sets[this_cs_id].cs_local_cpus)))
+ return;
+
+ /*
+ * Look for another cpu set with a lower load average.
+ */
+ min_load_avg = this_load_avg;
+ target_cs_id = this_cs_id;
+ for (i = next_cs_id(this_cs_id); i != this_cs_id; i = next_cs_id(i)) {
+ if (!(cpus_allowed & cpu_sets[i].cs_local_cpus))
+ continue;
+ if (cpu_sets[i].cs_load_avg < min_load_avg) {
+ min_load_avg = cpu_sets[i].cs_load_avg;
+ target_cs_id = i;
+ }
+ }
+ if (!(min_load_avg < this_load_avg))
+ return;
+
+ /*
+ * Find shortest runqueue on target cpu set.
+ */
+ min_load_avg = INT_MAX;
+ target_cs_mask = cpu_sets[target_cs_id].cs_local_cpus;
+ target_cpu = this_cpu;
+ for (i=0; i < smp_num_cpus; i++) {
+ if (!(cpus_allowed & target_cs_mask & (1UL << i)))
+ continue;
+ rq = cpu_rq(cpu_logical_map(i));
+ if (rq->nr_running < min_load_avg) {
+ min_load_avg = rq->nr_running;
+ target_cpu = cpu_logical_map(i);
+ }
+ }
+
+ /*
+ * Migrate task
+ */
+ current->state = TASK_UNINTERRUPTIBLE;
+ smp_migrate_task(target_cpu, current);
+ schedule();
+#endif
+}
+
#endif

/*
@@ -732,6 +930,11 @@
#if CONFIG_SMP
if (!(jiffies % BUSY_REBALANCE_TICK))
load_balance(rq, 0);
+
+#ifdef BUSY_REBALANCE
+ if (!(now % CS_BUSY_REBALANCE_TICK))
+ balance_cpu_sets(rq, 0);
+#endif
#endif
spin_unlock(&rq->lock);
}
@@ -772,6 +975,10 @@
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
load_balance(rq, 1);
+#ifdef GOING_IDLE_REBALANCE
+ if (!rq->nr_running)
+ balance_cpu_sets(rq, 1);
+#endif
if (rq->nr_running)
goto pick_next_task;
#endif
@@ -1488,10 +1695,20 @@
runqueue_t *rq;
int i, j, k;

+ /*
+ * Default is to have a single cpu set containing all CPUs
+ */
+ numa_num_cpu_sets = 0;
+ cpu_sets[0].cs_lock = SPIN_LOCK_UNLOCKED;
+ cpu_sets[0].cs_local_cpus = -1;
+ cpu_sets[0].cs_load_avg = 0;
+
for (i = 0; i < NR_CPUS; i++) {
runqueue_t *rq = cpu_rq(i);
prio_array_t *array;

+ rq->cpu_set = &cpu_sets[0];
+
rq->active = rq->arrays + 0;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
@@ -1528,3 +1745,42 @@
atomic_inc(&init_mm.mm_count);
enter_lazy_tlb(&init_mm, current, smp_processor_id());
}
+
+void numa_sched_init(void)
+{
+ int i, cpu_set_num;
+ unsigned long numa_local_cpu_set_mask;
+
+#define NUMA_CPUS_PER_LOCAL_SET 4
+#define NUMA_CPU_SET_DISTANCE 2
+
+ numa_cpus_per_local_set = NUMA_CPUS_PER_LOCAL_SET;
+ numa_cpu_set_distance = NUMA_CPU_SET_DISTANCE;
+
+ numa_local_cpu_set_mask = 0UL;
+ for (i=0; i<numa_cpus_per_local_set; i++) {
+ numa_local_cpu_set_mask = numa_local_cpu_set_mask << 1;
+ numa_local_cpu_set_mask |= 0x1UL;
+ }
+
+ /*
+ * Make sure all runqueues point to the correct cpu_set
+ */
+ cpu_set_num = -1;
+ numa_num_cpu_sets = 0;
+ for (i=0; i<smp_num_cpus; i++) {
+ if (cpu_to_set_id(i) != cpu_set_num) {
+ cpu_set_num = cpu_to_set_id(i);
+
+ cpu_sets[cpu_set_num].cs_lock = SPIN_LOCK_UNLOCKED;
+ cpu_sets[cpu_set_num].cs_local_cpus =
+ numa_local_cpu_set_mask <<
+ (cpu_set_num * numa_cpus_per_local_set);
+ cpu_sets[cpu_set_num].cs_load_avg = 0;
+
+ numa_num_cpu_sets++;
+ }
+
+ runqueues[i].cpu_set = &cpu_sets[cpu_set_num];
+ }
+}

2002-02-25 18:32:31

by Erich Focht

Subject: Re: [Lse-tech] NUMA scheduling

Hello Mike,

On Fri, 22 Feb 2002, Mike Kravetz wrote:

> Below is preliminary patch to implement some form of NUMA scheduling
> on top of Ingo's K3 scheduler patch for 2.4.17. This is VERY early
> code and brings up some issues that need to be discussed/explored in
> more detail. This patch was created to form a basis for discussion,
> rather than as a solution. The patch was created for the i386 based
> NUMA system I have access to. It will not work on other architectures.
> However, the only architecture specific code is a call to initialize
> some of the NUMA specific scheduling data structures. Therefore, it
> should be trivial to port.

I am yet another one working on the NUMA extension of Ingo's
O(1) scheduler, having IA64 in mind, like Jesse, I guess. It's good to see
that so many people are taking care of this issue; it makes me hope that
we'll get a good solution in the end.

My approach differs from yours in some details, therefore I'll append
it to this email. It is for IA64 (DIG64 architectures), works fine on
NEC's AzusA Itanium server, and is also meant as a discussion basis, not as
a finished solution/project.

Some basic but minor differences to your patch are:
- The 'cpu sets' are called cpu pools. Arrays for looping over the CPUs
within one pool are provided.
- It is built on top of Ingo's solution for set_cpus_allowed(), let's call
it K3+ (for 2.4.17 kernels).

The more important differences to your patch:
- It addresses the notion of a "home node" for a task; this should be the node
from which a task should get its memory and on which it should preferably
run. I'm using this feature together with a memory affinity patch not
included here.
- Each task gets an additional "node" field in its task_struct,
- each runqueue tracks the number of tasks from different nodes,
- tasks coming from a remote node are migrated earlier.
- The load_balancing() concept is different:
- there are no special time intervals for balancing across pool
boundaries, the need for this can occur very quickly and I
have the feeling that 2*250ms is a long time for keeping the
nodes unbalanced. This means: each time load_balance() is called
it _can_ balance across pool boundaries (but doesn't have to).
- When load_balance() is called it first tries to balance the
local cpu pool. If this one is well balanced, it will try to
find the most loaded pool.
- Each runqueue has its own statistics arrays, thus gathering
the balancing information happens lockless. The information
is as "fresh" as possible. Again I think that with 2*250ms the
info in your cpu sets could be too old for certain cases.
- Initial balancing is done in do_fork() instead of do_execve(). Again: I
collect the current balance information because I have the feeling this
should be as "new" as possible. Using do_fork() means that we don't have
to call smp_migrate_task() (which went away in the latest version from
Ingo, but certainly we'd find some replacement), such that the task has
chances to live long on its initial CPU.
- When balancing initially I tried to take into account the case of cpu
pools of different sizes as well as cpus_allowed masks covering only parts
of a cpu pool.


I didn't address things like topology discovery, either. The topology is
hidden behind a macro called CPU_TO_NODE() and is currently considered to
have two levels (cpus and nodes). It is easy to extend this concept to a
multi-level scheduler (e.g. cpus -> multi-core packages -> nodes ->
supernodes) but I didn't have to (yet?). In such a scheduler the tasks
wouldn't get too far from their home node because we would first try to
balance locally and over short distances.

Also, the load balancing decisions are only based on the CPU loads; surely
there are some more criteria we might want to use (memory consumption...).

Coming back to the discussion, which is partly embedded in the
description of the patch:

I like the way your cpu sets are defined, this is much more elegant
than my solution. But your only way to address a set is by its cpu mask,
which means that when you want to address a set you have to loop over all
smp_num_cpus. With large numbers of cpus this could become a
problem. Anyhow, this is minor and can be changed easily, as can the
problem of having cpu sets of different sizes.
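
To make the difference concrete, here is a small sketch of the two
iteration styles, simplified from the two patches (set_mask stands for
the cs_local_cpus mask of a set):

	/* Addressing a set by cpu mask (cpu set patch): every lookup
	 * walks all online cpus and tests the mask.
	 */
	for (i = 0; i < smp_num_cpus; i++) {
		if (!(set_mask & (1UL << cpu_logical_map(i))))
			continue;
		/* ... cpu i belongs to the set ... */
	}

	/* Addressing a pool through the compressed-row arrays (cpu pool
	 * patch): only the cpus of that pool are touched.
	 */
	for (ii = pool_ptr[pool]; ii < pool_ptr[pool + 1]; ii++) {
		i = pool_cpus[ii];
		/* ... cpu pool_cpus[ii] belongs to the pool ... */
	}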

Of more concern is the reuse of old balancing data. For busy CPUs I think
the interval of 0.5s is far too long. Maybe this would make more sense if
the data were some running average; otherwise I would expect mistakes
in the balancing decisions. Especially for the initial balancing, where I
would like to decide on which node the memory of the task should be. But I
still believe that this long term decision should be taken with the best
data available! Just think about quickly forking several tasks: they
might all go to the same node and lead to a terrible imbalance. Not that
bad if you take them to another node after a short while, but quite bad
if all the memory goes to their original node and most of the tasks have
to access it with poor latency.
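
One possible form of such a running average, sketched here only to
illustrate the idea (the 3/4 vs 1/4 weighting is arbitrary, not something
either patch implements):

	/*
	 * Illustrative only: exponentially decayed per-set load average,
	 * updated from the instantaneous runqueue lengths at each
	 * balancing tick.
	 */
	static void update_set_load_avg(cpu_set_t *set, int instant_load)
	{
		set->cs_load_avg = (3 * set->cs_load_avg + instant_load) / 4;
	}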

It would be interesting to hear opinions on initial balancing. What are the
pros and cons of balancing at do_fork() versus do_execve()? And it would be
interesting to learn about other approaches, too...

Best regards,

Erich

diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/arch/ia64/kernel/smpboot.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/arch/ia64/kernel/smpboot.c
--- 2.4.17-ia64-kdbv2.1-K3ia64/arch/ia64/kernel/smpboot.c Mon Feb 18 12:26:44 2002
+++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/arch/ia64/kernel/smpboot.c Fri Feb 22 14:11:45 2002
@@ -83,6 +83,8 @@
/* which logical CPU number maps to which CPU (physical APIC ID) */
volatile int ia64_cpu_to_sapicid[NR_CPUS];

+char node_number[NR_CPUS] __cacheline_aligned;
+
static volatile unsigned long cpu_callin_map;

struct smp_boot_data smp_boot_data __initdata;
@@ -134,6 +136,23 @@

__setup("nointroute", nointroute);

+/* simple sort routine for sorting the hardware CPU IDs */
+void __init
+bub_sort(int n, int *a)
+{
+ int end, j, t;
+
+ for (end = n-1; end >= 0; end--) {
+ for (j = 0; j < end; j++) {
+ if (a[j] > a[j+1]) {
+ t = a[j+1];
+ a[j+1] = a[j];
+ a[j] = t;
+ }
+ }
+ }
+}
+
void
sync_master (void *arg)
{
@@ -496,6 +515,13 @@
if (max_cpus != -1)
printk (KERN_INFO "Limiting CPUs to %d\n", max_cpus);

+#ifdef CONFIG_IA64_DIG
+ /*
+ * To be on the safe side: sort SAPIC IDs of CPUs
+ */
+ bub_sort(smp_boot_data.cpu_count, &smp_boot_data.cpu_phys_id[0]);
+#endif
+
if (smp_boot_data.cpu_count > 1) {

printk(KERN_INFO "SMP: starting up secondaries.\n");
@@ -541,7 +567,24 @@
}
smp_done:
;
+#ifdef CONFIG_IA64_DIG
+ bld_node_number();
+ bld_pools();
+#endif
+}
+#ifdef CONFIG_IA64_DIG
+/* build translation table for CPU_TO_NODE macro */
+void __init
+bld_node_number(void)
+{
+ int cpu;
+
+ for (cpu = 0; cpu < NR_CPUS; cpu++)
+ if (cpu_online_map & (1<<cpu))
+ node_number[cpu] = SAPICID_TO_NODE(cpu_physical_id(cpu));
}
+#endif
+

/*
* Assume that CPU's have been discovered by some platform-dependant interface. For
@@ -575,3 +618,4 @@
* So let's try 10 ticks.
*/
unsigned long cache_decay_ticks=10;
+
diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/include/asm-ia64/smp.h 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/asm-ia64/smp.h
--- 2.4.17-ia64-kdbv2.1-K3ia64/include/asm-ia64/smp.h Thu Feb 21 19:33:12 2002
+++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/asm-ia64/smp.h Mon Feb 25 18:02:45 2002
@@ -13,6 +13,7 @@

#ifdef CONFIG_SMP

+#include <linux/cache.h>
#include <linux/init.h>
#include <linux/threads.h>
#include <linux/kernel.h>
@@ -113,6 +114,23 @@

#define NO_PROC_ID 0xffffffff /* no processor magic marker */

+extern char node_number[NR_CPUS] __cacheline_aligned;
+#ifdef CONFIG_IA64_DIG
+/* sooner or later this should be a configurable parameter [EF] */
+#define NR_NODES 4
+#define CPU_TO_NODE(cpu) node_number[cpu]
+/*
+ * This is the node ID on the NEC AzusA,
+ * on LION and BigSur it correctly initializes to node 0
+ */
+#define SAPICID_TO_NODE(hwid) ((hwid >> 12) & 0xff)
+#else
+/* need to be set for the specific platform! */
+#define NR_NODES 1
+#define CPU_TO_NODE(cpu) 0
+#define SAPICID_TO_NODE(hwid) 0
+#endif
+
extern void __init init_smp_config (void);
extern void smp_do_timer (struct pt_regs *regs);

diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/prctl.h 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/prctl.h
--- 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/prctl.h Mon Feb 4 12:41:39 2002
+++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/prctl.h Mon Feb 25 17:43:54 2002
@@ -26,4 +26,8 @@
# define PR_FPEMU_NOPRINT 1 /* silently emulate fp operations accesses */
# define PR_FPEMU_SIGFPE 2 /* don't emulate fp operations, send SIGFPE instead */

+/* Get/set node for node-affine scheduling */
+#define PR_GET_NODE 16
+#define PR_SET_NODE 17
+
#endif /* _LINUX_PRCTL_H */
diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/sched.h 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/sched.h
--- 2.4.17-ia64-kdbv2.1-K3ia64/include/linux/sched.h Thu Feb 21 19:33:12 2002
+++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/include/linux/sched.h Mon Feb 25 18:02:45 2002
@@ -314,6 +314,9 @@
unsigned long cpus_allowed;
unsigned int time_slice;

+ int node;
+ list_t node_list;
+
task_t *next_task, *prev_task;

struct mm_struct *mm, *active_mm;
diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/kernel/fork.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/fork.c
--- 2.4.17-ia64-kdbv2.1-K3ia64/kernel/fork.c Tue Feb 19 15:09:35 2002
+++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/fork.c Mon Feb 25 18:26:18 2002
@@ -639,11 +639,15 @@
#ifdef CONFIG_SMP
{
int i;
+ void sched_best_cpu(struct task_struct *p);
+#if NR_NODES > 1
+ void sched_best_node(struct task_struct *p);
+
+ if (!(clone_flags & CLONE_VM))
+ sched_best_node(p);
+#endif
+ sched_best_cpu(p);

- if (likely(p->cpus_allowed & (1UL<<smp_processor_id())))
- p->cpu = smp_processor_id();
- else
- p->cpu = __ffs(p->cpus_allowed);
/* ?? should we just memset this ?? */
for(i = 0; i < smp_num_cpus; i++)
p->per_cpu_utime[cpu_logical_map(i)] =
diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sched.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sched.c
--- 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sched.c Thu Feb 21 19:40:11 2002
+++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sched.c Mon Feb 25 18:12:29 2002
@@ -144,6 +144,9 @@
int prev_nr_running[NR_CPUS];
task_t *migration_thread;
list_t migration_queue;
+ int nr_homenode[NR_NODES];
+ int cpus_load[NR_CPUS];
+ int pools_load[NR_NODES];
} ____cacheline_aligned;

static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
@@ -163,6 +166,27 @@
return (p == task_rq(p)->curr);
}

+/*
+ * Variables for describing and accessing processor pools. Using a
+ * compressed row format like notation.
+ *
+ * numpools: number of CPU pools (nodes),
+ * pool_cpus[]: CPUs in pools sorted by their pool ID,
+ * pool_ptr[node]: index of first element in pool_cpus[] belonging to node.
+ * pool_mask[]: cpu mask of a pool.
+ *
+ * Example: loop over all CPUs in a pool p:
+ * for (i = pool_ptr[p]; i < pool_ptr[p+1]; i++) {
+ * cpu = pool_cpus[i];
+ * ...
+ * }
+ * <[email protected]>
+ */
+static int numpools;
+static int pool_ptr[NR_NODES+1];
+static int pool_cpus[NR_CPUS];
+static unsigned long pool_mask[NR_NODES];
+
static inline runqueue_t *task_rq_lock(task_t *p, unsigned long *flags)
{
struct runqueue *rq;
@@ -247,10 +271,12 @@
}
enqueue_task(p, array);
rq->nr_running++;
+ rq->nr_homenode[p->node]++;
}

static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
+ rq->nr_homenode[p->node]--;
rq->nr_running--;
dequeue_task(p, p->array);
p->array = NULL;
@@ -341,7 +367,7 @@

void wake_up_forked_process(task_t * p)
{
- runqueue_t *rq = this_rq();
+ runqueue_t *rq = task_rq(p);

spin_lock_irq(&rq->lock);
p->state = TASK_RUNNING;
@@ -356,7 +382,7 @@
p->prio = effective_prio(p);
}
INIT_LIST_HEAD(&p->migration_list);
- p->cpu = smp_processor_id();
+ //p->cpu = smp_processor_id();
activate_task(p, rq);
spin_unlock_irq(&rq->lock);
init_MUTEX(&p->migration_sem);
@@ -471,6 +497,110 @@
}

/*
+ * Find a runqueue from which to steal a task. We try to do this as locally as
+ * possible because we don't want to let tasks get far from their home node.
+ * This is done in two steps:
+ * 1. First try to find a runqueue within the own CPU pool (AKA node) with
+ * imbalance larger than 25% (relative to the current runqueue).
+ * 2. If the local node is well balanced, locate the most loaded node and its
+ * most loaded CPU. Remote runqueues running tasks having their homenode on the
+ * current node are preferred (those tasks count twice in the load calculation).
+ *
+ * This concept can be extended easily to more than two levels (multi-level
+ * scheduler?), e.g.: CPU -> multi-core package -> node -> supernode...
+ * <[email protected]>
+ */
+static runqueue_t *scan_pools(runqueue_t *this_rq, int idle, int *curr_running,
+ int *imbalance)
+{
+ runqueue_t *busiest, *rq_src;
+ int i, ii, load, max_load, pool, max_pool_load, max_pool_idx,
+ best_cpu, nr_running = *curr_running,
+ this_cpu = smp_processor_id(),
+ this_pool = CPU_TO_NODE(this_cpu);
+
+ busiest = NULL;
+ max_load = 1;
+
+ this_rq->pools_load[this_pool] = 0;
+ for (ii = pool_ptr[this_pool]; ii < pool_ptr[this_pool+1]; ii++) {
+ i = pool_cpus[ii];
+ rq_src = cpu_rq(cpu_logical_map(i));
+ if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ load = rq_src->nr_running;
+ else
+ load = this_rq->prev_nr_running[i];
+ this_rq->prev_nr_running[i] = rq_src->nr_running;
+ this_rq->pools_load[this_pool] += load;
+
+ if ((load > max_load) && (rq_src != this_rq)) {
+ busiest = rq_src;
+ max_load = load;
+ }
+ }
+
+ if (likely(!busiest))
+ goto scan_all;
+
+ *imbalance = (max_load - nr_running)/2;
+
+ /* It needs an at least ~25% imbalance to trigger balancing. */
+ if (idle || (*imbalance >= (max_load + 3)/4))
+ goto out;
+ else
+ busiest = NULL;
+
+ scan_all:
+ max_pool_load = this_rq->pools_load[this_pool];
+ max_pool_idx = this_pool;
+ for (pool = 0; pool < numpools; pool++) {
+ if (pool == this_pool) continue; // current pool already done
+ this_rq->pools_load[pool] = 0;
+ for (ii = pool_ptr[pool]; ii < pool_ptr[pool+1]; ii++) {
+ i = pool_cpus[ii];
+ rq_src = cpu_rq(cpu_logical_map(i));
+ if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ load = rq_src->nr_running;
+ else
+ load = this_rq->prev_nr_running[i];
+ this_rq->prev_nr_running[i] = rq_src->nr_running;
+ /* prefer RQs which have tasks from this node running */
+ load += rq_src->nr_homenode[this_pool];
+ this_rq->cpus_load[i] = load;
+ this_rq->pools_load[pool] += load;
+ }
+ if (this_rq->pools_load[pool] > max_pool_load) {
+ max_pool_load = this_rq->pools_load[pool];
+ max_pool_idx = pool;
+ }
+ }
+ if (likely(max_pool_idx == this_pool))
+ goto out;
+
+ *imbalance = (max_pool_load - this_rq->pools_load[this_pool])/2;
+ /* It needs an at least ~25% imbalance to trigger balancing. */
+ if (!idle && (*imbalance < (max_pool_load + 3)/4))
+ goto out;
+
+ /* find most loaded CPU within pool from which we'll try to steal a task */
+ best_cpu = pool_cpus[pool_ptr[max_pool_idx]];
+ max_load = this_rq->cpus_load[best_cpu];
+ for (ii = pool_ptr[max_pool_idx]+1; ii < pool_ptr[max_pool_idx+1]; ii++) {
+ i = pool_cpus[ii];
+ if (this_rq->cpus_load[i] > max_load) {
+ max_load = this_rq->cpus_load[i];
+ best_cpu = i;
+ }
+ }
+ *imbalance = (max_load - nr_running)/2;
+ /* It needs an at least ~25% imbalance to trigger balancing. */
+ if (idle || (*imbalance >= (max_load + 3)/4))
+ busiest = cpu_rq(cpu_logical_map(best_cpu));
+ out:
+ return busiest;
+}
+
+/*
* Current runqueue is empty, or rebalance tick: if there is an
* inbalance (current runqueue is too short) then pull from
* busiest runqueue(s).
@@ -480,12 +610,12 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, nr_running, load, max_load,
- idx, i, this_cpu = smp_processor_id();
+ int imbalance, nr_running, idx, this_cpu = smp_processor_id();
task_t *next = this_rq->idle, *tmp;
- runqueue_t *busiest, *rq_src;
+ runqueue_t *busiest;
prio_array_t *array;
list_t *head, *curr;
+ int this_pool=CPU_TO_NODE(this_cpu), take_own;

/*
* We search all runqueues to find the most busy one.
@@ -514,29 +644,9 @@
else
nr_running = this_rq->prev_nr_running[this_cpu];

- busiest = NULL;
- max_load = 1;
- for (i = 0; i < smp_num_cpus; i++) {
- rq_src = cpu_rq(cpu_logical_map(i));
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
- load = rq_src->nr_running;
- else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
-
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
- }
- }

- if (likely(!busiest))
- return;
-
- imbalance = (max_load - nr_running) / 2;
-
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (imbalance < (max_load + 3)/4))
+ busiest = scan_pools(this_rq, idle, &nr_running, &imbalance);
+ if (!busiest)
return;

nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
@@ -548,6 +658,14 @@
goto out_unlock;

/*
+ * Try to steal tasks coming from this_pool, if any
+ */
+ if (busiest->nr_homenode[this_pool])
+ take_own = 1;
+ else
+ take_own = 0;
+
+ /*
* We first consider expired tasks. Those will likely not be
* executed in the near future, and they are most likely to
* be cache-cold, thus switching CPUs has the least effect
@@ -589,7 +707,8 @@
#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
((p) != (rq)->curr) && \
- (tmp->cpus_allowed & (1 << (this_cpu))))
+ (tmp->cpus_allowed & (1 << (this_cpu))) && \
+ ((take_own && (tmp->node == this_pool)) || !take_own))

if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
curr = curr->next;
@@ -605,9 +724,11 @@
*/
dequeue_task(next, array);
busiest->nr_running--;
+ busiest->nr_homenode[next->node]--;
next->cpu = this_cpu;
this_rq->nr_running++;
enqueue_task(next, this_rq->active);
+ this_rq->nr_homenode[next->node]++;
if (next->prio < current->prio)
current->need_resched = 1;
if (!idle && --imbalance) {
@@ -1436,31 +1557,111 @@
spin_unlock(&rq2->lock);
}

-int sched_move_task(task_t *p, int src_cpu, int tgt_cpu)
+/* used as counter for round-robin node-scheduling */
+static atomic_t sched_node=ATOMIC_INIT(0);
+
+/*
+ * Find the least loaded CPU on the current node of the task.
+ */
+void sched_best_cpu(struct task_struct *p)
{
- int res = 0;
- unsigned long flags;
- runqueue_t *src = cpu_rq(src_cpu);
- runqueue_t *tgt = cpu_rq(tgt_cpu);
+ int nn, best_node = p->node, best_cpu, cpu, load, min_load;

- local_irq_save(flags);
- double_rq_lock(src, tgt);
- if (task_rq(p) != src || p == src->curr)
- goto out;
- if (p->cpu != tgt_cpu) {
- p->cpu = tgt_cpu;
- if (p->array) {
- deactivate_task(p, src);
- activate_task(p, tgt);
+ best_cpu = p->cpu;
+ min_load = task_rq(p)->nr_running;
+ for (nn = pool_ptr[best_node]; nn < pool_ptr[best_node+1]; nn++) {
+ cpu = cpu_logical_map(pool_cpus[nn]);
+ if (p->cpus_allowed & (1UL << cpu))
+ continue;
+ load = cpu_rq(cpu)->nr_running;
+ if (load < min_load) {
+ best_cpu = cpu;
+ min_load = load;
}
}
- res = 1;
- out:
- double_rq_unlock(src, tgt);
- local_irq_restore(flags);
- return res;
+ p->cpu = best_cpu;
+}
+
+/*
+ * Find the (relatively) least loaded node for the given process taking into
+ * account the cpus_allowed mask. Try to Round Robin search for the best node.
+ */
+void sched_best_node(struct task_struct *p)
+{
+ int i, nn, node=0, best_node, load, min_load;
+ int pool_load[NR_NODES] = { [0 ... NR_NODES-1] = 0};
+ int cpus_per_node[NR_NODES] = { [0 ... NR_NODES-1] = 0};
+ unsigned long mask = p->cpus_allowed & cpu_online_map;
+
+ do {
+ best_node = atomic_inc_return(&sched_node) % numpools;
+ } while (!(pool_mask[best_node] & mask));
+
+ for (i = 0; i < smp_num_cpus; i++) {
+ int cpu = cpu_logical_map(i);
+ if (!(mask & (1<<cpu)))
+ continue;
+ load = cpu_rq(cpu)->nr_running;
+ node = CPU_TO_NODE(cpu);
+ cpus_per_node[node]++;
+ pool_load[node] += 1000*load;
+ }
+ min_load = pool_load[best_node] / cpus_per_node[node];
+ for (nn = 1; nn < numpools; nn++) {
+ node = (best_node + nn) % numpools;
+ if (cpus_per_node[node] > 0) {
+ load = pool_load[node] / cpus_per_node[node];
+ if (load < min_load) {
+ min_load = load;
+ best_node = node;
+ }
+ }
+ }
+ p->node = best_node;
+ atomic_set(&sched_node, best_node);
}

+void pools_info(void)
+{
+ int i, j;
+
+ printk("CPU pools : %d\n",numpools);
+ for (i=0;i<numpools;i++) {
+ printk("pool %d :",i);
+ for (j=pool_ptr[i];j<pool_ptr[i+1];j++)
+ printk("%d ",pool_cpus[j]);
+ printk("\n");
+ }
+}
+
+void bld_pools(void)
+{
+ int i, j, ptr;
+ int flag[NR_CPUS] = { [ 0 ... NR_CPUS-1] = 0 };
+ unsigned long mask;
+
+ numpools = 0;
+ ptr = 0;
+ for (i = 0; i < smp_num_cpus; i++) {
+ if (!(cpu_online_map & (1<<i))) continue;
+ if (!flag[i]) {
+ pool_ptr[numpools]=ptr;
+ mask = 0UL;
+ for (j = 0; j < smp_num_cpus; j++) {
+ if (! (cpu_online_map & (1<<j))) continue;
+ if (i == j || CPU_TO_NODE(i) == CPU_TO_NODE(j)) {
+ pool_cpus[ptr++]=j;
+ flag[j]=1;
+ mask |= (1<<j);
+ }
+ }
+ pool_mask[numpools]=mask;
+ numpools++;
+ }
+ }
+ pool_ptr[numpools]=ptr;
+ pools_info();
+}

void __init init_idle(task_t *idle, int cpu)
{
@@ -1477,6 +1678,7 @@
idle->prio = MAX_PRIO;
idle->state = TASK_RUNNING;
idle->cpu = cpu;
+ idle->node = SAPICID_TO_NODE(cpu_physical_id(cpu));
double_rq_unlock(idle_rq, rq);
idle->need_resched = 1;
__restore_flags(flags);
@@ -1510,6 +1712,8 @@
// delimiter for bitsearch
__set_bit(MAX_PRIO, array->bitmap);
}
+ for (j = 0; j < NR_NODES; j++)
+ rq->nr_homenode[j]=0;
}
/*
* We have to do a little magic to get the first
diff -urN 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sys.c 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sys.c
--- 2.4.17-ia64-kdbv2.1-K3ia64/kernel/sys.c Fri Feb 8 12:02:06 2002
+++ 2.4.17-ia64-kdbv2.1-K3ia64-nod3/kernel/sys.c Mon Feb 25 17:56:55 2002
@@ -1205,6 +1205,8 @@
{
int error = 0;
int sig;
+ int pid;
+ struct task_struct *child;

switch (option) {
case PR_SET_PDEATHSIG:
@@ -1272,6 +1274,35 @@
}
current->keep_capabilities = arg2;
break;
+ case PR_GET_NODE:
+ pid = (int) arg3;
+ read_lock(&tasklist_lock);
+ child = find_task_by_pid(pid);
+ if (child) {
+ error = put_user(child->node,(int *)arg2);
+ } else {
+ printk("prctl: could not find process %d\n",pid);
+ error = -EINVAL;
+ }
+ read_unlock(&tasklist_lock);
+ break;
+ case PR_SET_NODE:
+ pid = (int) arg3;
+ read_lock(&tasklist_lock);
+ child = find_task_by_pid(pid);
+ if (child) {
+ if (child->uid == current->uid || \
+ current->uid == 0) {
+ printk("setting node of process %d to %d\n",pid,(int)arg2);
+ child->node = (int)arg2;
+ }
+ } else {
+ printk("prctl: could not find process %d\n",pid);
+ error = -EINVAL;
+ }
+ read_unlock(&tasklist_lock);
+ break;
+
default:
error = -EINVAL;
break;



2002-02-25 18:55:05

by Martin J. Bligh

Subject: Re: [Lse-tech] NUMA scheduling

> - The load_balancing() concept is different:
> - there are no special time intervals for balancing across pool
> boundaries, the need for this can occur very quickly and I
> have the feeling that 2*250ms is a long time for keeping the
> nodes unbalanced. This means: each time load_balance() is called
> it _can_ balance across pool boundaries (but doesn't have to).

Imagine for a moment that there's a short spike in workload on one node.
By aggressively balancing across nodes, won't you incur a high cost
in terms of migrating all the cache data to the remote node (destroying
the cache on both the remote and local node), when it would be cheaper
to wait for a few more ms and run on the local node? This is a
non-trivial problem to solve, and I'm not saying either approach is
correct, just that there are some disadvantages to being too aggressive.
Perhaps it's architecture dependent (I'm used to NUMA-Q, which has
caches on the interconnect, and a cache-miss access speed ratio of
about 20:1 remote:local).

> Would be interesting to hear oppinions on initial balancing. What are the
> pros and cons of balancing at do_fork() or do_execve()? And it would be
> interesting to learn about other approaches, too...

Presumably exec-time balancing is cheaper, since there are fewer shared
pages to be bounced around between nodes, but less effective if the main
load on the machine is one large daemon app, which just forks a few copies
of itself ... I would have thought that'd get sorted out a little later anyway
by the background rebalancing though?

M.

2002-02-25 19:03:55

by Larry McVoy

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, Feb 25, 2002 at 10:55:03AM -0800, Martin J. Bligh wrote:
> > - The load_balancing() concept is different:
> > - there are no special time intervals for balancing across pool
> > boundaries, the need for this can occur very quickly and I
> > have the feeling that 2*250ms is a long time for keeping the
> > nodes unbalanced. This means: each time load_balance() is called
> > it _can_ balance across pool boundaries (but doesn't have to).
>
> Imagine for a moment that there's a short spike in workload on one node.
> By agressively balancing across nodes, won't you incur a high cost
> in terms of migrating all the cache data to the remote node (destroying
> the cache on both the remote and local node), when it would be cheaper
> to wait for a few more ms, and run on the local node?

Great question! The answer is that you are absolutely right. SGI tried
a pile of things in this area, both on NUMA and on traditional SMPs (the
NUMA stuff was more page migration and the SMP stuff was more process
migration, but the problems are the same, you screw up the cache). They
never got the page migration to give them better performance while I was
there and I doubt they have today. And the process "migration" from CPU
to CPU didn't work either, people tended to lock processes to processors
for exactly the reason you alluded to.

If you read the early hardware papers on SMP, they all claim "Symmetric
Multi Processor", i.e., you can run any process on any CPU. Skip forward
3 years, now read the cache affinity papers from the same hardware people.
You have to step back and squint but what you'll see is that these papers
could be summarized in one sentence:

"Oops, we lied, it's not really symmetric at all"

You should treat each CPU as a mini system and think of a process reschedule
someplace else as a checkpoint/restart and assume that is heavy weight. In
fact, I'd love to see the scheduler code forcibly sleep the process for
500 milliseconds each time it lands on a different CPU. Tune the system
to work well with that, then take out the sleep, and you'll have the right
answer.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-02-25 19:26:39

by Davide Libenzi

Subject: Re: [Lse-tech] NUMA scheduling

On Mon, 25 Feb 2002, Larry McVoy wrote:

> On Mon, Feb 25, 2002 at 10:55:03AM -0800, Martin J. Bligh wrote:
> > > - The load_balancing() concept is different:
> > > - there are no special time intervals for balancing across pool
> > > boundaries, the need for this can occur very quickly and I
> > > have the feeling that 2*250ms is a long time for keeping the
> > > nodes unbalanced. This means: each time load_balance() is called
> > > it _can_ balance across pool boundaries (but doesn't have to).
> >
> > Imagine for a moment that there's a short spike in workload on one node.
> > By agressively balancing across nodes, won't you incur a high cost
> > in terms of migrating all the cache data to the remote node (destroying
> > the cache on both the remote and local node), when it would be cheaper
> > to wait for a few more ms, and run on the local node?
>
> Great question! The answer is that you are absolutely right. SGI tried
> a pile of things in this area, both on NUMA and on traditional SMPs (the
> NUMA stuff was more page migration and the SMP stuff was more process
> migration, but the problems are the same, you screw up the cache). They
> never got the page migration to give them better performance while I was
> there and I doubt they have today. And the process "migration" from CPU
> to CPU didn't work either, people tended to lock processes to processors
> for exactly the reason you alluded to.
>
> If you read the early hardware papers on SMP, they all claim "Symmetric
> Multi Processor", i.e., you can run any process on any CPU. Skip forward
> 3 years, now read the cache affinity papers from the same hardware people.
> You have to step back and squint but what you'll see is that these papers
> could be summarized on one sentence:
>
> "Oops, we lied, it's not really symmetric at all"
>
> You should treat each CPU as a mini system and think of a process reschedule
> someplace else as a checkpoint/restart and assume that is heavy weight. In
> fact, I'd love to see the scheduler code forcibly sleep the process for
> 500 milliseconds each time it lands on a different CPU. Tune the system
> to work well with that, then take out the sleep, and you'll have the right
> answer.

I made this test on 8-way NUMA machines (thx to OSDLAB). When a CPU went
idle I let it sample the status/load of the system at 100HZ frequency,
with a variable trigger time that fired a task steal if a certain load was
observed on the same CPU for a time > K ms. I tested it with kernel builds,
and surprisingly enough, having the idle CPU observe a load for more than
60ms before stealing was a performance loss on these machines. The moral
of the story is:

"Cache thrashing weighs, but wasted CPU time does too ..."

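Very roughly, the idle-side trigger looked like this (a simplified sketch,
not the code I actually ran; the names and the structure are just for
illustration):

#define SAMPLE_MS      10    /* 100HZ idle sampling */
#define STEAL_DELAY_MS 60    /* the threshold that turned out to be too long */

struct steal_watch {
    int watched_cpu;         /* CPU currently seen as loaded */
    int observed_ms;         /* how long that load has persisted */
};

/* called every SAMPLE_MS while this CPU is idle;
 * returns 1 when a task steal should be fired */
static int idle_sample(struct steal_watch *w, int busiest_cpu, int overloaded)
{
    if (!overloaded || busiest_cpu != w->watched_cpu) {
        /* load went away or moved: restart the observation window */
        w->watched_cpu = busiest_cpu;
        w->observed_ms = 0;
        return 0;
    }
    w->observed_ms += SAMPLE_MS;
    return w->observed_ms >= STEAL_DELAY_MS;
}

Anything much above that delay just left the idle CPU doing nothing for
too long.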



- Davide


2002-02-25 19:36:30

by Timothy D. Witham

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, 2002-02-25 at 11:03, Larry McVoy wrote:
Snip
>
> If you read the early hardware papers on SMP, they all claim "Symmetric
> Multi Processor", i.e., you can run any process on any CPU. Skip forward
> 3 years, now read the cache affinity papers from the same hardware people.
> You have to step back and squint but what you'll see is that these papers
> could be summarized on one sentence:
>
> "Oops, we lied, it's not really symmetric at all"
>

In the interests of historical accuracy: it was a case of underlying
technology changes. The original research was done with processors
that had 8 KB caches (total), with the processors' external and
internal bus speeds being the same, and those equal to the backplane
bus speed. Shift forward 4 years and you had 512 KB external caches,
and a ratio of over 20x between the external processor cache and
main memory. This resulted in a very different set of optimizations,
leading to the cache affinity schedulers.

Tim

> You should treat each CPU as a mini system and think of a process reschedule
> someplace else as a checkpoint/restart and assume that is heavy weight. In
> fact, I'd love to see the scheduler code forcibly sleep the process for
> 500 milliseconds each time it lands on a different CPU. Tune the system
> to work well with that, then take out the sleep, and you'll have the right
> answer.
> --
> ---
> Larry McVoy lm at bitmover.com http://www.bitmover.com/lm
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
Timothy D. Witham - Lab Director - [email protected]
Open Source Development Lab Inc - A non-profit corporation
15275 SW Koll Parkway - Suite H - Beaverton OR, 97006
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)

2002-02-25 19:43:10

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, 25 Feb 2002, Davide Libenzi wrote:

> On Mon, 25 Feb 2002, Larry McVoy wrote:
>
> > On Mon, Feb 25, 2002 at 10:55:03AM -0800, Martin J. Bligh wrote:
> > > > - The load_balancing() concept is different:
> > > > - there are no special time intervals for balancing across pool
> > > > boundaries, the need for this can occur very quickly and I
> > > > have the feeling that 2*250ms is a long time for keeping the
> > > > nodes unbalanced. This means: each time load_balance() is called
> > > > it _can_ balance across pool boundaries (but doesn't have to).
> > >
> > > Imagine for a moment that there's a short spike in workload on one node.
> > > By agressively balancing across nodes, won't you incur a high cost
> > > in terms of migrating all the cache data to the remote node (destroying
> > > the cache on both the remote and local node), when it would be cheaper
> > > to wait for a few more ms, and run on the local node?
> >
> > Great question! The answer is that you are absolutely right. SGI tried
> > a pile of things in this area, both on NUMA and on traditional SMPs (the
> > NUMA stuff was more page migration and the SMP stuff was more process
> > migration, but the problems are the same, you screw up the cache). They
> > never got the page migration to give them better performance while I was
> > there and I doubt they have today. And the process "migration" from CPU
> > to CPU didn't work either, people tended to lock processes to processors
> > for exactly the reason you alluded to.
> >
> > If you read the early hardware papers on SMP, they all claim "Symmetric
> > Multi Processor", i.e., you can run any process on any CPU. Skip forward
> > 3 years, now read the cache affinity papers from the same hardware people.
> > You have to step back and squint but what you'll see is that these papers
> > could be summarized on one sentence:
> >
> > "Oops, we lied, it's not really symmetric at all"
> >
> > You should treat each CPU as a mini system and think of a process reschedule
> > someplace else as a checkpoint/restart and assume that is heavy weight. In
> > fact, I'd love to see the scheduler code forcibly sleep the process for
> > 500 milliseconds each time it lands on a different CPU. Tune the system
> > to work well with that, then take out the sleep, and you'll have the right
> > answer.
>
> I made this test on 8 way NUMA machines ( thx to OSDLAB ). When a CPUs

s/NUMA/SMP/



- Davide


2002-02-25 19:54:30

by Bill Davidsen

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, 25 Feb 2002, Larry McVoy wrote:

> If you read the early hardware papers on SMP, they all claim "Symmetric
> Multi Processor", i.e., you can run any process on any CPU. Skip forward
> 3 years, now read the cache affinity papers from the same hardware people.
> You have to step back and squint but what you'll see is that these papers
> could be summarized on one sentence:
>
> "Oops, we lied, it's not really symmetric at all"
>
> You should treat each CPU as a mini system and think of a process reschedule
> someplace else as a checkpoint/restart and assume that is heavy weight. In
> fact, I'd love to see the scheduler code forcibly sleep the process for
> 500 milliseconds each time it lands on a different CPU. Tune the system
> to work well with that, then take out the sleep, and you'll have the right
> answer.

Unfortunately this is an overly simple view of how SMP works. The only
justification for delaying a move to another CPU is to preserve cache
contents. Trying to express this as a single number is bound to produce
suboptimal results.
Consider:

+ the benefit from staying on the same processor drops with the number of
cache lines reloaded, not with the time. On some systems you could probably
measure this; I don't know of any use currently made of it. Actually, given
varying cache sizes, what you care about is the lines not yet replaced,
still another thing (a big cache doesn't devalue as fast).

+ the cost of going to another processor is hardware and application
dependent. It depends on memory bandwidth and the size of the working set.
In addition, some hardware has several processors on a card, with
both individual and shared cache. And the Intel hyperthreading provides
two CPUs with totally shared cache (if I read the blurb right). This means
that the cost of going to one processor isn't always the same as going to
another.

+ tuning depends on what you want to optimize. If I have little memory and
few processes, having a process using memory but not getting CPU is bad; I
don't ever want a CPU idle when I don't have enough CPU to go around. I don't
even want a CPU idle if I have tons of memory: a switch is actually not all
THAT heavy an operation; if I have a small working set the cost is low, and
if I have a working set larger than the cache the benefit of affinity is
reduced. And if I have lots of memory and processes, not much CPU, and
limited memory bandwidth, then waiting for a CPU is fine, since it keeps
the CPUs running as fast as possible.

I believe there is room for improvement here, but I don't think setting
affinity to some large value and then tuning the system to work well with
that is even possible. Given the benefits in responsiveness I see with the
low-latency and preempt changes, I think it would lead to a dog-slow system
no matter how you tune it.

To paraphrase one of my old sig files, simplify the algorithm as much as
possible. Then stop.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-25 20:03:10

by Larry McVoy

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, Feb 25, 2002 at 02:49:40PM -0500, Bill Davidsen wrote:
> On Mon, 25 Feb 2002, Larry McVoy wrote:
>
> > If you read the early hardware papers on SMP, they all claim "Symmetric
> > Multi Processor", i.e., you can run any process on any CPU. Skip forward
> > 3 years, now read the cache affinity papers from the same hardware people.
> > You have to step back and squint but what you'll see is that these papers
> > could be summarized on one sentence:
> >
> > "Oops, we lied, it's not really symmetric at all"
> >
> > You should treat each CPU as a mini system and think of a process reschedule
> > someplace else as a checkpoint/restart and assume that is heavy weight. In
> > fact, I'd love to see the scheduler code forcibly sleep the process for
> > 500 milliseconds each time it lands on a different CPU. Tune the system
> > to work well with that, then take out the sleep, and you'll have the right
> > answer.
>
> Unfortunately this is an overly simple view of how SMP works. The only
> justification for CPU latency is to preserve cache contents. Trying to
> express this as a single number is bound to produce suboptimal results.

And here is the other side of the coin. Remember what we are doing.
We're in the middle of a context switch, trying to figure out where we
should run this process. We would like context switches to be fast.
Any work we do here is at direct odds with our goals. SGI took the
approach that your statements would imply, i.e., approximate the
cache footprint, figure out if it was big or small, and use that to
decide where to land the process. This has two fatal flaws:
a) Because there is no generic hardware interface to say "how many cache
lines are mine", you approximate that by looking at how much of its
timeslice the process used; if it used a lot, you guess it filled the
cache (see the sketch below this list). This doesn't work at all for
I/O bound processes, which typically run in short bursts. So IRIX would
bounce these around for no good reason, resulting in crappy I/O perf.
I got about another 20% in BDS by locking down the processes (BDS
delivered 3.2GBytes/sec of NFS traffic, sustained, in 1996).
b) all of the "thinking" you do to figure out where to land the process
contributes directly to the cost of the context switch. Linux has
nice light context switches; let's keep it that way.
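To make (a) concrete, the guess boils down to something like this (my
from-memory sketch of the idea, not IRIX source; the names and the 3/4
threshold are invented):

struct footprint_hint {
    int ticks_used;   /* ticks the task ran in its last timeslice */
    int timeslice;    /* ticks it was granted */
};

/* "did this task probably fill the cache?" -- pure guesswork */
static int probably_cache_hot(const struct footprint_hint *t)
{
    /* ran for most of its slice => assume a big, warm footprint */
    return 4 * t->ticks_used >= 3 * t->timeslice;
}

An I/O bound task that runs in short bursts always looks "cold" by a test
like this, which is exactly why those tasks got bounced around.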

Summary: SGI managed to get optimal usage out of their caches for long
running, CPU bound fortran jobs at the expense of time sharing and
I/O jobs. I'm happy to let SGI win in that space.
--
---
Larry McVoy lm at bitmover.com http://www.bitmover.com/lm

2002-02-25 20:16:44

by Davide Libenzi

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, 25 Feb 2002, Larry McVoy wrote:

> On Mon, Feb 25, 2002 at 02:49:40PM -0500, Bill Davidsen wrote:
> > On Mon, 25 Feb 2002, Larry McVoy wrote:
> >
> > > If you read the early hardware papers on SMP, they all claim "Symmetric
> > > Multi Processor", i.e., you can run any process on any CPU. Skip forward
> > > 3 years, now read the cache affinity papers from the same hardware people.
> > > You have to step back and squint but what you'll see is that these papers
> > > could be summarized on one sentence:
> > >
> > > "Oops, we lied, it's not really symmetric at all"
> > >
> > > You should treat each CPU as a mini system and think of a process reschedule
> > > someplace else as a checkpoint/restart and assume that is heavy weight. In
> > > fact, I'd love to see the scheduler code forcibly sleep the process for
> > > 500 milliseconds each time it lands on a different CPU. Tune the system
> > > to work well with that, then take out the sleep, and you'll have the right
> > > answer.
> >
> > Unfortunately this is an overly simple view of how SMP works. The only
> > justification for CPU latency is to preserve cache contents. Trying to
> > express this as a single number is bound to produce suboptimal results.
>
> And here is the other side of the coin. Remember what we are doing.
> We're in the middle of a context switch, trying to figure out where we
> should run this process. We would like context switches to be fast.
> Any work we do here is at direct odds with our goals. SGI took the
> approach that your statements would imply, i.e., approximate the
> cache footprint, figure out if it was big or small, and use that to
> decide where to land the process. This has two fatal flaws:
> a) Because there is no generic hardware interface to say "how many cache
> lines are mine", you approximate that by looking at how much of the
> process timeslice this process used, if it used a lot, you guess it
> filled the cache. This doesn't work at all for I/O bound processes,
> who typically run in short bursts. So IRIX would bounce these around
> for no good reason, resulting in crappy I/O perf. I got about another
> 20% in BDS by locking down the processes (BDS delivered 3.2GBytes/sec
> of NFS traffic, sustained, in 1996).
> b) all of the "thinking" you do to figure out where to land the process
> contributes directly to the cost of the context switch. Linux has
> nice light context switches, let's keep it that way.

Obviously, Larry, you do not want to do such work on an active CPU. Idle
time is balancing time ...




- Davide


2002-02-25 23:35:40

by Andy Pfiffer

[permalink] [raw]
Subject: Re: [Lse-tech] [rebalance at: do_fork() vs. do_execve()] NUMA scheduling

> > Would be interesting to hear oppinions on initial balancing. What are the
> > pros and cons of balancing at do_fork() or do_execve()? And it would be
> > interesting to learn about other approaches, too...

I worked on a system several years ago that supported single-system
image and shared no memory between nodes (NORMA == NO Remote Memory
Access), but did have a very high performance, low-latency interconnect
(100's of megabytes/sec, a few 10's of ns latency for com startup).

The ratios between CPU Clock Rate / Local Memory / Offboard Memory were
(at a gross level) similar to today's GHz CPU's with on-chip L1,
off-die L2, local dram, wires + state machines + dram on some other
node.

There was initially much debate about load balancing at fork time or at
exec time (static), followed by when and how often to rebalance already
running processes (dynamic).

We eventually chose to statically balance at exec time, using a possibly
stale metric, because we wouldn't have to spend time to create address
space remotely (parent on node A, child on node B), only to have it torn
down a few clocks later by a subsequent exec. (Our workload consisted
almost entirely of fork+exec rather than fork+fork+fork... ).

The analogy here is that committing modified dcache lines owned by CPU A
and reheating the cache with them on CPU B, only to throw them away on an
exec a few clocks later, may be similar to the "rfork() vs. rexec()"
choice we faced.

If you rebalance at do_exec(), you are starting with an empty working set
and a cold cache.

To balance at exec time, we used a separate process that would
periodically query (via a spanning tree) nodes within the "interactive
partition" for their current load average, compute which was the least
loaded (a heuristic that used a combination of factors: cpu utilization,
# of processes, memory utilization, is there a shell over there, etc.),
and then update the nodes (again via a spanning tree) with the current
global opinion as to the least loaded node.
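In rough C, the leveler's core was something along these lines (heavily
simplified from memory; the weights and names are purely illustrative):

struct node_load {
    int cpu_util;    /* 0..100 */
    int nr_procs;
    int mem_util;    /* 0..100 */
    int has_shell;   /* interactive shell present on that node? */
};

/* lower score == better exec target; the weights here are made up */
static int load_score(const struct node_load *n)
{
    return 2 * n->cpu_util + 4 * n->nr_procs + n->mem_util +
           (n->has_shell ? 10 : 0);
}

/* pick the node to publish (via the spanning tree) as "least loaded" */
static int least_loaded(const struct node_load *nodes, int nr_nodes)
{
    int i, best = 0;

    for (i = 1; i < nr_nodes; i++)
        if (load_score(&nodes[i]) < load_score(&nodes[best]))
            best = i;
    return best;
}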

In the context of this discussion, computing the loading metric at
regular intervals *when otherwise idle* would appear to be similar to
the approach we used.

The static inter-node load leveling worked pretty well in practice
(although some said it wasn't much better than pure round-robin across
nodes), and it was non-fatal if the initial node selection was a poor
choice.

The main problem was minimizing the storm of inbound rexec()'s when a
sudden burst of activity (say, with a make -j) on multiple nodes at once
could cause N-1 nodes to throw *everything* at the least loaded node. I
don't think this would be a problem in this case because there is a
single entity making a single choice, not multiple entities making
multiple choices in isolation.

Dynamic load leveling (moving an entire process from one node to
another) was always problematic for highly interactive workloads and a
rash of complexity issues well off topic from this discussion, but
worked well for long-running CPU bound tasks.

Andy




2002-02-26 05:19:11

by Bill Davidsen

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, 25 Feb 2002, Larry McVoy wrote:

> On Mon, Feb 25, 2002 at 02:49:40PM -0500, Bill Davidsen wrote:
> > Unfortunately this is an overly simple view of how SMP works. The only
> > justification for CPU latency is to preserve cache contents. Trying to
> > express this as a single number is bound to produce suboptimal results.
>
> And here is the other side of the coin. Remember what we are doing.
> We're in the middle of a context switch, trying to figure out where we
> should run this process. We would like context switches to be fast.

I hope we're not in the middle of a context switch... hopefully any
decision to move a process is done either (a) during load balancing, or
(b) when you have an idle CPU which needs work to do. Otherwise why
consider changing CPU?

I would think the place to do the work is when it needs doing, not on
every context switch. The CPU selection is costly, as opposed to "which
process to run." Of course when a process moves from blocked to ready it
might be time to consider how long it's been waiting and if anything in
the cache is worth using.

> Any work we do here is at direct odds with our goals. SGI took the
> approach that your statements would imply,

I wasn't really implying anything except that the problem is more complex
than "set the affinity to 500ms and tune." You can't tune that; the system
will be unresponsive. Since complex decisions would be made infrequently, it
should be better to do them right than to do the wrong thing quickly.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-26 10:33:45

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

On Mon, 25 Feb 2002, Martin J. Bligh wrote:

> > - The load_balancing() concept is different:
> > - there are no special time intervals for balancing across pool
> > boundaries, the need for this can occur very quickly and I
> > have the feeling that 2*250ms is a long time for keeping the
> > nodes unbalanced. This means: each time load_balance() is called
> > it _can_ balance across pool boundaries (but doesn't have to).
>
> Imagine for a moment that there's a short spike in workload on one node.
> By agressively balancing across nodes, won't you incur a high cost
> in terms of migrating all the cache data to the remote node (destroying
> the cache on both the remote and local node), when it would be cheaper
> to wait for a few more ms, and run on the local node? This is a
> non-trivial problem to solve, and I'm not saying either approach is
> correct, just that there are some disadvantages of being too agressive.
> Perhaps it's architecture dependant (I'm used to NUMA-Q, which has
> caches on the interconnect, and a cache-miss access speed ratio of
> about 20:1 remote:local).

Well, maybe my description was a little bit misleading. My approach does not
balance much more aggressively; the difference is actually minimal,
e.g. for 1ms ticks:

Mike's approach:
- idle CPU : load_balance() every 1ms (only within local node)
balance_cpu_sets() every 2ms (balance across nodes)
- busy CPU : load_balance() every 250ms
balance_cpu_sets() every 500ms
- schedule() : load_balance() if idle (only within local node)

Erich's approach:
- idle CPU : load_balance() every 1ms (first try balancing the local
node, if already balanced (no CPU exceeds the
current load by >25%) try to find a remote
node with larger load than on the current one
(>25% again)).
- busy CPU : load_balance() every 250ms (same comment as above)
- schedule() : load_balance() if idle (same comment as above).

So the functional difference is not really that big here; I am also trying
to balance locally first. If that fails (no imbalance), I try
globally. The factor of 2 in the times is not so relevant, I think, and
I don't consider my approach significantly more aggressive.
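In sketch form, the decision made at each load_balance() call is roughly
this (not the actual code; the flat per-node load array and all names are
just for illustration):

#define NR_NODES 4

/* load[] holds per-node load as seen by this CPU, e.g. summed nr_running;
 * returns the node to steal from, or -1 to stay put */
static int pick_source_node(const int load[NR_NODES], int this_node,
                            int local_imbalanced)
{
    int i, busiest = -1;

    if (local_imbalanced)
        return this_node;   /* balance within the local node first */

    for (i = 0; i < NR_NODES; i++) {
        if (i == this_node)
            continue;
        /* a remote node must exceed our load by more than 25% */
        if (load[i] * 4 > load[this_node] * 5 &&
            (busiest < 0 || load[i] > load[busiest]))
            busiest = i;
    }
    return busiest;         /* -1: everything balanced, do nothing */
}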

More significant is the difference in the data used for the balance
decision:

Mike: calculate load of a particular cpu set in the corresponding
load_balance() call.
Advantage: cheap (if spinlocks don't hurt there)
Disadvantage: for busy CPUs it can be really old (250ms)

Erich: calculate load when needed, at the load_balance() call, but not
more than needed (normally only local node data, global data if needed,
all lockless).
Advantage: fresh, lockless
Disadvantage: sometimes slower (when balancing across nodes)

As Mike has mainly the cache affinity in mind, it doesn't really matter
where a task is scheduled as long as it stays there long enough and the
nodes are well balanced. A wrong scheduling decision (based on old
data) will be fixed sooner or later (after x*250ms or so).

My problem is that I need to decide on the home node of a process and will
allocate all of its memory on that node. Therefore the decision must be as
good as possible; I really want to keep the task as close to the home node
as possible. If a task gets away from its home node (because of imbalance
between the nodes) the scheduler will try to return it. Of course this
should give us less load on the crossbar between the nodes and the best
memory access latency.


> Presumably exec-time balancing is cheaper, since there are fewer shared
> pages to be bounced around between nodes, but less effective if the main
> load on the machine is one large daemon app, which just forks a few copies
> of itself ... I would have though that'd get sorted out a little later anyway
> by the background rebalancing though?

OK, thanks. I agree with the first part of your reply. The last sentence
is true for Mike's approach but a bit more complicated with the boundary
condition of a process having its memory allocated on a particular node...

Thank you very much for your comments!

Best regards,
Erich

2002-02-26 15:59:52

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

> Well, maybe my description was a little bit misleading. My approach is not
> balancing much more aggressively, the difference is actually minimal,
> e.g. for 1ms ticks:
>
> Mike's approach:
> - idle CPU : load_balance() every 1ms (only within local node)
> balance_cpu_sets() every 2ms (balance across nodes)
> - busy CPU : load_balance() every 250ms
> balance_cpu_sets() every 500ms
> - schedule() : load_balance() if idle (only within local node)
>
> Erich's approach:
> - idle CPU : load_balance() every 1ms (first try balancing the local
> node, if already balanced (no CPU exceeds the
> current load by >25%) try to find a remote
> node with larger load than on the current one
> (>25% again)).
> - busy CPU : load_balance() every 250ms (same comment as above)
> - schedule() : load_balance() if idle (same comment as above).

Actually, if I understand what you're doing correctly, what
you really want is to only shift when the load average on
the source cpu (the one you're shifting from) has been
above "limit" for a specified amount of time, rather than
making the interval at which we inspect longer (in order to
avoid bouncing stuff around unnecessarily).

> So the functional difference is not really that big here, I am also trying
> to balance locally first. If that fails (no imbalance), I try
> globally. The factor of 2 in the times is not so relevant, I think, and
> also I don't consider my approach significantly more aggressive.

When you say "balance", are you trying to balance the number of
tasks in each queue, or the % of time each CPU is busy? It would
seem OK to me to have 100 IO bound tasks on one node, and 4 cpu
bound tasks on another (substitute "CPU" for "node" above at will).

In fact, shifting around IO bound tasks will win you far less than
moving around CPU bound tasks in general.

> More significant is the difference in the data used for the balance
> decision:
>
> Mike: calculate load of a particular cpu set in the corresponding
> load_balance() call.
> Advantage: cheap (if spinlocks don't hurt there)
> Disadvantage: for busy CPUs it can be really old (250ms)
>
> Erich: calculate load when needed, at the load_balance() call, but not
> more than needed (normally only local node data, global data if needed,
> all lockless).
> Advantage: fresh, lockless
> Disadvantage: sometimes slower (when balancing across nodes)

How are you calculating load? Load average? Over what period of time?
Please forgive my idleness in not going and reading the code ;-)

> As Mike has mainly the cache affinity in mind, it doesn't really matter
> where a task is scheduled as long as it stays there long enough and the
> nodes are well balanced. A wrong scheduling decision (based on old
> data) will be fixed sooner or later (after x*250ms or so).

Not sure I understand that - the wrong decision will cost you two
moves, blowing not only the cache for your current process, but
also whoever else's cache you accidentally trampled upon (which
admittedly might be nothing useful).

>> Presumably exec-time balancing is cheaper, since there are fewer shared
>> pages to be bounced around between nodes, but less effective if the main
>> load on the machine is one large daemon app, which just forks a few
>> copies of itself ... I would have though that'd get sorted out a little
>> later anyway by the background rebalancing though?
>
> OK, thanks. I agree with the first part of your reply. The last sentence
> is true for Mike's approach but a bit more complicated with the boundary
> condition of a process having its memory allocated on a particular node...

Ah, now I see why you're doing what you're doing ;-) Maybe we're
getting into the realm of processes providing hints to the kernel
at fork time ... I need to think on this one some more ...

Thanks,

Martin.

2002-02-26 19:04:32

by Mike Kravetz

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

Erich,

I'm glad to see you are also exploring NUMA scheduling. The
more the merrier.

On Tue, Feb 26, 2002 at 11:33:14AM +0100, Erich Focht wrote:
>
> Well, maybe my description was a little bit misleading. My approach is not
> balancing much more aggressively, the difference is actually minimal,
> e.g. for 1ms ticks:
>
> Mike's approach:
> - idle CPU : load_balance() every 1ms (only within local node)
> balance_cpu_sets() every 2ms (balance across nodes)
> - busy CPU : load_balance() every 250ms
> balance_cpu_sets() every 500ms
> - schedule() : load_balance() if idle (only within local node)
>
> Erich's approach:
> - idle CPU : load_balance() every 1ms (first try balancing the local
> node, if already balanced (no CPU exceeds the
> current load by >25%) try to find a remote
> node with larger load than on the current one
> (>25% again)).
> - busy CPU : load_balance() every 250ms (same comment as above)
> - schedule() : load_balance() if idle (same comment as above).
>
> So the functional difference is not really that big here, I am also trying
> to balance locally first. If that fails (no imbalance), I try
> globally. The factor of 2 in the times is not so relevant, I think, and
> also I don't consider my approach significantly more aggressive.

My factor of 'two' is really a 'distance' factor. My thoughts were
along the lines that node rebalancing would occur at different rates
based on the distance between nodes. On the Sequent NUMA-Q machines
with really high remote memory latencies, you might want to be less
aggressive than on machines with lower latencies. I was playing with
the idea that you would discover distances during topology discovery,
and the rate of rebalancing would somehow correspond to these distances.
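Something like the following is what I had in mind (illustrative only; the
base interval and how distances get filled in are hypothetical):

#define BASE_BALANCE_MS 250   /* busy-CPU intra-node balance interval */

/* distance: 1 = nearby node, larger = higher remote memory latency,
 * as it would eventually be reported by topology discovery */
static int cross_set_interval_ms(int distance)
{
    return BASE_BALANCE_MS * 2 * distance;
}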

> More significant is the difference in the data used for the balance
> decision:
>
> Mike: calculate load of a particular cpu set in the corresponding
> load_balance() call.
> Advantage: cheap (if spinlocks don't hurt there)
> Disadvantage: for busy CPUs it can be really old (250ms)
>
> Erich: calculate load when needed, at the load_balance() call, but not
> more than needed (normally only local node data, global data if needed,
> all lockless).
> Advantage: fresh, lockless
> Disadvantage: sometimes slower (when balancing across nodes)
>
> As Mike has mainly the cache affinity in mind, it doesn't really matter
> where a task is scheduled as long as it stays there long enough and the
> nodes are well balanced. A wrong scheduling decision (based on old
> data) will be fixed sooner or later (after x*250ms or so).

Agreed. I also played with the idea of keeping a load average over
time, which seems to be something we may want. However, I couldn't
think of an efficient way to accomplish this, and my first attempts
showed little promise. Perhaps I will investigate this more.
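One cheap form of it would be a per-runqueue decaying average updated from
the timer tick, roughly like this (a sketch only, nothing I have measured
seriously; the constants are arbitrary):

#define LOADAVG_SHIFT 8   /* fixed point, 8 fractional bits */
#define LOADAVG_DECAY 3   /* new = old - old/8 + sample/8 */

static inline unsigned long loadavg_tick(unsigned long avg,
                                         unsigned long nr_running)
{
    avg -= avg >> LOADAVG_DECAY;
    avg += (nr_running << LOADAVG_SHIFT) >> LOADAVG_DECAY;
    return avg;   /* avg >> LOADAVG_SHIFT approximates the mean load */
}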

--
Mike

2002-02-27 16:57:37

by Erich Focht

[permalink] [raw]
Subject: Re: [Lse-tech] NUMA scheduling

> > Erich's approach:
> > - idle CPU : load_balance() every 1ms (first try balancing the local
> > node, if already balanced (no CPU exceeds the
> > current load by >25%) try to find a remote
> > node with larger load than on the current one
> > (>25% again)).
> > - busy CPU : load_balance() every 250ms (same comment as above)
> > - schedule() : load_balance() if idle (same comment as above).
>
> Actually, if I understand what you're doing correctly, what
> you really want is to only shift when the load average on
> the source cpu (the one you're shifting from) has been
> above "limit" for a specified amount of time, rather than
> making the interval when we inspect longer (in order to
> avoid bouncing stuff around unnecessarily).

I just tried to keep Ingo's original time intervals for load
balancing, and the 25% (which he imposes) was taken as the balance condition
for the nodes, too. The delayed balancing is not implemented, but that
would be very nice to have. Davide Libenzi was using that in his scheduler
(incrementing a counter for a cpu which was busier than the current one,
resetting it when it wasn't, and selecting that cpu if the counter reached
some number larger than the "distance" between the cpus). He was balancing
only when the cpu went idle; I'm not sure what the corresponding
treatment for busy balancing would be.
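As far as I understood it, the mechanism is roughly this (my sketch of the
idea, not Davide's code; the names are invented):

/* one counter per (this_cpu, other_cpu) pair, checked when idle */
static int consider_steal(int *counter, int other_is_busier, int distance)
{
    if (!other_is_busier) {
        *counter = 0;          /* imbalance went away, forget it */
        return 0;
    }
    /* steal only once the imbalance has persisted for longer than
     * the "distance" between the two cpus */
    return ++(*counter) > distance;
}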

> When you say "balance", are you trying to balance the number of
> tasks in each queue, or the % of time each CPU is busy? It would
> seem OK to me to have 100 IO bound tasks on one node, and 4 cpu
> bound tasks on another (substitute "CPU" for "node" above at will).

Just as primitive as rq->nr_running (or the minimum of that value and the
one at the previous balance tick, as Ingo implemented it).
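I.e. roughly this (a sketch of the idea; the bookkeeping in the actual
patch differs):

/* ignore one-tick spikes: use the smaller of the current and the
 * previous balance tick's runnable count as the CPU's load */
static inline int cpu_load(int nr_running, int prev_nr_running)
{
    return nr_running < prev_nr_running ? nr_running : prev_nr_running;
}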

> In fact, shifting around IO bound tasks will win you far less than
> moving around CPU bound tasks in general.

Agreed. Not quite sure how to distinguish the IO bound tasks currently;
maybe somehow by their priority? Anyhow, tasks with a state not equal to
TASK_RUNNING are deactivated by schedule() and don't show up in
rq->nr_running, so they can't be migrated by the load balancer either. So
there's no problem with shells, etc...


> How are you calculating load? Load average? Over what period of time?
> Please forgive my idleness in not going and reading the code ;-)

Again: it's just nr_running (minimum of last two balance ticks).


> > As Mike has mainly the cache affinity in mind, it doesn't really matter
> > where a task is scheduled as long as it stays there long enough and the
> > nodes are well balanced. A wrong scheduling decision (based on old
> > data) will be fixed sooner or later (after x*250ms or so).
>
> Not sure I understand that - the wrong decision will cost you two
> moves, blowing not only the cache for your current process, but
> also whoever else's cache you accidentally trampled upon (which
> admittedly might be nothing useful).

Yes, you are right. That's why I think that the relatively old data which
Mike uses (can be 250ms old!) could lead to such a degradation.

Thanks,

best regards,

Erich