> Erich,
>
> I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
> of the time working with the first three patches. What I found was
> that rebalancing was happening too much between nodes. I tried a
> few things to change this, but have not yet settled on the best
> approach. A key item to work with is the check in find_busiest_node
> to determine if the found node is busier enough to warrant stealing
> from it. Currently the check is that the node has 125% of the load
> of the current node. I think that, for my system at least, we need
> to add in a constant to this equation. I tried using 4 and that
> helped a little.
Michael,
in:
+static int find_busiest_node(int this_node)
+{
+ int i, node = this_node, load, this_load, maxload;
+
+ this_load = maxload = atomic_read(&node_nr_running[this_node]);
+ for (i = 0; i < numnodes; i++) {
+ if (i == this_node)
+ continue;
+ load = atomic_read(&node_nr_running[i]);
+ if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+ maxload = load;
+ node = i;
+ }
+ }
+ return node;
+}
You changed ((5*4*this_load)/4) to:
(5*4*(this_load+4)/4)
or
(4+(5*4*(this_load)/4)) ?
We def need some constant to avoid low load ping pong, right?
> Finally I added in the 04 patch, and that helped
> a lot. Still, there is too much process movement between nodes.
perhaps increase INTERNODE_LB?
-Andrew Theurer
>> I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
>> of the time working with the first three patches. What I found was
>> that rebalancing was happening too much between nodes. I tried a
>> few things to change this, but have not yet settled on the best
>> approach. A key item to work with is the check in find_busiest_node
>> to determine if the found node is busier enough to warrant stealing
>> from it. Currently the check is that the node has 125% of the load
>> of the current node. I think that, for my system at least, we need
>> to add in a constant to this equation. I tried using 4 and that
>> helped a little.
>
> Michael,
>
> in:
>
> +static int find_busiest_node(int this_node)
> +{
> + int i, node = this_node, load, this_load, maxload;
> +
> + this_load = maxload = atomic_read(&node_nr_running[this_node]);
> + for (i = 0; i < numnodes; i++) {
> + if (i == this_node)
> + continue;
> + load = atomic_read(&node_nr_running[i]);
> + if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> + maxload = load;
> + node = i;
> + }
> + }
> + return node;
> +}
>
> You changed ((5*4*this_load)/4) to:
> (5*4*(this_load+4)/4)
> or
> (4+(5*4*(this_load)/4)) ?
>
> We def need some constant to avoid low load ping pong, right?
>
>> Finally I added in the 04 patch, and that helped
>> a lot. Still, there is too much process movement between nodes.
>
> perhaps increase INTERNODE_LB?
Before we tweak this too much, how about using the global load
average for this? I can envisage a situation where we have two
nodes with 8 tasks per node, one with 12 tasks, and one with four.
You really don't want the ones with 8 tasks pulling stuff from
the 12 ... only for the least loaded node to start pulling stuff
later.
What about if we take the global load average, and multiply by
num_cpus_on_this_node / num_cpus_globally ... that'll give us
roughly what we should have on this node. If we're significantly
underloaded compared to that, we start pulling stuff from
the busiest node? And you get the damping over time for free.
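Roughly, as an untested sketch (all the helper names and the margin
below are made up, not anything in the tree; only node_nr_running[] is
from the current patches):

        /* is this node significantly below its fair share of the load? */
        static int node_underloaded(int this_node)
        {
                int fair_share, this_load;

                /* fair share = global load scaled by our fraction of CPUs */
                fair_share = global_load_avg() * node_nr_cpus(this_node)
                                                / num_online_cpus();
                this_load = atomic_read(&node_nr_running[this_node]);

                /* pull from the busiest node only if clearly below target */
                return this_load + UNDERLOAD_MARGIN < fair_share;
        }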
I think it'd be best if we stay fairly conservative for this, all
we're trying to catch is the corner case where a bunch of stuff
has forked, but not execed, and we have a long term imbalance.
Aggressive rebalance might win a few benchmarks, but you'll just
cause inter-node task bounce on others.
M.
On Mon, 2003-01-13 at 20:45, Andrew Theurer wrote:
> > Erich,
> >
> > I played with this today on my 4 node (16 CPU) NUMAQ. Spent most
> > of the time working with the first three patches. What I found was
> > that rebalancing was happening too much between nodes. I tried a
> > few things to change this, but have not yet settled on the best
> > approach. A key item to work with is the check in find_busiest_node
> > to determine if the found node is busier enough to warrant stealing
> > from it. Currently the check is that the node has 125% of the load
> > of the current node. I think that, for my system at least, we need
> > to add in a constant to this equation. I tried using 4 and that
> > helped a little.
>
> Michael,
>
> in:
>
> +static int find_busiest_node(int this_node)
> +{
> + int i, node = this_node, load, this_load, maxload;
> +
> + this_load = maxload = atomic_read(&node_nr_running[this_node]);
> + for (i = 0; i < numnodes; i++) {
> + if (i == this_node)
> + continue;
> + load = atomic_read(&node_nr_running[i]);
> + if (load > maxload && (4*load > ((5*4*this_load)/4))) {
> + maxload = load;
> + node = i;
> + }
> + }
> + return node;
> +}
>
> You changed ((5*4*this_load)/4) to:
> (5*4*(this_load+4)/4)
> or
> (4+(5*4*(this_load)/4)) ?
I suppose I should not have been so dang lazy and should have
cut-n-pasted the line I changed. The change was (((5*4*this_load)/4) + 4),
which should be the same as your second choice.
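Spelled out, the test in find_busiest_node() now reads (just restating
that change, with the original 4*load scaling kept):

        /* steal only if the other node carries more than 125% of our
         * load plus a small constant, so tiny imbalances don't cause
         * cross-node ping-pong */
        if (load > maxload && (4*load > ((5*4*this_load)/4) + 4)) {
                maxload = load;
                node = i;
        }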
>
> We def need some constant to avoid low load ping pong, right?
Yep. Without the constant, one could have 6 processes on node
A and 4 on node B, and node B would end up stealing. While that
makes for a perfect balance, the expense of the off-node traffic does
not justify it. At least on the NUMAQ box. It might be justified
for a different NUMA architecture, which is why I propose putting
this check in a macro that can be defined in topology.h for each
architecture.
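Something along these lines, purely as an illustration (the macro name
is made up):

        /* asm-i386/topology.h, NUMA-Q flavour: a node only counts as
         * busier if it exceeds 125% of our load plus a fixed offset */
        #define NODE_BUSIER(load, this_load) \
                (4*(load) > ((5*4*(this_load))/4) + 4)

find_busiest_node() would then just test
load > maxload && NODE_BUSIER(load, this_load), and other architectures
could define a cheaper or stricter version of the macro.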
>
> > Finally I added in the 04 patch, and that helped
> > a lot. Still, there is too much process movement between nodes.
>
> perhaps increase INTERNODE_LB?
That is on the list to try. Martin was mumbling something about
using the system-wide load average to help make the inter-node
balance decision. I'd like to give that a try before tweaking
INTERNODE_LB.
>
> -Andrew Theurer
>
Michael Hohnbaum
[email protected]
Hi Martin,
> Before we tweak this too much, how about using the global load
> average for this? I can envisage a situation where we have two
> nodes with 8 tasks per node, one with 12 tasks, and one with four.
> You really don't want the ones with 8 tasks pulling stuff from
> the 12 ... only for the least loaded node to start pulling stuff
> later.
Hmmm, yet another idea from the old NUMA scheduler coming back,
therefore it has my full support ;-). Though we can't do it as I did
it there: in the old NUMA approach every cross-node steal was delayed,
by only 1-2ms if our own node was underloaded, and by a lot more if
our own node's load was average or above. We might eventually need to
steal something even if we have "above average" load, because the
cpu mask of the tasks on the overloaded node might only allow them to
migrate to us... But this is also a special case which we should solve
later.
> What about if we take the global load average, and multiply by
> num_cpus_on_this_node / num_cpus_globally ... that'll give us
> roughly what we should have on this node. If we're significantly
> out underloaded compared to that, we start pulling stuff from
> the busiest node? And you get the damping over time for free.
Patch 05 goes in this direction, but the constants I chose
were very aggressive. I'll update the whole set of patches and try to
send out something today.
Best regards,
Erich
Here's the new version of the NUMA scheduler built on top of the
miniature scheduler of Martin. I incorporated Michael's ideas and
Christoph's suggestions and rediffed for 2.5.58.
The whole patch is really tiny: 9.5k. This time I attached the numa
scheduler in the form of two patches:
numa-sched-2.5.58.patch (7k) : components 01, 02, 03
numa-sched-add-2.5.58.patch (3k) : components 04, 05
The single components are also attached in a small tgz archive:
01-minisched-2.5.58.patch : the miniature scheduler from
Martin. Balances strictly within a node. Removed the
find_busiest_in_mask() function.
02-initial-lb-2.5.58.patch : Michael's initial load balancer at
exec(). Cosmetic corrections suggested by Christoph.
03-internode-lb-2.5.58.patch : internode load balancer core. Called
after NODE_BALANCE_RATE calls of the intra-node load balancer. Tunable
parameters:
NODE_BALANCE_RATE (default: 10)
NODE_THRESHOLD (default: 125) : consider only nodes with load
above NODE_THRESHOLD/100 * own_node_load
I added the constant factor of 4 suggested by Michael, but I'm not
really happy with it. This should be nr_cpus_in_node, but we don't
have that info in topology.h
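(One way to approximate it without a new topology macro, as an untested
idea: count the bits of the existing CPU mask, e.g.

        /* population count of the node's CPU mask, linux/bitops.h */
        #define nr_cpus_in_node(node)  hweight_long(__node_to_cpu_mask(node))

so the offset could scale with the node size instead of being a fixed 4.)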
04-smooth-node-load-2.5.58.patch : The node load measure is smoothed
by adding half of the previous node load (and 1/4 of the one before,
etc..., as discussed in the LSE call). This should improve the
behavior a bit in the case of short-lived load peaks and avoid bouncing tasks
between nodes.
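In code the smoothing boils down to roughly this per-node step in
find_busiest_node() (a sketch, not the literal hunk):

        /* decaying average: current count plus half of the previous
         * value, which itself carries 1/4 of the one before, etc. */
        load = (this_rq()->prev_node_load[i] >> 1)
                + atomic_read(&node_nr_running[i]);
        this_rq()->prev_node_load[i] = load;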
05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATE
interval (between cross-node balancer calls) by a variable
node-specific interval. Currently only two values used:
NODE_BALANCE_MIN : 10
NODE_BALANCE_MAX : 40
If the node load is less than avg_node_load/2, we switch to
NODE_BALANCE_MIN, otherwise we use the large interval.
I also added a function to reduce the number of tasks stolen from
remote nodes.
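The interval selection itself is roughly (sketch; this_node_load and
avg_node_load stand for the smoothed per-node load and its average over
all nodes):

        /* underloaded nodes look for remote work more often */
        if (2 * this_node_load < avg_node_load)
                interval = NODE_BALANCE_MIN;    /* 10: rebalance sooner */
        else
                interval = NODE_BALANCE_MAX;    /* 40: stay conservative */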
Regards,
Erich
On Tue, Jan 14, 2003 at 04:55:06PM +0100, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.
This one looks a lot nicer. You might also want to take the different
nr_running stuff from the patch I posted, I think it looks a lot nicer.
The patch (not updated yet) is below again for reference.
--- 1.62/fs/exec.c Fri Jan 10 08:21:00 2003
+++ edited/fs/exec.c Mon Jan 13 15:33:32 2003
@@ -1031,6 +1031,8 @@
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
--- 1.119/include/linux/sched.h Sat Jan 11 07:44:15 2003
+++ edited/include/linux/sched.h Mon Jan 13 15:58:11 2003
@@ -444,6 +444,14 @@
# define set_cpus_allowed(p, new_mask) do { } while (0)
#endif
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+# define sched_balance_exec() do { } while (0)
+# define node_nr_running_init() do { } while (0)
+#endif
+
extern void set_user_nice(task_t *p, long nice);
extern int task_prio(task_t *p);
extern int task_nice(task_t *p);
--- 1.91/init/main.c Mon Jan 6 04:08:49 2003
+++ edited/init/main.c Mon Jan 13 15:33:33 2003
@@ -495,6 +495,7 @@
migration_init();
#endif
+ node_nr_running_init();
spawn_ksoftirqd();
}
--- 1.148/kernel/sched.c Sat Jan 11 07:44:22 2003
+++ edited/kernel/sched.c Mon Jan 13 16:17:34 2003
@@ -67,6 +67,7 @@
#define INTERACTIVE_DELTA 2
#define MAX_SLEEP_AVG (2*HZ)
#define STARVATION_LIMIT (2*HZ)
+#define NODE_BALANCE_RATIO 10
/*
* If a task is 'interactive' then we reinsert it in the active
@@ -154,6 +155,11 @@
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
+#ifdef CONFIG_NUMA
+ atomic_t *node_nr_running;
+ int nr_balanced;
+#endif
+
task_t *migration_thread;
struct list_head migration_queue;
@@ -178,6 +184,38 @@
#endif
/*
+ * Keep track of running tasks.
+ */
+#if CONFIG_NUMA
+
+/* XXX(hch): this should go into a struct sched_node_data */
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+ {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
+static inline void nr_running_init(struct runqueue *rq)
+{
+ rq->node_nr_running = &node_nr_running[0];
+}
+
+static inline void nr_running_inc(struct runqueue *rq)
+{
+ atomic_inc(rq->node_nr_running);
+ rq->nr_running++;
+}
+
+static inline void nr_running_dec(struct runqueue *rq)
+{
+ atomic_dec(rq->node_nr_running);
+ rq->nr_running--;
+}
+
+#else
+# define nr_running_init(rq) do { } while (0)
+# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
+/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
* explicitly disabling preemption.
@@ -294,7 +332,7 @@
p->prio = effective_prio(p);
}
enqueue_task(p, array);
- rq->nr_running++;
+ nr_running_inc(rq);
}
/*
@@ -302,7 +340,7 @@
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
- rq->nr_running--;
+ nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
dequeue_task(p, p->array);
@@ -624,9 +662,108 @@
spin_unlock(&rq2->lock);
}
-#if CONFIG_SMP
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpu_allowed mask to only
+ * allow dest_cpu, which will force the cpu onto dest_cpu. Then
+ * the cpu_allowed mask is restored.
+ *
+ * Note: This isn't actually numa-specific, but just not used otherwise.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask = p->cpus_allowed;
+
+ if (old_mask & (1UL << dest_cpu)) {
+ unsigned long flags;
+ struct runqueue *rq;
+
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ rq = task_rq_lock(p, &flags);
+ p->cpus_allowed = old_mask;
+ task_rq_unlock(rq, &flags);
+ }
+}
/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, load, best_cpu, node = 0;
+ unsigned long cpumask;
+
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ minload = 10000000;
+ for (i = 0; i < numnodes; i++) {
+ load = atomic_read(&node_nr_running[i]);
+ if (load < minload) {
+ minload = load;
+ node = i;
+ }
+ }
+
+ minload = 10000000;
+ cpumask = __node_to_cpu_mask(node);
+ for (i = 0; i < NR_CPUS; ++i) {
+ if (!(cpumask & (1UL << i)))
+ continue;
+ if (cpu_rq(i)->nr_running < minload) {
+ best_cpu = i;
+ minload = cpu_rq(i)->nr_running;
+ }
+ }
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+
+static int find_busiest_node(int this_node)
+{
+ int i, node = this_node, load, this_load, maxload;
+
+ this_load = maxload = atomic_read(&node_nr_running[this_node]);
+ for (i = 0; i < numnodes; i++) {
+ if (i == this_node)
+ continue;
+ load = atomic_read(&node_nr_running[i]);
+ if (load > maxload && (4*load > ((5*4*this_load)/4))) {
+ maxload = load;
+ node = i;
+ }
+ }
+
+ return node;
+}
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++)
+ cpu_rq(i)->node_nr_running = node_nr_running + __cpu_to_node(i);
+}
+#endif /* CONFIG_NUMA */
+
+#if CONFIG_SMP
+/*
* double_lock_balance - lock the busiest runqueue
*
* this_rq is locked already. Recalculate nr_running if we have to
@@ -652,9 +789,10 @@
}
/*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_queue - find the busiest runqueue among the cpus in cpumask
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu,
+ int idle, int *imbalance, unsigned long cpumask)
{
int nr_running, load, max_load, i;
runqueue_t *busiest, *rq_src;
@@ -689,7 +827,7 @@
busiest = NULL;
max_load = 1;
for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
+ if (!cpu_online(i) || !((1UL << i) & cpumask))
continue;
rq_src = cpu_rq(i);
@@ -736,9 +874,9 @@
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
dequeue_task(p, src_array);
- src_rq->nr_running--;
+ nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
- this_rq->nr_running++;
+ nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
@@ -758,13 +896,27 @@
*/
static void load_balance(runqueue_t *this_rq, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, idx, this_cpu, this_node;
+ unsigned long cpumask;
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance);
+ this_cpu = smp_processor_id();
+ this_node = __cpu_to_node(this_cpu);
+ cpumask = __node_to_cpu_mask(this_node);
+
+#if CONFIG_NUMA
+ /*
+ * Avoid rebalancing between nodes too often.
+ */
+ if (!(++this_rq->nr_balanced % NODE_BALANCE_RATIO))
+ cpumask |= __node_to_cpu_mask(find_busiest_node(this_node));
+#endif
+
+ busiest = find_busiest_queue(this_rq, this_cpu, idle,
+ &imbalance, cpumask);
if (!busiest)
goto out;
@@ -2231,6 +2383,7 @@
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
+ nr_running_init(rq);
for (j = 0; j < 2; j++) {
array = rq->arrays + j;
In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
and this was present in the numa-sched-2.5.58.patch and
numa-sched-add-2.5.58.patch, too. Please use the patches attached to
this email! Sorry for the silly mistake...
Christoph, I used your way of coding nr_running_inc/dec now.
Regards,
Erich
On Tuesday 14 January 2003 16:55, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.
>
> The whole patch is really tiny: 9.5k. This time I attached the numa
> scheduler in form of two patches:
>
> numa-sched-2.5.58.patch (7k) : components 01, 02, 03
> numa-sched-add-2.5.58.patch (3k) : components 04, 05
>
> The single components are also attached in a small tgz archive:
>
> 01-minisched-2.5.58.patch : the miniature scheduler from
> Martin. Balances strictly within a node. Removed the
> find_busiest_in_mask() function.
>
> 02-initial-lb-2.5.58.patch : Michael's initial load balancer at
> exec(). Cosmetic corrections suggested by Christoph.
>
> 03-internode-lb-2.5.58.patch : internode load balancer core. Called
> after NODE_BALANCE_RATE calls of the inter-node load balancer. Tunable
> parameters:
> NODE_BALANCE_RATE (default: 10)
> NODE_THRESHOLD (default: 125) : consider only nodes with load
> above NODE_THRESHOLD/100 * own_node_load
> I added the constant factor of 4 suggested by Michael, but I'm not
> really happy with it. This should be nr_cpus_in_node, but we don't
> have that info in topology.h
>
> 04-smooth-node-load-2.5.58.patch : The node load measure is smoothed
> by adding half of the previous node load (and 1/4 of the one before,
> etc..., as discussed in the LSE call). This should improve a bit the
> behavior in case of short timed load peaks and avoid bouncing tasks
> between nodes.
>
> 05-var-intnode-lb-2.5.58.patch : Replaces the fixed NODE_BALANCE_RATE
> interval (between cross-node balancer calls) by a variable
> node-specific interval. Currently only two values used:
> NODE_BALANCE_MIN : 10
> NODE_BALANCE_MAX : 40
> If the node load is less than avg_node_load/2, we switch to
> NODE_BALANCE_MIN, otherwise we use the large interval.
> I also added a function to reduce the number of tasks stolen from
> remote nodes.
>
> Regards,
> Erich
Aargh, I should have gone home earlier...
For those who really care about patch 05, it's attached. It's all
untested as I don't have an ia32 NUMA machine running 2.5.58...
Erich
On Tuesday 14 January 2003 17:23, Erich Focht wrote:
> In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
> and this was present in the numa-sched-2.5.58.patch and
> numa-sched-add-2.5.58.patch, too. Please use the patches attached to
> this email! Sorry for the silly mistake...
>
> Christoph, I used your way of coding nr_running_inc/dec now.
>
> Regards,
> Erich
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#else
+#define sched_balance_exec() {}
+#endif
You accidentally (?) removed the stub for node_nr_running_init.
Also sched.h used # define inside ifdefs.
+#ifdef CONFIG_NUMA
+ atomic_t * node_ptr;
The name is still a bit non-descriptive and the * placed wrong :)
What about atomic_t *nr_running_at_node?
+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
+ {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
+
I think my two comments here were pretty useful :)
+static inline void nr_running_dec(runqueue_t *rq)
+{
+ atomic_dec(rq->node_ptr);
+ rq->nr_running--;
+}
+
+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++)
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+}
+#else
+# define nr_running_init(rq) do { } while (0)
+# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
+# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
+#endif /* CONFIG_NUMA */
+
/*
@@ -689,7 +811,7 @@ static inline runqueue_t *find_busiest_q
busiest = NULL;
max_load = 1;
for (i = 0; i < NR_CPUS; i++) {
- if (!cpu_online(i))
+ if (!cpu_online(i) || !((1UL << i) & cpumask) )
spurious whitespace before the closing parenthesis.
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;
+ int this_node = __cpu_to_node(this_cpu);
+ unsigned long cpumask = __node_to_cpu_mask(this_node);
If that's not too much style nitpicking: put this_node on one line with all the
other local ints and initialize all three vars after the declarations (like
in my patch *duck*)
+#if CONFIG_NUMA
+ rq->node_ptr = &node_nr_running[0];
+#endif /* CONFIG_NUMA */
I had a nr_running_init() abstraction for this, but you only took it
partially. It would be nice to merge the last bit to get rid of this ifdef.
Else the patch looks really, really good and I'm looking forward to seeing
it in mainline real soon!
On Tue, 2003-01-14 at 08:43, Erich Focht wrote:
> Aargh, I should have gone home earlier...
> For those who really care about patch 05, it's attached. It's all
> untested as I don't have a ia32 NUMA machine running 2.5.58...
One more minor problem - the first two patches are missing the
following defines, and result in compile issues:
#define MAX_INTERNODE_LB 40
#define MIN_INTERNODE_LB 4
#define NODE_BALANCE_RATIO 10
Looking through previous patches, and the 05 patch, I found
these defines and put them under the #if CONFIG_NUMA in sched.c
that defines node_nr_running and friends.
With these three lines added, I have a kernel built and booted
using the first numa-sched and numa-sched-add patches.
Test results will follow later in the day.
Michael
>
> Erich
>
>
> On Tuesday 14 January 2003 17:23, Erich Focht wrote:
> > In the previous email the patch 02-initial-lb-2.5.58.patch had a bug
> > and this was present in the numa-sched-2.5.58.patch and
> > numa-sched-add-2.5.58.patch, too. Please use the patches attached to
> > this email! Sorry for the silly mistake...
> >
> > Christoph, I used your way of coding nr_running_inc/dec now.
> >
> > Regards,
> > Erich
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
On Tue, 2003-01-14 at 11:02, Michael Hohnbaum wrote:
> On Tue, 2003-01-14 at 08:43, Erich Focht wrote:
> > Aargh, I should have gone home earlier...
> > For those who really care about patch 05, it's attached. It's all
> > untested as I don't have a ia32 NUMA machine running 2.5.58...
>
> One more minor problem - the first two patches are missing the
> following defines, and result in compile issues:
>
> #define MAX_INTERNODE_LB 40
> #define MIN_INTERNODE_LB 4
> #define NODE_BALANCE_RATIO 10
>
> Looking through previous patches, and the 05 patch, I found
> these defines and put them under the #if CONFIG_NUMA in sched.c
> that defines node_nr_running and friends.
>
> With these three lines added, I have a kernel built and booted
> using the first numa-sched and numa-sched-add patches.
>
> Test results will follow later in the day.
Trying to apply the 05 patch, I discovered that it was already
in there. Something is messed up with the combined patches, so
I went back to the tgz file you provided and started over. I'm
not sure what the kernel is that I built and tested earlier today,
but I suspect it was, for the most part, the complete patchset
(i.e., patches 1-5). Building a kernel with patches 1-4 from
the tgz file does not need the additional defines mentioned in
my previous email.
Testing is starting from scratch with a known patch base. The
plan is to test with patches 1-4, then add in 05. I should have
some numbers for you before the end of my day. btw, the numbers
looked real good for the runs on whatever kernel it was that I
built this morning.
>
> Michael
>
> > Erich
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
On Tuesday 14 January 2003 16:55, Erich Focht wrote:
> Here's the new version of the NUMA scheduler built on top of the
> miniature scheduler of Martin. I incorporated Michael's ideas and
> Christoph's suggestions and rediffed for 2.5.58.
>
> The whole patch is really tiny: 9.5k. This time I attached the numa
> scheduler in form of two patches:
Ran tests on three different kernels:
stock58 - linux 2.5.58 with the cputime_stats patch
sched1-4-58 - stock58 with the first 4 NUMA scheduler patches
sched1-5-58 - stock58 with all 5 NUMA scheduler patches
Kernbench:
Elapsed User System CPU
stock58 31.656s 305.85s 89.232s 1248.2%
sched1-4-58 29.886s 287.506s 82.94s 1239%
sched1-5-58 29.994s 288.796s 84.706s 1245%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
stock58 27.73 42.80 110.96 0.85
sched1-4-58 32.86 46.41 131.47 0.85
sched1-5-58 32.37 45.34 129.52 0.89
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
stock58 45.97 61.87 367.81 2.11
sched1-4-58 31.39 49.18 251.22 2.15
sched1-5-58 37.52 61.32 300.22 2.06
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
stock58 60.91 83.63 974.71 6.18
sched1-4-58 54.31 62.11 869.11 3.84
sched1-5-58 51.60 59.05 825.72 4.74
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
stock58 84.26 195.16 2696.65 16.53
sched1-4-58 61.49 140.51 1968.06 9.57
sched1-5-58 55.23 117.32 1767.71 7.78
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
stock58 123.27 511.77 7889.77 27.78
sched1-4-58 63.39 266.40 4057.92 20.55
sched1-5-58 59.57 250.25 3813.39 17.05
One anomaly noted was that the kernbench system time went up
about 5% with the 2.5.58 kernel from what it was on the last
version I tested with (2.5.55). This increase shows up both in stock
and with the NUMA scheduler, so it is not caused by the NUMA scheduler.
Now that I've got baselines for these, I plan to next look at
tweaking various parameters within the scheduler and see what
happens. Also, I owe Erich numbers running hackbench. Overall, I am
pleased with these results.
And just for grins, here are the detailed results for running
numa_test 32
sched1-4-58:
Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 9.044
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 100.0 0.0 0.0 | 1 1 | 55.04
2 0.0 0.0 4.8 95.2 | 2 3 *| 70.38
3 0.0 0.0 3.2 96.8 | 2 3 *| 72.72
4 0.0 26.4 2.8 70.7 | 2 3 *| 72.99
5 100.0 0.0 0.0 0.0 | 0 0 | 57.03
6 100.0 0.0 0.0 0.0 | 0 0 | 55.06
7 100.0 0.0 0.0 0.0 | 0 0 | 57.18
8 0.0 100.0 0.0 0.0 | 1 1 | 55.38
9 100.0 0.0 0.0 0.0 | 0 0 | 54.37
10 0.0 100.0 0.0 0.0 | 1 1 | 56.06
11 0.0 13.2 0.0 86.8 | 3 3 | 64.33
12 0.0 0.0 0.0 100.0 | 3 3 | 62.35
13 1.7 0.0 98.3 0.0 | 2 2 | 67.47
14 100.0 0.0 0.0 0.0 | 0 0 | 55.94
15 0.0 29.4 61.9 8.6 | 3 2 *| 78.76
16 0.0 100.0 0.0 0.0 | 1 1 | 56.42
17 18.9 0.0 74.9 6.2 | 3 2 *| 70.57
18 0.0 0.0 100.0 0.0 | 2 2 | 63.01
19 0.0 100.0 0.0 0.0 | 1 1 | 55.97
20 0.0 0.0 92.7 7.3 | 3 2 *| 65.62
21 0.0 0.0 100.0 0.0 | 2 2 | 62.70
22 0.0 100.0 0.0 0.0 | 1 1 | 55.53
23 0.0 1.5 0.0 98.5 | 3 3 | 56.95
24 0.0 100.0 0.0 0.0 | 1 1 | 55.75
25 0.0 30.0 2.3 67.7 | 2 3 *| 77.78
26 0.0 0.0 0.0 100.0 | 3 3 | 57.71
27 13.6 0.0 86.4 0.0 | 0 2 *| 66.55
28 100.0 0.0 0.0 0.0 | 0 0 | 55.43
29 0.0 100.0 0.0 0.0 | 1 1 | 56.12
30 19.8 0.0 62.5 17.6 | 3 2 *| 66.92
31 100.0 0.0 0.0 0.0 | 0 0 | 54.90
32 100.0 0.0 0.0 0.0 | 0 0 | 54.70
AverageUserTime 61.49 seconds
ElapsedTime 140.51
TotalUserTime 1968.06
TotalSysTime 9.57
sched1-5-58:
Executing 32 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 9.145
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 54.88
2 0.0 100.0 0.0 0.0 | 1 1 | 54.08
3 0.0 0.0 0.0 100.0 | 3 3 | 55.48
4 0.0 0.0 0.0 100.0 | 3 3 | 55.47
5 100.0 0.0 0.0 0.0 | 0 0 | 53.84
6 100.0 0.0 0.0 0.0 | 0 0 | 53.37
7 0.0 0.0 0.0 100.0 | 3 3 | 55.41
8 90.9 9.1 0.0 0.0 | 1 0 *| 55.58
9 0.0 100.0 0.0 0.0 | 1 1 | 55.61
10 0.0 100.0 0.0 0.0 | 1 1 | 54.56
11 0.0 0.0 98.1 1.9 | 2 2 | 56.25
12 0.0 0.0 0.0 100.0 | 3 3 | 55.07
13 0.0 0.0 0.0 100.0 | 3 3 | 54.92
14 0.0 100.0 0.0 0.0 | 1 1 | 54.59
15 100.0 0.0 0.0 0.0 | 0 0 | 55.10
16 5.0 0.0 95.0 0.0 | 2 2 | 56.97
17 0.0 0.0 100.0 0.0 | 2 2 | 55.51
18 100.0 0.0 0.0 0.0 | 0 0 | 53.97
19 0.0 4.7 95.3 0.0 | 2 2 | 57.21
20 0.0 0.0 100.0 0.0 | 2 2 | 55.53
21 0.0 0.0 100.0 0.0 | 2 2 | 56.46
22 0.0 0.0 100.0 0.0 | 2 2 | 55.48
23 0.0 0.0 0.0 100.0 | 3 3 | 55.99
24 0.0 100.0 0.0 0.0 | 1 1 | 55.32
25 0.0 6.2 93.8 0.0 | 2 2 | 57.66
26 0.0 0.0 0.0 100.0 | 3 3 | 55.60
27 0.0 100.0 0.0 0.0 | 1 1 | 54.65
28 0.0 0.0 0.0 100.0 | 3 3 | 56.39
29 0.0 100.0 0.0 0.0 | 1 1 | 54.91
30 100.0 0.0 0.0 0.0 | 0 0 | 53.58
31 100.0 0.0 0.0 0.0 | 0 0 | 53.53
32 31.5 68.4 0.0 0.0 | 0 1 *| 54.41
AverageUserTime 55.23 seconds
ElapsedTime 117.32
TotalUserTime 1767.71
TotalSysTime 7.78
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
OK, ran some tests on incremental versions of the stack.
newsched3 = patches 1+2+3 ... etc.
oldsched = Erich's old code + Michael's ilb.
newsched4-tune = patches 1,2,3,4 + tuning patch below:
Tuning seems to help quite a bit ... need to stick this into arch topo code.
Sleep time now ;-)
Kernbench:
Elapsed User System CPU
2.5.58-mjb1 19.522s 186.566s 41.516s 1167.8%
2.5.58-mjb1-oldsched 19.488s 186.73s 42.382s 1175.6%
2.5.58-mjb1-newsched2 19.286s 186.418s 40.998s 1178.8%
2.5.58-mjb1-newsched3 19.58s 187.658s 43.694s 1181.2%
2.5.58-mjb1-newsched4 19.266s 187.772s 42.984s 1197.4%
2.5.58-mjb1-newsched4-tune 19.424s 186.664s 41.422s 1173.6%
2.5.58-mjb1-newsched5 19.462s 187.692s 43.02s 1185%
NUMA schedbench 4:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1
2.5.58-mjb1-oldsched 0.00 35.16 88.55 0.68
2.5.58-mjb1-newsched2 0.00 19.12 63.71 0.48
2.5.58-mjb1-newsched3 0.00 35.73 88.26 0.58
2.5.58-mjb1-newsched4 0.00 35.64 88.46 0.60
2.5.58-mjb1-newsched4-tune 0.00 37.10 91.99 0.58
2.5.58-mjb1-newsched5 0.00 35.34 88.60 0.64
NUMA schedbench 8:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 35.34 88.60 0.64
2.5.58-mjb1-oldsched 0.00 64.01 338.77 1.50
2.5.58-mjb1-newsched2 0.00 31.56 227.72 1.03
2.5.58-mjb1-newsched3 0.00 35.44 220.63 1.36
2.5.58-mjb1-newsched4 0.00 35.47 223.86 1.33
2.5.58-mjb1-newsched4-tune 0.00 37.04 232.92 1.14
2.5.58-mjb1-newsched5 0.00 36.11 223.14 1.39
NUMA schedbench 16:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 36.11 223.14 1.39
2.5.58-mjb1-oldsched 0.00 62.60 834.67 4.85
2.5.58-mjb1-newsched2 0.00 57.24 850.12 2.64
2.5.58-mjb1-newsched3 0.00 64.15 870.25 3.18
2.5.58-mjb1-newsched4 0.00 64.01 875.17 3.10
2.5.58-mjb1-newsched4-tune 0.00 57.84 841.48 2.96
2.5.58-mjb1-newsched5 0.00 61.87 828.37 3.47
NUMA schedbench 32:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 61.87 828.37 3.47
2.5.58-mjb1-oldsched 0.00 154.30 2031.93 9.35
2.5.58-mjb1-newsched2 0.00 117.75 1798.53 5.52
2.5.58-mjb1-newsched3 0.00 122.87 1771.71 8.33
2.5.58-mjb1-newsched4 0.00 134.86 1863.51 8.27
2.5.58-mjb1-newsched4-tune 0.00 118.18 1809.38 6.58
2.5.58-mjb1-newsched5 0.00 134.36 1853.94 8.33
NUMA schedbench 64:
AvgUser Elapsed TotalUser TotalSys
2.5.58-mjb1 0.00 134.36 1853.94 8.33
2.5.58-mjb1-oldsched 0.00 318.68 4852.81 21.47
2.5.58-mjb1-newsched2 0.00 241.11 3603.29 12.70
2.5.58-mjb1-newsched3 0.00 258.72 3977.50 16.88
2.5.58-mjb1-newsched4 0.00 252.87 3850.55 18.51
2.5.58-mjb1-newsched4-tune 0.00 235.43 3627.28 15.90
2.5.58-mjb1-newsched5 0.00 265.09 3967.70 18.81
--- sched.c.premjb4 2003-01-14 22:12:36.000000000 -0800
+++ sched.c 2003-01-14 22:20:19.000000000 -0800
@@ -85,7 +85,7 @@
#define NODE_THRESHOLD 125
#define MAX_INTERNODE_LB 40
#define MIN_INTERNODE_LB 4
-#define NODE_BALANCE_RATIO 10
+#define NODE_BALANCE_RATIO 250
/*
* If a task is 'interactive' then we reinsert it in the active
@@ -763,7 +763,8 @@
+ atomic_read(&node_nr_running[i]);
this_rq()->prev_node_load[i] = load;
if (load > maxload &&
- (100*load > ((NODE_THRESHOLD*100*this_load)/100))) {
+ (100*load > (NODE_THRESHOLD*this_load))
+ && load > this_load + 4) {
maxload = load;
node = i;
}
Thanks for your patience with the problems in yesterday's patches. I'm
resending the patches in the same form. I made the following changes:
- moved NODE_BALANCE_{RATE,MIN,MAX} to topology.h
- removed the divide in the find_busiest_node() loop (thanks, Martin)
- removed the modulo (%) in the cross-node balancing trigger
- re-added node_nr_running_init() stub, nr_running_init() and comments
from Christoph
- removed the constant factor 4 in find_busiest_node. The
find_busiest_queue routine will take care of the case where the
  busiest_node is running only a few processes (at most one per CPU) and
  return busiest=NULL.
I hope we can start tuning the parameters now. In the basic NUMA
scheduler part these are:
NODE_THRESHOLD : minimum percentage of node overload to
trigger cross-node balancing
NODE_BALANCE_RATE : arch specific, cross-node balancing is called
after this many intra-node load balance calls
In the extended NUMA scheduler the fixed value of NODE_BALANCE_RATE is
replaced by a variable rate set to :
NODE_BALANCE_MIN : if own node load is less than avg_load/2
NODE_BALANCE_MAX : if load is larger than avg_load/2
Together with the reduced number of steals across nodes this might
help us in achieving equal load among nodes. I'm not aware of any
simple benchmark which can demonstrate this...
Regards,
Erich
On Wed, 2003-01-15 at 07:10, Erich Focht wrote:
> Thanks for the patience with the problems in the yesterday patches. I
> resend the patches in the same form. Made following changes:
>
> - moved NODE_BALANCE_{RATE,MIN,MAX} to topology.h
> - removed the divide in the find_busiest_node() loop (thanks, Martin)
> - removed the modulo (%) in the cross-node balancing trigger
> - re-added node_nr_running_init() stub, nr_running_init() and comments
> from Christoph
> - removed the constant factor 4 in find_busiest_node. The
> find_busiest_queue routine will take care of the case where the
> busiest_node is running only few processes (at most one per CPU) and
> return busiest=NULL .
These applied cleanly and looked good. I ran numbers for kernels
with patches 1-4 and patches 1-5. Results below.
>
> I hope we can start tuning the parameters now. In the basic NUMA
> scheduler part these are:
> NODE_THRESHOLD : minimum percentage of node overload to
> trigger cross-node balancing
> NODE_BALANCE_RATE : arch specific, cross-node balancing is called
> after this many intra-node load balance calls
I need to spend some time staring at this code and putting together
a series of tests to try. I think the basic mechanism looks good, so
hopefully we are down to finding the best numbers for the architecture.
>
> In the extended NUMA scheduler the fixed value of NODE_BALANCE_RATE is
> replaced by a variable rate set to :
> NODE_BALANCE_MIN : if own node load is less than avg_load/2
> NODE_BALANCE_MAX : if load is larger than avg_load/2
> Together with the reduced number of steals across nodes this might
> help us in achieving equal load among nodes. I'm not aware of any
> simple benchmark which can demonstrate this...
>
> Regards,
> Erich
> ----
>
results:
stock58h: linux 2.5.58 with cputime stats patch
sched14r2-58: stock58h with latest NUMA sched patches 1 through 4
sched15r2-58: stock58h with latest NUMA sched patches 1 through 5
sched1-4-58: stock58h with previous NUMA sched patches 1 through 4
Kernbench:
Elapsed User System CPU
sched14r2-58 29.488s 284.02s 82.132s 1241.8%
sched15r2-58 29.778s 282.792s 83.478s 1229.8%
stock58h 31.656s 305.85s 89.232s 1248.2%
sched1-4-58 29.886s 287.506s 82.94s 1239%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 22.50 37.20 90.03 0.94
sched15r2-58 16.63 24.23 66.56 0.69
stock58h 27.73 42.80 110.96 0.85
sched1-4-58 32.86 46.41 131.47 0.85
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 30.27 43.22 242.23 1.75
sched15r2-58 30.90 42.46 247.28 1.48
stock58h 45.97 61.87 367.81 2.11
sched1-4-58 31.39 49.18 251.22 2.15
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 52.78 57.28 844.61 3.70
sched15r2-58 48.44 65.31 775.25 3.30
stock58h 60.91 83.63 974.71 6.18
sched1-4-58 54.31 62.11 869.11 3.84
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 56.60 116.99 1811.56 5.94
sched15r2-58 56.42 116.75 1805.82 6.45
stock58h 84.26 195.16 2696.65 16.53
sched1-4-58 61.49 140.51 1968.06 9.57
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
sched14r2-58 56.48 232.63 3615.55 16.02
sched15r2-58 56.38 236.25 3609.03 15.41
stock58h 123.27 511.77 7889.77 27.78
sched1-4-58 63.39 266.40 4057.92 20.55
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
I'm not keen on the way the minisched patch got reformatted. I changed
it into a separate function, which I think is much cleaner by the time
you've added the third patch - no #ifdef CONFIG_NUMA in load_balance.
Rejigged patches attached, no functional changes.
Anyway, I perf tested this, and it comes out more or less the same as
the tuned version I was poking at last night (ie best of the bunch).
Looks pretty good to me.
M.
PS. The fourth patch was so small, and touched the same stuff as 3,
that I rolled it into the third one here. Seems like a universal
benefit ;-)
Hi Martin and Michael,
thanks for testing again!
On Thursday 16 January 2003 07:05, Martin J. Bligh wrote:
> I'm not keen on the way the minisched patch got reformatted. I changed
> it into a separate function, which I think is much cleaner by the time
> you've added the third patch - no #ifdef CONFIG_NUMA in load_balance.
Fine. This form is also nearer to the codingstyle rule: "functions
should do only one thing" (I'm reading those more carefully now ;-)
> Anyway, I perf tested this, and it comes out more or less the same as
> the tuned version I was poking at last night (ie best of the bunch).
> Looks pretty good to me.
Great!
> PS. The fourth patch was so small, and touching the same stuff as 3
> that I rolled it into the third one here. Seems like a universal
> benefit ;-)
Yes, it's a much smaller step than patch #5. It would make sense to
have this included right from the start.
Regards,
Erich
On Thu, 2003-01-16 at 11:47, Erich Focht wrote:
> Fine. This form is also nearer to the codingstyle rule: "functions
> should do only one thing" (I'm reading those more carefully now ;-)
Good ;)
This is looking good. Thanks hch for going over it with your fine tooth
comb.
Erich and Martin, what more needs to be done prior to inclusion? Do you
still want an exec balancer in place?
Robert Love
> On Thu, 2003-01-16 at 11:47, Erich Focht wrote:
>
>> Fine. This form is also nearer to the codingstyle rule: "functions
>> should do only one thing" (I'm reading those more carefully now ;-)
>
> Good ;)
>
> This is looking good. Thanks hch for going over it with your fine tooth
> comb.
>
> Erich and Martin, what more needs to be done prior to inclusion? Do you
> still want an exec balancer in place?
Yup, that's in patch 2 already. I just pushed it ... will see what
happens.
M.
> well, it needs to settle down a bit more, we are technically in a
> codefreeze :-)
The current codeset is *completely* non-invasive to non-NUMA systems.
It can't break anything.
M.
On Thu, Jan 16, 2003 at 08:07:09PM +0100, Ingo Molnar wrote:
> well, it needs to settle down a bit more, we are technically in a
> codefreeze :-)
We're in feature freeze. Not sure whether fixing the scheduler for
one type of hardware supported by Linux is a feature 8)
Anyway, patch 1 should certainly merged ASAP, for the other we can wait
a bit more to settle, but I don't think it's really worth the wait.
On 16 Jan 2003, Robert Love wrote:
> > Fine. This form is also nearer to the codingstyle rule: "functions
> > should do only one thing" (I'm reading those more carefully now ;-)
>
> Good ;)
>
> This is looking good. Thanks hch for going over it with your fine tooth
> comb.
>
> Erich and Martin, what more needs to be done prior to inclusion? Do you
> still want an exec balancer in place?
well, it needs to settle down a bit more, we are technically in a
codefreeze :-)
Ingo
On Thu, 16 Jan 2003, Christoph Hellwig wrote:
> > well, it needs to settle down a bit more, we are technically in a
> > codefreeze :-)
>
> We're in feature freeze. Not sure whether fixing the scheduler for one
> type of hardware supported by Linux is a feature 8)
>
> Anyway, patch 1 should certainly merged ASAP, for the other we can wait
> a bit more to settle, but I don't think it's really worth the wait.
agreed, the patch is unintrusive, but by settling down i mean things like
this:
+/* XXX(hch): this should go into a struct sched_node_data */
should be decided one way or another.
i'm also not quite happy about the conceptual background of
rq->nr_balanced. This load-balancing rate-limit is arbitrary and not
neutral at all. The way this should be done is to move the inter-node
balancing conditional out of load_balance(), and only trigger it from the
timer interrupt, with a given rate. On basically all NUMA hardware i
suspect it's much better to do inter-node balancing only on a very slow
scale. Making it dependent on an arbitrary portion of the idle-CPU
rebalancing act makes the frequency of inter-node rebalancing almost
arbitrarily high.
ie. there are two basic types of rebalancing acts in multiprocessor
environments: 'synchronous balancing' and 'asynchronous balancing'.
Synchronous balancing is done whenever a CPU runs idle - this can happen
at a very high rate, so it needs to be low overhead and unintrusive. This
was already so when i did the SMP balancer. The asynchronous balancing
component (currently directly triggered from every CPU's own timer
interrupt), has a fixed frequency, and thus can be almost arbitrarily
complex. It's the one that is aware of the global scheduling picture. For
NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
and an inter-node frequency - configured by the architecture and roughly
in the same proportion to each other as cachemiss latencies.
(this all means that unless there's empirical data showing the opposite,
->nr_balanced can be removed completely, and fixed frequency balancing can
be done from the timer tick. This should further simplify the patch.)
Ingo
> > > well, it needs to settle down a bit more, we are technically in a
> > > codefreeze :-)
> >
> > We're in feature freeze. Not sure whether fixing the scheduler for one
> > type of hardware supported by Linux is a feature 8)
Yes, we are definitely _not_ in a code freeze yet, and I doubt that we
will be for at least a few months.
John.
> complex. It's the one that is aware of the global scheduling picture. For
> NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
> and an inter-node frequency - configured by the architecture and roughly
> in the same proportion to each other as cachemiss latencies.
That's exactly what's in the latest set of patches - admittedly it's a
multiplier of when we run load_balance, not the tick multiplier, but
that's very easy to fix. Can you check out the stuff I posted last night?
I think it's somewhat cleaner ...
M.
On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> > complex. It's the one that is aware of the global scheduling picture. For
> > NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
> > and an inter-node frequency - configured by the architecture and roughly
> > in the same proportion to each other as cachemiss latencies.
>
> That's exactly what's in the latest set of patches - admittedly it's a
> multiplier of when we run load_balance, not the tick multiplier, but
> that's very easy to fix. Can you check out the stuff I posted last
> night? I think it's somewhat cleaner ...
yes, i saw it, it has the same tying between idle-CPU-rebalance and
inter-node rebalance, as Erich's patch. You've put it into
cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
coupled balancing act. There are two synchronous balancing acts currently:
the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
must involve any 'heavy' balancing, only local balancing. The inter-node
balancing (which is heavier than even the global SMP balancer), should
never be triggered from the high-frequency path. [whether it's high
frequency or not depends on the actual workload, but it can be potentially
_very_ high frequency, easily on the order of 1 million times a second -
then you'll call the inter-node balancer 100K times a second.]
I'd strongly suggest to decouple the heavy NUMA load-balancing code from
the fastpath and re-check the benchmark numbers.
Ingo
(*) whether sched_balance_exec() is a high-frequency path or not is up to
debate. Right now it's not possible to get much more than a couple of
thousand exec()'s per second on fast CPUs. Hopefully that will change in
the future though, so exec() events could become really fast. So i'd
suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
only do NUMA cross-node balancing with a fixed frequency, from the timer
tick. But exec()-time is really special, since the user task usually has
zero cached state at this point, so we _can_ do cheap cross-node balancing
as well. So it's a boundary thing - probably doing the full-blown
balancing is the right thing.
[whether it's high
frequency or not depends on the actual workload, but it can be potentially
_very_ high frequency, easily on the order of 1 million times a second -
then you'll call the inter-node balancer 100K times a second.]
If this is due to thread creation/death, though, you might want this level
of inter-node balancing (or at least checking). It could represent a lot
of fork/execs that are now overloading one or more nodes. Is it reasonable
to expect this sort of load on a relatively proc/thread-stable machine?
Rick
On Thu, 2003-01-16 at 12:19, Ingo Molnar wrote:
>
> Ingo
>
> (*) whether sched_balance_exec() is a high-frequency path or not is up to
> debate. Right now it's not possible to get much more than a couple of
> thousand exec()'s per second on fast CPUs. Hopefully that will change in
> the future though, so exec() events could become really fast. So i'd
> suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
> only do NUMA cross-node balancing with a fixed frequency, from the timer
> tick. But exec()-time is really special, since the user task usually has
> zero cached state at this point, so we _can_ do cheap cross-node balancing
> as well. So it's a boundary thing - probably doing the full-blown
> balancing is the right thing.
>
The reason for doing load balancing on every exec is that, as you say,
it is cheap to do the balancing at this point - no cache state, minimal
memory allocations. If we did not balance at this point and relied on
balancing from the timer tick, there would be much more movement of
established processes between nodes, which is expensive. Ideally, the
exec balancing is good enough so that on a well functioning system there
is no inter-node balancing taking place at the timer tick. Our testing
has shown that the exec load balance code does a very good job of
spreading processes across nodes, and thus, very little timer tick
balancing. The timer tick internode balancing is there as a safety
valve for those cases where exec balancing is not adequate. Workloads
with long running processes, and workloads with processes that do lots
of forks but not execs, are examples.
An earlier version of Erich's initial load balancer provided for
the option of balancing at either fork or exec, and the capability
of selecting on a per-process basis which to use. That could be
used if workloads that do forks with no execs become a problem on
NUMA boxes.
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing. The inter-node
If I understand that correctly (and I'm not sure I do), you're saying
you don't think the exec time balance should go global? That would
break most of the concept ... *something* has to distribute stuff around
nodes, and the exec point is the cheapest time to do that (least "weight"
to move. I'd like to base it off cached load-averages, rather than going
sniffing every runqueue, but if you're meaning we should only exec time
balance inside a node, I disagree. Am I just misunderstanding you?
At the moment, the high-freq balancer is only inside a node. Exec balancing
is global, and the "low-frequency" balancer is global. WRT the idle-time
balancing, I agree with what I *think* you're saying ... this shouldn't
clock up the rq->nr_balanced counter ... this encourages too much cross-node
stealing. I'll hack that change out and see what it does to the numbers.
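I.e. roughly this in cpus_to_balance() - untested, just to show the
intent (and with whichever of NODE_BALANCE_RATE/RATIO we settle on):

        static inline unsigned long cpus_to_balance(int this_cpu,
                                runqueue_t *this_rq, int idle)
        {
                int this_node = __cpu_to_node(this_cpu);

                /*
                 * Only the busy (timer-driven) path advances the counter,
                 * so idle CPUs stay within their own node instead of
                 * triggering cross-node steals at a huge rate.
                 */
                if (!idle && !(++this_rq->nr_balanced % NODE_BALANCE_RATE))
                        return __node_to_cpu_mask(find_busiest_node(this_node))
                                | __node_to_cpu_mask(this_node);

                return __node_to_cpu_mask(this_node);
        }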
Would appreciate more feedback on the first paragraph. Thanks,
M.
On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> If I understand that correctly (and I'm not sure I do), you're saying
> you don't think the exec time balance should go global? That would break
> most of the concept ... *something* has to distribute stuff around
> nodes, and the exec point is the cheapest time to do that (least
> "weight" to move. [...]
the exec()-time balancing is special, since it only moves the task in
question - so the 'push' should indeed be a global decision. _But_, exec()
is also a natural balancing point for the local node (we potentially just
got rid of a task, which might create imbalance within the node), so it
might make sense to do a 'local' balancing run as well, if the exec()-ing
task was indeed pushed to another node.
> At the moment, the high-freq balancer is only inside a node. Exec
> balancing is global, and the "low-frequency" balancer is global. WRT the
> idle-time balancing, I agree with what I *think* you're saying ... this
> shouldn't clock up the rq->nr_balanced counter ... this encourages too
> much cross-node stealing. I'll hack that change out and see what it does
> to the numbers.
yes, this should also further unify the SMP and NUMA balancing code.
Ingo
Martin, Erich,
could you give the attached patch a go, it implements my
cleanup/reorganization suggestions ontop of 2.5.59:
- decouple the 'slow' rebalancer from the 'fast' rebalancer and attach it
to the scheduler tick. Remove rq->nr_balanced.
- do idle-rebalancing every 1 msec, intra-node rebalancing every 200
msecs and inter-node rebalancing every 400 msecs.
- move the tick-based rebalancing logic into rebalance_tick(), it looks
more organized this way and we have all related code in one spot.
- clean up the topology.h include file structure. Since generic kernel
code uses all the defines already, there's no reason to keep them in
asm-generic.h. I've created a linux/topology.h file that includes
asm/topology.h and takes care of the default and generic definitions.
Moved over a generic topology definition from mmzone.h.
- renamed rq->prev_nr_running[] to rq->prev_cpu_load[] - this further
unifies the SMP and NUMA balancers and is more in line with the
prev_node_load name.
If performance drops due to this patch then the benchmarking goal would be
to tune the following frequencies:
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
#define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
In theory NODE_REBALANCE_TICK could be defined by the NUMA platform,
although in the past such per-platform scheduler tunings used to end up
'orphaned' after some time. 400 msecs is pretty conservative at the
moment, it could be made more frequent if benchmark results support it.
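for reference, the tick-driven logic amounts to something like the
sketch below (paraphrased - the actual rebalance_tick() in the patch may
differ in details, and this assumes load_balance() grows a cpumask
argument):

        /* called from the timer tick for every CPU */
        static void rebalance_tick(runqueue_t *this_rq, int idle)
        {
                int this_cpu = smp_processor_id();
                unsigned long j = jiffies;

                if (idle) {
                        /* idle CPUs: cheap intra-node balancing every 1 msec */
                        if (!(j % IDLE_REBALANCE_TICK))
                                load_balance(this_rq, 1,
                                        __cpu_to_node_mask(this_cpu));
                        return;
                }
                /* busy CPUs: intra-node every 200 msecs ... */
                if (!(j % BUSY_REBALANCE_TICK))
                        load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
                /* ... and cross-node only every 400 msecs */
                if (!(j % NODE_REBALANCE_TICK))
                        load_balance(this_rq, 0, __node_to_cpu_mask(
                                find_busiest_node(__cpu_to_node(this_cpu))));
        }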
the patch compiles and boots on UP and SMP, it compiles on x86-NUMA.
Ingo
--- linux/drivers/base/cpu.c.orig 2003-01-17 10:02:19.000000000 +0100
+++ linux/drivers/base/cpu.c 2003-01-17 10:02:25.000000000 +0100
@@ -6,8 +6,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int cpu_add_device(struct device * dev)
--- linux/drivers/base/node.c.orig 2003-01-17 10:02:50.000000000 +0100
+++ linux/drivers/base/node.c 2003-01-17 10:03:03.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int node_add_device(struct device * dev)
--- linux/drivers/base/memblk.c.orig 2003-01-17 10:02:33.000000000 +0100
+++ linux/drivers/base/memblk.c 2003-01-17 10:02:38.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/memblk.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int memblk_add_device(struct device * dev)
--- linux/include/asm-generic/topology.h.orig 2003-01-17 09:49:38.000000000 +0100
+++ linux/include/asm-generic/topology.h 2003-01-17 10:02:08.000000000 +0100
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <[email protected]>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
- in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu) (0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk) (0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node) (0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node) (0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node) (0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
--- linux/include/asm-ppc64/topology.h.orig 2003-01-17 09:54:46.000000000 +0100
+++ linux/include/asm-ppc64/topology.h 2003-01-17 09:55:18.000000000 +0100
@@ -46,18 +46,6 @@
return mask;
}
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu) (0)
-#define __memblk_to_node(memblk) (0)
-#define __parent_node(nid) (0)
-#define __node_to_first_cpu(node) (0)
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#define __node_to_memblk(node) (0)
-
#endif /* CONFIG_NUMA */
#endif /* _ASM_PPC64_TOPOLOGY_H */
--- linux/include/linux/topology.h.orig 2003-01-17 09:57:20.000000000 +0100
+++ linux/include/linux/topology.h 2003-01-17 10:09:38.000000000 +0100
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu) (0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk) (0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node) (0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node) (0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) (cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node) (cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node) (0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) __node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
--- linux/include/linux/mmzone.h.orig 2003-01-17 09:58:20.000000000 +0100
+++ linux/include/linux/mmzone.h 2003-01-17 10:01:17.000000000 +0100
@@ -255,9 +255,7 @@
#define MAX_NR_MEMBLKS 1
#endif /* CONFIG_NUMA */
-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>
#ifndef CONFIG_DISCONTIGMEM
extern struct pglist_data contig_page_data;
--- linux/include/asm-ia64/topology.h.orig 2003-01-17 09:54:33.000000000 +0100
+++ linux/include/asm-ia64/topology.h 2003-01-17 09:54:38.000000000 +0100
@@ -60,7 +60,4 @@
*/
#define __node_to_memblk(node) (node)
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
#endif /* _ASM_IA64_TOPOLOGY_H */
--- linux/include/asm-i386/topology.h.orig 2003-01-17 09:55:28.000000000 +0100
+++ linux/include/asm-i386/topology.h 2003-01-17 09:56:27.000000000 +0100
@@ -61,17 +61,6 @@
/* Returns the number of the first MemBlk on Node 'node' */
#define __node_to_memblk(node) (node)
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
#endif /* CONFIG_NUMA */
#endif /* _ASM_I386_TOPOLOGY_H */
--- linux/include/asm-i386/cpu.h.orig 2003-01-17 10:03:22.000000000 +0100
+++ linux/include/asm-i386/cpu.h 2003-01-17 10:03:31.000000000 +0100
@@ -3,8 +3,8 @@
#include <linux/device.h>
#include <linux/cpu.h>
+#include <linux/topology.h>
-#include <asm/topology.h>
#include <asm/node.h>
struct i386_cpu {
--- linux/include/asm-i386/node.h.orig 2003-01-17 10:04:02.000000000 +0100
+++ linux/include/asm-i386/node.h 2003-01-17 10:04:08.000000000 +0100
@@ -4,8 +4,7 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
struct i386_node {
struct node node;
--- linux/include/asm-i386/memblk.h.orig 2003-01-17 10:03:51.000000000 +0100
+++ linux/include/asm-i386/memblk.h 2003-01-17 10:03:56.000000000 +0100
@@ -4,8 +4,8 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/memblk.h>
+#include <linux/topology.h>
-#include <asm/topology.h>
#include <asm/node.h>
struct i386_memblk {
--- linux/kernel/sched.c.orig 2003-01-17 09:22:24.000000000 +0100
+++ linux/kernel/sched.c 2003-01-17 10:29:53.000000000 +0100
@@ -153,10 +153,9 @@
nr_uninterruptible;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
- int prev_nr_running[NR_CPUS];
+ int prev_cpu_load[NR_CPUS];
#ifdef CONFIG_NUMA
atomic_t *node_nr_running;
- unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
task_t *migration_thread;
@@ -765,29 +764,6 @@
return node;
}
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- int this_node = __cpu_to_node(this_cpu);
- /*
- * Avoid rebalancing between nodes too often.
- * We rebalance globally once every NODE_BALANCE_RATE load balances.
- */
- if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
- int node = find_busiest_node(this_node);
- this_rq->nr_balanced = 0;
- if (node >= 0)
- return (__node_to_cpu_mask(node) | (1UL << this_cpu));
- }
- return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- return cpu_online_map;
-}
-
#endif /* CONFIG_NUMA */
#if CONFIG_SMP
@@ -807,10 +783,10 @@
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
/* Need to recalculate nr_running */
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
} else
spin_lock(&busiest->lock);
}
@@ -847,10 +823,10 @@
* that case we are less picky about moving a task across CPUs and
* take what can be taken.
*/
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
busiest = NULL;
max_load = 1;
@@ -859,11 +835,11 @@
continue;
rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
load = rq_src->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
+ load = this_rq->prev_cpu_load[i];
+ this_rq->prev_cpu_load[i] = rq_src->nr_running;
if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
@@ -922,7 +898,7 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
{
int imbalance, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
@@ -930,8 +906,7 @@
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
- cpus_to_balance(this_cpu, this_rq));
+ busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
goto out;
@@ -1006,21 +981,61 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
-static inline void idle_tick(runqueue_t *rq)
+static void rebalance_tick(runqueue_t *this_rq, int idle)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ int this_cpu = smp_processor_id();
+ unsigned long j = jiffies, cpumask, this_cpumask = 1UL << this_cpu;
+
+ if (idle) {
+ if (!(j % IDLE_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 1, this_cpumask);
+ spin_unlock(&this_rq->lock);
+ }
return;
- spin_lock(&rq->lock);
- load_balance(rq, 1);
- spin_unlock(&rq->lock);
+ }
+ /*
+ * First do inter-node rebalancing, then intra-node rebalancing,
+ * if both events happen in the same tick. The inter-node
+ * rebalancing does not necessarily have to create a perfect
+ * balance within the node, since we load-balance the most loaded
+ * node with the current CPU. (ie. other CPUs in the local node
+ * are not balanced.)
+ */
+#if CONFIG_NUMA
+ if (!(j % NODE_REBALANCE_TICK)) {
+ int node = find_busiest_node(__cpu_to_node(this_cpu));
+ if (node >= 0) {
+ cpumask = __node_to_cpu_mask(node) | this_cpumask;
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 1, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
+ }
+#endif
+ if (!(j % BUSY_REBALANCE_TICK)) {
+ cpumask = __cpu_to_node_mask(this_cpu);
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 1, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
}
-
#endif
DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1078,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ rebalance_tick(rq, 1);
return;
}
if (TASK_NICE(p) > 0)
@@ -1121,11 +1134,8 @@
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
spin_unlock(&rq->lock);
+ rebalance_tick(rq, 0);
}
void scheduling_functions_start_here(void) { }
@@ -1184,7 +1194,7 @@
pick_next_task:
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
if (rq->nr_running)
goto pick_next_task;
#endif
--- linux/mm/page_alloc.c.orig 2003-01-17 10:01:29.000000000 +0100
+++ linux/mm/page_alloc.c 2003-01-17 10:01:35.000000000 +0100
@@ -28,8 +28,7 @@
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
--- linux/mm/vmscan.c.orig 2003-01-17 10:01:44.000000000 +0100
+++ linux/mm/vmscan.c 2003-01-17 10:01:52.000000000 +0100
@@ -27,10 +27,10 @@
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/rmap-locking.h>
+#include <linux/topology.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
-#include <asm/topology.h>
#include <asm/div64.h>
#include <linux/swapops.h>
Hi Ingo,
On Thursday 16 January 2003 21:19, Ingo Molnar wrote:
> On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> > > complex. It's the one that is aware of the global scheduling picture.
> > > For NUMA i'd suggest two asynchronous frequencies: one intra-node
> > > frequency, and an inter-node frequency - configured by the architecture
> > > and roughly in the same proportion to each other as cachemiss
> > > latencies.
> >
> > That's exactly what's in the latest set of patches - admittedly it's a
> > multiplier of when we run load_balance, not the tick multiplier, but
> > that's very easy to fix. Can you check out the stuff I posted last
> > night? I think it's somewhat cleaner ...
>
> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing.
I prefer a single point of entry called load_balance() to multiple
functionally different balancers. The reason is that the latter choice
might lead to balancers competing or working against each other. Not
now, but the design could invite such developments later.
But the other main reason for calling the cross-node balancer only
after NODE_BALANCE_RATE calls to the intra-node balancer (what you
call synchronous balancing) is performance:
Davide Libenzi showed quite a while ago that one benefits a lot if
idle CPUs stay idle for only a rather short time. IIRC, his conclusion for
the multi-queue scheduler was that an order of magnitude of 10ms is
long enough, below you start feeling the balancing overhead, above you
waste useful cycles.
On a NUMA system this is even more important: the longer you leave
fresh tasks on an overloaded node, the more probable it is that they
allocate their memory there. And then they will run with poor
performance on the node which stayed idle for 200-400ms before
stealing them. So one wastes 200-400ms on each CPU of the idle node
and in the end gets tasks which perform poorly anyway. If the tasks
are "old", at least we didn't waste too much time being idle. The
long-term target should be that tasks remember where their
memory is and return to that node.
> The inter-node
> balancing (which is heavier than even the global SMP balancer), should
> never be triggered from the high-frequency path.
Hmmm, we made it really slim. Actually the cross-node balancing might
even be cheaper than the global SMP balancer:
- it first loops over the nodes (loop length 4 on a 16 CPU NUMA-Q & Azusa)
- then it loops over the cpumask of the most loaded node + the current
CPU (loop length 5 on a NUMA-Q & Azusa).
This has to be compared with the loop length of 16 when doing the
global SMP rebalance. The additional work done for averaging is
minimal. The more nodes, the cheaper the NUMA cross-node balancing
compared to the global SMP balancing.
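(For example, on a hypothetical 8-node, 32-CPU machine this would be
8 node iterations plus 5 runqueue iterations, i.e. 13, versus 32 for
a global SMP scan.)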
Besides: the CPU is idle anyway! So who cares whether it just
unsuccessfully scans its own empty node or looks at the other nodes
from time to time? It does this lockless and doesn't modify any
variables in other runqueues, so doesn't create cache misses for other
CPUs.
> [whether it's high
> frequency or not depends on the actual workload, but it can be potentially
> _very_ high frequency, easily on the order of 1 million times a second -
> then you'll call the inter-node balancer 100K times a second.]
You mean because cpu_idle() loops over schedule()? The code is:
	while (1) {
		void (*idle)(void) = pm_idle;
		if (!idle)
			idle = default_idle;
		irq_stat[smp_processor_id()].idle_timestamp = jiffies;
		while (!need_resched())
			idle();
		schedule();
	}
So if the CPU is idle, it won't go through schedule(), except we get
an interrupt from time to time... And then it doesn't really
matter. Or do you want to keep idle CPUs free for serving interrupts?
That could be legitimate, but it is not the typical load I had in
mind, and it is an issue not related to the NUMA scheduler. But maybe
you have something else in mind that I haven't considered yet.
Under normal conditions the rebalancing I thought about would work the
following way:
Busy CPU:
- intra-node rebalance every 200ms (interval timer controlled)
- cross-node rebalance every NODE_BALANCE_RATE*200ms (2s)
- when about to go idle, rebalance internally or across nodes, 10
times more often within the node
Idle CPU:
- intra-node rebalance every 1ms
- cross-node rebalance every NODE_BALANCE_RATE * 1ms (10ms)
This doesn't seem too frequent to me... after all, the CPU is idle
and couldn't steal anything from its own node. I don't insist too much
on this design, but I can't see any serious reasons against it. Of
course, performance should decide.
I'm about to test the two versions under discussion on an NEC Asama
(a small configuration with 4 nodes, a good memory latency ratio
between nodes (1.6), and no node-level cache).
Best regards,
Erich
On Thursday 16 January 2003 21:19, Ingo Molnar wrote:
> On Thu, 16 Jan 2003, Martin J. Bligh wrote:
> > > complex. It's the one that is aware of the global scheduling picture.
> > > For NUMA i'd suggest two asynchronous frequencies: one intra-node
> > > frequency, and an inter-node frequency - configured by the architecture
> > > and roughly in the same proportion to each other as cachemiss
> > > latencies.
> >
> > That's exactly what's in the latest set of patches - admittedly it's a
> > multiplier of when we run load_balance, not the tick multiplier, but
> > that's very easy to fix. Can you check out the stuff I posted last
> > night? I think it's somewhat cleaner ...
>
> yes, i saw it, it has the same tying between idle-CPU-rebalance and
> inter-node rebalance, as Erich's patch. You've put it into
> cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
> coupled balancing act. There are two synchronous balancing acts currently:
> the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
> must involve any 'heavy' balancing, only local balancing. The inter-node
> balancing (which is heavier than even the global SMP balancer), should
> never be triggered from the high-frequency path. [whether it's high
> frequency or not depends on the actual workload, but it can be potentially
> _very_ high frequency, easily on the order of 1 million times a second -
> then you'll call the inter-node balancer 100K times a second.]
>
> I'd strongly suggest to decouple the heavy NUMA load-balancing code from
> the fastpath and re-check the benchmark numbers.
>
> Ingo
>
> (*) whether sched_balance_exec() is a high-frequency path or not is up to
> debate. Right now it's not possible to get much more than a couple of
> thousand exec()'s per second on fast CPUs. Hopefully that will change in
> the future though, so exec() events could become really fast. So i'd
> suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
> only do NUMA cross-node balancing with a fixed frequency, from the timer
> tick. But exec()-time is really special, since the user task usually has
> zero cached state at this point, so we _can_ do cheap cross-node balancing
> as well. So it's a boundary thing - probably doing the full-blown
> balancing is the right thing.
On Fri, 17 Jan 2003, Erich Focht wrote:
> I prefer a single point of entry called load_balance() to multiple
> functionally different balancers. [...]
agreed - my cleanup patch keeps that property.
> [...] IIRC, his conclusion for the multi-queue scheduler was that an
> order of magnitude of 10ms is long enough, below you start feeling the
> balancing overhead, above you waste useful cycles.
this is one reason why we do the idle rebalancing at 1 msec granularity
right now.
> On a NUMA system this is even more important: the longer you leave fresh
> tasks on an overloaded node, the more probable it is that they allocate
> their memory there. And then they will run with poor performance on the
> node which stayed idle for 200-400ms before stealing them. So one wastes
> 200-400ms on each CPU of the idle node and in the end gets tasks which
> perform poorly anyway. If the tasks are "old", at least we didn't waste
> too much time being idle. The long-term target should be that tasks
> remember where their memory is and return to that node.
i'd much rather vote for fork() and exec() time 'pre-balancing' and then
making it quite hard to move a task across nodes.
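(roughly along the lines of the sketch below - this only illustrates
the idea, it is not code from any of the patches, and
find_least_loaded_node() is a made-up helper:)

	/*
	 * Illustration only: at exec() time the task has almost no cached
	 * state, so pick the least loaded node and restrict the new image
	 * to that node's CPUs before it starts faulting in memory.
	 */
	static void exec_prebalance_sketch(task_t *p)
	{
		int node = find_least_loaded_node();	/* made-up helper */

		set_cpus_allowed(p, __node_to_cpu_mask(node));
	}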
> > The inter-node balancing (which is heavier than even the global SMP
> > balancer), should never be triggered from the high-frequency path.
>
> Hmmm, we made it really slim. [...]
this is a misunderstanding. I'm not worried about the algorithmic
overhead _at all_, i'm worried about the effect of too frequent
balancing - tasks being moved between runqueues too often. That has
been shown to be a problem on SMP. The prev_load type of statistic
relies on a constant sampling frequency - it can lead to
over-balancing if it's called too often.
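(to make this concrete - the snippet below is just the prev_cpu_load
sampling from the find_busiest_queue() hunk in the patches above, with
the effect spelled out in comments:)

	/*
	 * Busy case: a remote queue is counted with the smaller of its
	 * current nr_running and the previously sampled value, so it has
	 * to look loaded across two consecutive samples before we pull
	 * from it.
	 */
	if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
		load = rq_src->nr_running;
	else
		load = this_rq->prev_cpu_load[i];
	this_rq->prev_cpu_load[i] = rq_src->nr_running;
	/*
	 * If load_balance() is called much more often than the tick, the
	 * sample above is refreshed so quickly that the filter no longer
	 * suppresses transient spikes - hence the over-balancing.
	 */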
> So if the CPU is idle, it won't go through schedule(), except we get an
> interrupt from time to time... [...]
(no, it's even better than that, we never leave the idle loop except when
we _know_ that there is scheduling work to be done. Hence the
need_resched() test. But i'm not worried about balancing overhead at all.)
Ingo
Hi Ingo,
I tried your patch on the small NEC TX7 I can access easily (8
Itanium2 CPUs on 4 nodes). I actually used your patch on top of the
2.5.52-ia64 kernel, but that shouldn't matter.
I like the cleanup of the topology.h. Also the renaming to
prev_cpu_load. There was a mistake (I think) in the call to
load_balance() in the idle path, guess you wanted to have:
+ load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
instead of
+ load_balance(this_rq, 1, this_cpumask);
otherwise you won't load balance at all for idle cpus.
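The idle branch of rebalance_tick() would then look roughly like this
(a sketch only, with this_node = __cpu_to_node(this_cpu)):

	if (idle) {
		if (!(j % IDLE_REBALANCE_TICK)) {
			spin_lock(&this_rq->lock);
			load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
			spin_unlock(&this_rq->lock);
		}
		return;
	}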
Here are the results:
kernbench (average of 5 kernel compiles) (standard error in brackets)
---------
Elapsed UserTime SysTime
orig 134.43(1.79) 944.79(0.43) 21.41(0.28)
ingo 136.74(1.58) 951.55(0.73) 21.16(0.32)
ingofix 135.22(0.59) 952.17(0.78) 21.16(0.19)
hackbench (chat benchmark alike) (elapsed time for N groups of 20
--------- senders & receivers, stats from 10 measurements)
N=10 N=25 N=50 N=100
orig 0.77(0.03) 1.91(0.06) 3.77(0.06) 7.78(0.21)
ingo 1.70(0.35) 3.11(0.47) 4.85(0.55) 8.80(0.98)
ingofix 1.16(0.14) 2.67(0.53) 5.05(0.26) 9.99(0.13)
numabench (N memory intensive tasks running in parallel, disturbed for
--------- a short time by a "hackbench 10" call)
numa_test N=4 ElapsedTime TotalUserTime TotalSysTime
orig: 26.13(2.54) 86.10(4.47) 0.09(0.01)
ingo: 27.60(2.16) 88.06(4.58) 0.11(0.01)
ingofix: 25.51(3.05) 83.55(2.78) 0.10(0.01)
numa_test N=8 ElapsedTime TotalUserTime TotalSysTime
orig: 24.81(2.71) 164.94(4.82) 0.17(0.01)
ingo: 27.38(3.01) 170.06(5.60) 0.30(0.03)
ingofix: 29.08(2.79) 172.10(4.48) 0.32(0.03)
numa_test N=16 ElapsedTime TotalUserTime TotalSysTime
orig: 45.19(3.42) 332.07(5.89) 0.32(0.01)
ingo: 50.18(0.38) 359.46(9.31) 0.46(0.04)
ingofix: 50.30(0.42) 357.38(9.12) 0.46(0.01)
numa_test N=32 ElapsedTime TotalUserTime TotalSysTime
orig: 86.84(1.83) 671.99(9.98) 0.65(0.02)
ingo: 93.44(2.13) 704.90(16.91) 0.82(0.06)
ingofix: 93.92(1.28) 727.58(9.26) 0.77(0.03)
From these results I would prefer to either leave the numa scheduler
as it is or to introduce an IDLE_NODEBALANCE_TICK and a
BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
which balances very rarely. In that case it would make sense to keep
the cpus_to_balance() function.
Best regards,
Erich
On Friday 17 January 2003 09:47, Ingo Molnar wrote:
> Martin, Erich,
>
> could you give the attached patch a go, it implements my
> cleanup/reorganization suggestions ontop of 2.5.59:
>
> - decouple the 'slow' rebalancer from the 'fast' rebalancer and attach it
> to the scheduler tick. Remove rq->nr_balanced.
>
> - do idle-rebalancing every 1 msec, intra-node rebalancing every 200
> msecs and inter-node rebalancing every 400 msecs.
>
> - move the tick-based rebalancing logic into rebalance_tick(), it looks
> more organized this way and we have all related code in one spot.
>
> - clean up the topology.h include file structure. Since generic kernel
> code uses all the defines already, there's no reason to keep them in
> asm-generic.h. I've created a linux/topology.h file that includes
> asm/topology.h and takes care of the default and generic definitions.
> Moved over a generic topology definition from mmzone.h.
>
> - renamed rq->prev_nr_running[] to rq->prev_cpu_load[] - this further
> unifies the SMP and NUMA balancers and is more in line with the
> prev_node_load name.
>
> If performance drops due to this patch then the benchmarking goal would be
> to tune the following frequencies:
>
> #define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
> #define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
> #define NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
>
> In theory NODE_REBALANCE_TICK could be defined by the NUMA platform,
> although in the past such per-platform scheduler tunings used to end
> 'orphaned' after some time. 400 msecs is pretty conservative at the
> moment, it could be made more frequent if benchmark results support it.
>
> the patch compiles and boots on UP and SMP, it compiles on x86-NUMA.
>
> Ingo
On Fri, 17 Jan 2003, Erich Focht wrote:
> I like the cleanup of the topology.h. Also the renaming to
> prev_cpu_load. There was a mistake (I think) in the call to
> load_balance() in the idle path, guess you wanted to have:
> + load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
> instead of
> + load_balance(this_rq, 1, this_cpumask);
> otherwise you won't load balance at all for idle cpus.
indeed - there was another bug as well, the 'idle' parameter to
load_balance() was 1 even in the busy branch, causing too slow balancing.
> From these results I would prefer to either leave the numa scheduler as
> it is or to introduce an IDLE_NODEBALANCE_TICK and a
> BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
> which balances very rarely.
agreed, i've attached the -B0 patch that does this. The balancing rates
are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
busy-global).
Ingo
--- linux/drivers/base/cpu.c.orig 2003-01-17 10:02:19.000000000 +0100
+++ linux/drivers/base/cpu.c 2003-01-17 10:02:25.000000000 +0100
@@ -6,8 +6,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int cpu_add_device(struct device * dev)
--- linux/drivers/base/node.c.orig 2003-01-17 10:02:50.000000000 +0100
+++ linux/drivers/base/node.c 2003-01-17 10:03:03.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int node_add_device(struct device * dev)
--- linux/drivers/base/memblk.c.orig 2003-01-17 10:02:33.000000000 +0100
+++ linux/drivers/base/memblk.c 2003-01-17 10:02:38.000000000 +0100
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/memblk.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int memblk_add_device(struct device * dev)
--- linux/include/asm-generic/topology.h.orig 2003-01-17 09:49:38.000000000 +0100
+++ linux/include/asm-generic/topology.h 2003-01-17 10:02:08.000000000 +0100
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <[email protected]>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
- in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu) (0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk) (0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node) (0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node) (0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node) (0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
--- linux/include/asm-ppc64/topology.h.orig 2003-01-17 09:54:46.000000000 +0100
+++ linux/include/asm-ppc64/topology.h 2003-01-17 09:55:18.000000000 +0100
@@ -46,18 +46,6 @@
return mask;
}
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu) (0)
-#define __memblk_to_node(memblk) (0)
-#define __parent_node(nid) (0)
-#define __node_to_first_cpu(node) (0)
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#define __node_to_memblk(node) (0)
-
#endif /* CONFIG_NUMA */
#endif /* _ASM_PPC64_TOPOLOGY_H */
--- linux/include/linux/topology.h.orig 2003-01-17 09:57:20.000000000 +0100
+++ linux/include/linux/topology.h 2003-01-17 10:09:38.000000000 +0100
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu) (0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk) (0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node) (0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node) (0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) (cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node) (cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node) (0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) __node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
--- linux/include/linux/mmzone.h.orig 2003-01-17 09:58:20.000000000 +0100
+++ linux/include/linux/mmzone.h 2003-01-17 10:01:17.000000000 +0100
@@ -255,9 +255,7 @@
#define MAX_NR_MEMBLKS 1
#endif /* CONFIG_NUMA */
-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>
#ifndef CONFIG_DISCONTIGMEM
extern struct pglist_data contig_page_data;
--- linux/include/asm-ia64/topology.h.orig 2003-01-17 09:54:33.000000000 +0100
+++ linux/include/asm-ia64/topology.h 2003-01-17 09:54:38.000000000 +0100
@@ -60,7 +60,4 @@
*/
#define __node_to_memblk(node) (node)
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
#endif /* _ASM_IA64_TOPOLOGY_H */
--- linux/include/asm-i386/topology.h.orig 2003-01-17 09:55:28.000000000 +0100
+++ linux/include/asm-i386/topology.h 2003-01-17 09:56:27.000000000 +0100
@@ -61,17 +61,6 @@
/* Returns the number of the first MemBlk on Node 'node' */
#define __node_to_memblk(node) (node)
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
#endif /* CONFIG_NUMA */
#endif /* _ASM_I386_TOPOLOGY_H */
--- linux/include/asm-i386/cpu.h.orig 2003-01-17 10:03:22.000000000 +0100
+++ linux/include/asm-i386/cpu.h 2003-01-17 10:03:31.000000000 +0100
@@ -3,8 +3,8 @@
#include <linux/device.h>
#include <linux/cpu.h>
+#include <linux/topology.h>
-#include <asm/topology.h>
#include <asm/node.h>
struct i386_cpu {
--- linux/include/asm-i386/node.h.orig 2003-01-17 10:04:02.000000000 +0100
+++ linux/include/asm-i386/node.h 2003-01-17 10:04:08.000000000 +0100
@@ -4,8 +4,7 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
struct i386_node {
struct node node;
--- linux/include/asm-i386/memblk.h.orig 2003-01-17 10:03:51.000000000 +0100
+++ linux/include/asm-i386/memblk.h 2003-01-17 10:03:56.000000000 +0100
@@ -4,8 +4,8 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/memblk.h>
+#include <linux/topology.h>
-#include <asm/topology.h>
#include <asm/node.h>
struct i386_memblk {
--- linux/kernel/sched.c.orig 2003-01-17 09:22:24.000000000 +0100
+++ linux/kernel/sched.c 2003-01-17 17:01:47.000000000 +0100
@@ -153,10 +153,9 @@
nr_uninterruptible;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
- int prev_nr_running[NR_CPUS];
+ int prev_cpu_load[NR_CPUS];
#ifdef CONFIG_NUMA
atomic_t *node_nr_running;
- unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
task_t *migration_thread;
@@ -765,29 +764,6 @@
return node;
}
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- int this_node = __cpu_to_node(this_cpu);
- /*
- * Avoid rebalancing between nodes too often.
- * We rebalance globally once every NODE_BALANCE_RATE load balances.
- */
- if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
- int node = find_busiest_node(this_node);
- this_rq->nr_balanced = 0;
- if (node >= 0)
- return (__node_to_cpu_mask(node) | (1UL << this_cpu));
- }
- return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- return cpu_online_map;
-}
-
#endif /* CONFIG_NUMA */
#if CONFIG_SMP
@@ -807,10 +783,10 @@
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
/* Need to recalculate nr_running */
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
} else
spin_lock(&busiest->lock);
}
@@ -847,10 +823,10 @@
* that case we are less picky about moving a task across CPUs and
* take what can be taken.
*/
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
busiest = NULL;
max_load = 1;
@@ -859,11 +835,11 @@
continue;
rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
load = rq_src->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
+ load = this_rq->prev_cpu_load[i];
+ this_rq->prev_cpu_load[i] = rq_src->nr_running;
if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
@@ -922,7 +898,7 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
{
int imbalance, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
@@ -930,8 +906,7 @@
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
- cpus_to_balance(this_cpu, this_rq));
+ busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
goto out;
@@ -1006,21 +981,75 @@
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
-static inline void idle_tick(runqueue_t *rq)
+#if CONFIG_NUMA
+static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
{
- if (jiffies % IDLE_REBALANCE_TICK)
- return;
- spin_lock(&rq->lock);
- load_balance(rq, 1);
- spin_unlock(&rq->lock);
+ int node = find_busiest_node(__cpu_to_node(this_cpu));
+ unsigned long cpumask, this_cpumask = 1UL << this_cpu;
+
+ if (node >= 0) {
+ cpumask = __node_to_cpu_mask(node) | this_cpumask;
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
}
+#endif
+static void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+#if CONFIG_NUMA
+ int this_cpu = smp_processor_id();
+#endif
+ unsigned long j = jiffies;
+
+ /*
+ * First do inter-node rebalancing, then intra-node rebalancing,
+ * if both events happen in the same tick. The inter-node
+ * rebalancing does not necessarily have to create a perfect
+ * balance within the node, since we load-balance the most loaded
+ * node with the current CPU. (ie. other CPUs in the local node
+ * are not balanced.)
+ */
+ if (idle) {
+#if CONFIG_NUMA
+ if (!(j % IDLE_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % IDLE_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+ return;
+ }
+#if CONFIG_NUMA
+ if (!(j % BUSY_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % BUSY_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+}
#endif
DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1092,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ rebalance_tick(rq, 1);
return;
}
if (TASK_NICE(p) > 0)
@@ -1121,11 +1148,8 @@
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
spin_unlock(&rq->lock);
+ rebalance_tick(rq, 0);
}
void scheduling_functions_start_here(void) { }
@@ -1184,7 +1208,7 @@
pick_next_task:
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
if (rq->nr_running)
goto pick_next_task;
#endif
--- linux/mm/page_alloc.c.orig 2003-01-17 10:01:29.000000000 +0100
+++ linux/mm/page_alloc.c 2003-01-17 10:01:35.000000000 +0100
@@ -28,8 +28,7 @@
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
--- linux/mm/vmscan.c.orig 2003-01-17 10:01:44.000000000 +0100
+++ linux/mm/vmscan.c 2003-01-17 10:01:52.000000000 +0100
@@ -27,10 +27,10 @@
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/rmap-locking.h>
+#include <linux/topology.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
-#include <asm/topology.h>
#include <asm/div64.h>
#include <linux/swapops.h>
On Friday 17 January 2003 16:11, Ingo Molnar wrote:
> On Fri, 17 Jan 2003, Erich Focht wrote:
> > I like the cleanup of the topology.h. Also the renaming to
> > prev_cpu_load. There was a mistake (I think) in the call to
> > load_balance() in the idle path, guess you wanted to have:
> > + load_balance(this_rq, 1, __node_to_cpu_mask(this_node));
> > instead of
> > + load_balance(this_rq, 1, this_cpumask);
> > otherwise you won't load balance at all for idle cpus.
>
> indeed - there was another bug as well, the 'idle' parameter to
> load_balance() was 1 even in the busy branch, causing too slow balancing.
I didn't see that, but its impact is only that a busy CPU steals at
most one task from another node; otherwise, the idle=1 leads to more
aggressive balancing.
> > From these results I would prefer to either leave the numa scheduler as
> > it is or to introduce an IDLE_NODEBALANCE_TICK and a
> > BUSY_NODEBALANCE_TICK instead of just having one NODE_REBALANCE_TICK
> > which balances very rarely.
>
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).
This looks good! I'll see if I can rerun the tests today; in any
case, I'm more optimistic about this version.
Regards,
Erich
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).
Hmmm... something is drastically wrong here, looks like we're thrashing
tasks between nodes?
Kernbench:
Elapsed User System CPU
2.5.59 20.032s 186.66s 47.73s 1170%
2.5.59-ingo2 23.578s 198.874s 90.648s 1227.4%
NUMA schedbench 4:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 36.38 90.70 0.62
2.5.59-ingo2 0.00 47.62 127.13 1.89
NUMA schedbench 8:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 42.78 249.77 1.85
2.5.59-ingo2 0.00 59.45 358.31 5.23
NUMA schedbench 16:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 56.84 848.00 2.78
2.5.59-ingo2 0.00 114.70 1430.95 21.21
diffprofile:
44770 total
27840 do_anonymous_page
3180 buffered_rmqueue
1814 default_idle
1534 __copy_from_user_ll
1465 free_hot_cold_page
1172 __copy_to_user_ll
919 page_remove_rmap
881 zap_pte_range
730 do_wp_page
687 do_no_page
601 __alloc_pages
527 vm_enough_memory
432 __set_page_dirty_buffers
426 page_add_rmap
322 release_pages
311 __pagevec_lru_add_active
233 prep_new_page
202 clear_page_tables
181 schedule
132 current_kernel_time
127 __block_prepare_write
103 kmap_atomic
100 __wake_up
97 pte_alloc_one
95 may_open
87 find_get_page
76 dget_locked
72 bad_range
69 pgd_alloc
62 __fput
50 copy_strings
...
-66 open_namei
-68 path_lookup
-115 .text.lock.file_table
-255 .text.lock.dec_and_lock
-469 .text.lock.namei
> I like the cleanup of the topology.h.
And the rest of Ingo's second version:
diff -urpN -X /home/fletch/.diff.exclude ingo-A/kernel/sched.c ingo-B/kernel/sched.c
--- ingo-A/kernel/sched.c Fri Jan 17 09:18:32 2003
+++ ingo-B/kernel/sched.c Fri Jan 17 09:19:42 2003
@@ -153,10 +153,9 @@ struct runqueue {
nr_uninterruptible;
task_t *curr, *idle;
prio_array_t *active, *expired, arrays[2];
- int prev_nr_running[NR_CPUS];
+ int prev_cpu_load[NR_CPUS];
#ifdef CONFIG_NUMA
atomic_t *node_nr_running;
- unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
task_t *migration_thread;
@@ -765,29 +764,6 @@ static int find_busiest_node(int this_no
return node;
}
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- int this_node = __cpu_to_node(this_cpu);
- /*
- * Avoid rebalancing between nodes too often.
- * We rebalance globally once every NODE_BALANCE_RATE load balances.
- */
- if (++(this_rq->nr_balanced) == NODE_BALANCE_RATE) {
- int node = find_busiest_node(this_node);
- this_rq->nr_balanced = 0;
- if (node >= 0)
- return (__node_to_cpu_mask(node) | (1UL << this_cpu));
- }
- return __node_to_cpu_mask(this_node);
-}
-
-#else /* !CONFIG_NUMA */
-
-static inline unsigned long cpus_to_balance(int this_cpu, runqueue_t *this_rq)
-{
- return cpu_online_map;
-}
-
#endif /* CONFIG_NUMA */
#if CONFIG_SMP
@@ -807,10 +783,10 @@ static inline unsigned int double_lock_b
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
/* Need to recalculate nr_running */
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
} else
spin_lock(&busiest->lock);
}
@@ -847,10 +823,10 @@ static inline runqueue_t *find_busiest_q
* that case we are less picky about moving a task across CPUs and
* take what can be taken.
*/
- if (idle || (this_rq->nr_running > this_rq->prev_nr_running[this_cpu]))
+ if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
nr_running = this_rq->nr_running;
else
- nr_running = this_rq->prev_nr_running[this_cpu];
+ nr_running = this_rq->prev_cpu_load[this_cpu];
busiest = NULL;
max_load = 1;
@@ -859,11 +835,11 @@ static inline runqueue_t *find_busiest_q
continue;
rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_nr_running[i]))
+ if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
load = rq_src->nr_running;
else
- load = this_rq->prev_nr_running[i];
- this_rq->prev_nr_running[i] = rq_src->nr_running;
+ load = this_rq->prev_cpu_load[i];
+ this_rq->prev_cpu_load[i] = rq_src->nr_running;
if ((load > max_load) && (rq_src != this_rq)) {
busiest = rq_src;
@@ -922,7 +898,7 @@ static inline void pull_task(runqueue_t
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int idle, unsigned long cpumask)
{
int imbalance, idx, this_cpu = smp_processor_id();
runqueue_t *busiest;
@@ -930,8 +906,7 @@ static void load_balance(runqueue_t *thi
struct list_head *head, *curr;
task_t *tmp;
- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance,
- cpus_to_balance(this_cpu, this_rq));
+ busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
if (!busiest)
goto out;
@@ -1006,21 +981,75 @@ out:
* frequency and balancing agressivity depends on whether the CPU is
* idle or not.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * busy-rebalance every 200 msecs. idle-rebalance every 1 msec. (or on
* systems with HZ=100, every 10 msecs.)
+ *
+ * On NUMA, do a node-rebalance every 400 msecs.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
-static inline void idle_tick(runqueue_t *rq)
+#if CONFIG_NUMA
+static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
{
- if (jiffies % IDLE_REBALANCE_TICK)
- return;
- spin_lock(&rq->lock);
- load_balance(rq, 1);
- spin_unlock(&rq->lock);
+ int node = find_busiest_node(__cpu_to_node(this_cpu));
+ unsigned long cpumask, this_cpumask = 1UL << this_cpu;
+
+ if (node >= 0) {
+ cpumask = __node_to_cpu_mask(node) | this_cpumask;
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, cpumask);
+ spin_unlock(&this_rq->lock);
+ }
}
+#endif
+static void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+#if CONFIG_NUMA
+ int this_cpu = smp_processor_id();
+#endif
+ unsigned long j = jiffies;
+
+ /*
+ * First do inter-node rebalancing, then intra-node rebalancing,
+ * if both events happen in the same tick. The inter-node
+ * rebalancing does not necessarily have to create a perfect
+ * balance within the node, since we load-balance the most loaded
+ * node with the current CPU. (ie. other CPUs in the local node
+ * are not balanced.)
+ */
+ if (idle) {
+#if CONFIG_NUMA
+ if (!(j % IDLE_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % IDLE_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, 0, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+ return;
+ }
+#if CONFIG_NUMA
+ if (!(j % BUSY_NODE_REBALANCE_TICK))
+ balance_node(this_rq, idle, this_cpu);
+#endif
+ if (!(j % BUSY_REBALANCE_TICK)) {
+ spin_lock(&this_rq->lock);
+ load_balance(this_rq, idle, __cpu_to_node_mask(this_cpu));
+ spin_unlock(&this_rq->lock);
+ }
+}
+#else
+/*
+ * on UP we do not need to balance between CPUs:
+ */
+static inline void rebalance_tick(runqueue_t *this_rq, int idle)
+{
+}
#endif
DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
@@ -1063,9 +1092,7 @@ void scheduler_tick(int user_ticks, int
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ rebalance_tick(rq, 1);
return;
}
if (TASK_NICE(p) > 0)
@@ -1121,11 +1148,8 @@ void scheduler_tick(int user_ticks, int
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
spin_unlock(&rq->lock);
+ rebalance_tick(rq, 0);
}
void scheduling_functions_start_here(void) { }
@@ -1184,7 +1208,7 @@ need_resched:
pick_next_task:
if (unlikely(!rq->nr_running)) {
#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, 1, __cpu_to_node_mask(smp_processor_id()));
if (rq->nr_running)
goto pick_next_task;
#endif
> I like the cleanup of the topology.h.
Any chance we could keep that broken out as a separate patch?
Topo cleanups below:
diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/cpu.c ingo-A/drivers/base/cpu.c
--- virgin/drivers/base/cpu.c Sun Dec 1 09:59:47 2002
+++ ingo-A/drivers/base/cpu.c Fri Jan 17 09:19:23 2003
@@ -6,8 +6,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/cpu.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int cpu_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/memblk.c ingo-A/drivers/base/memblk.c
--- virgin/drivers/base/memblk.c Mon Dec 16 21:50:42 2002
+++ ingo-A/drivers/base/memblk.c Fri Jan 17 09:19:23 2003
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/memblk.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int memblk_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/drivers/base/node.c ingo-A/drivers/base/node.c
--- virgin/drivers/base/node.c Fri Jan 17 09:18:26 2003
+++ ingo-A/drivers/base/node.c Fri Jan 17 09:19:23 2003
@@ -7,8 +7,7 @@
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
static int node_add_device(struct device * dev)
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-generic/topology.h ingo-A/include/asm-generic/topology.h
--- virgin/include/asm-generic/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-generic/topology.h Fri Jan 17 09:19:23 2003
@@ -1,56 +0,0 @@
-/*
- * linux/include/asm-generic/topology.h
- *
- * Written by: Matthew Dobson, IBM Corporation
- *
- * Copyright (C) 2002, IBM Corp.
- *
- * All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, GOOD TITLE or
- * NON INFRINGEMENT. See the GNU General Public License for more
- * details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
- *
- * Send feedback to <[email protected]>
- */
-#ifndef _ASM_GENERIC_TOPOLOGY_H
-#define _ASM_GENERIC_TOPOLOGY_H
-
-/* Other architectures wishing to use this simple topology API should fill
- in the below functions as appropriate in their own <asm/topology.h> file. */
-#ifndef __cpu_to_node
-#define __cpu_to_node(cpu) (0)
-#endif
-#ifndef __memblk_to_node
-#define __memblk_to_node(memblk) (0)
-#endif
-#ifndef __parent_node
-#define __parent_node(node) (0)
-#endif
-#ifndef __node_to_first_cpu
-#define __node_to_first_cpu(node) (0)
-#endif
-#ifndef __node_to_cpu_mask
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#endif
-#ifndef __node_to_memblk
-#define __node_to_memblk(node) (0)
-#endif
-
-/* Cross-node load balancing interval. */
-#ifndef NODE_BALANCE_RATE
-#define NODE_BALANCE_RATE 10
-#endif
-
-#endif /* _ASM_GENERIC_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/cpu.h ingo-A/include/asm-i386/cpu.h
--- virgin/include/asm-i386/cpu.h Sun Nov 17 20:29:50 2002
+++ ingo-A/include/asm-i386/cpu.h Fri Jan 17 09:19:23 2003
@@ -3,8 +3,8 @@
#include <linux/device.h>
#include <linux/cpu.h>
+#include <linux/topology.h>
-#include <asm/topology.h>
#include <asm/node.h>
struct i386_cpu {
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/memblk.h ingo-A/include/asm-i386/memblk.h
--- virgin/include/asm-i386/memblk.h Sun Nov 17 20:29:47 2002
+++ ingo-A/include/asm-i386/memblk.h Fri Jan 17 09:19:25 2003
@@ -4,8 +4,8 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/memblk.h>
+#include <linux/topology.h>
-#include <asm/topology.h>
#include <asm/node.h>
struct i386_memblk {
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/node.h ingo-A/include/asm-i386/node.h
--- virgin/include/asm-i386/node.h Sun Nov 17 20:29:29 2002
+++ ingo-A/include/asm-i386/node.h Fri Jan 17 09:19:24 2003
@@ -4,8 +4,7 @@
#include <linux/device.h>
#include <linux/mmzone.h>
#include <linux/node.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
struct i386_node {
struct node node;
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-i386/topology.h ingo-A/include/asm-i386/topology.h
--- virgin/include/asm-i386/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-i386/topology.h Fri Jan 17 09:19:23 2003
@@ -61,17 +61,6 @@ static inline int __node_to_first_cpu(in
/* Returns the number of the first MemBlk on Node 'node' */
#define __node_to_memblk(node) (node)
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 100
-
-#else /* !CONFIG_NUMA */
-/*
- * Other i386 platforms should define their own version of the
- * above macros here.
- */
-
-#include <asm-generic/topology.h>
-
#endif /* CONFIG_NUMA */
#endif /* _ASM_I386_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-ia64/topology.h ingo-A/include/asm-ia64/topology.h
--- virgin/include/asm-ia64/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-ia64/topology.h Fri Jan 17 09:19:23 2003
@@ -60,7 +60,4 @@
*/
#define __node_to_memblk(node) (node)
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
#endif /* _ASM_IA64_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/asm-ppc64/topology.h ingo-A/include/asm-ppc64/topology.h
--- virgin/include/asm-ppc64/topology.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/asm-ppc64/topology.h Fri Jan 17 09:19:23 2003
@@ -46,18 +46,6 @@ static inline unsigned long __node_to_cp
return mask;
}
-/* Cross-node load balancing interval. */
-#define NODE_BALANCE_RATE 10
-
-#else /* !CONFIG_NUMA */
-
-#define __cpu_to_node(cpu) (0)
-#define __memblk_to_node(memblk) (0)
-#define __parent_node(nid) (0)
-#define __node_to_first_cpu(node) (0)
-#define __node_to_cpu_mask(node) (cpu_online_map)
-#define __node_to_memblk(node) (0)
-
#endif /* CONFIG_NUMA */
#endif /* _ASM_PPC64_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/include/linux/mmzone.h ingo-A/include/linux/mmzone.h
--- virgin/include/linux/mmzone.h Fri Jan 17 09:18:31 2003
+++ ingo-A/include/linux/mmzone.h Fri Jan 17 09:19:23 2003
@@ -255,9 +255,7 @@ static inline struct zone *next_zone(str
#define MAX_NR_MEMBLKS 1
#endif /* CONFIG_NUMA */
-#include <asm/topology.h>
-/* Returns the number of the current Node. */
-#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+#include <linux/topology.h>
#ifndef CONFIG_DISCONTIGMEM
extern struct pglist_data contig_page_data;
diff -urpN -X /home/fletch/.diff.exclude virgin/include/linux/topology.h ingo-A/include/linux/topology.h
--- virgin/include/linux/topology.h Wed Dec 31 16:00:00 1969
+++ ingo-A/include/linux/topology.h Fri Jan 17 09:19:23 2003
@@ -0,0 +1,43 @@
+/*
+ * linux/include/linux/topology.h
+ */
+#ifndef _LINUX_TOPOLOGY_H
+#define _LINUX_TOPOLOGY_H
+
+#include <asm/topology.h>
+
+/*
+ * The default (non-NUMA) topology definitions:
+ */
+#ifndef __cpu_to_node
+#define __cpu_to_node(cpu) (0)
+#endif
+#ifndef __memblk_to_node
+#define __memblk_to_node(memblk) (0)
+#endif
+#ifndef __parent_node
+#define __parent_node(node) (0)
+#endif
+#ifndef __node_to_first_cpu
+#define __node_to_first_cpu(node) (0)
+#endif
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) (cpu_online_map)
+#endif
+#ifndef __node_to_cpu_mask
+#define __node_to_cpu_mask(node) (cpu_online_map)
+#endif
+#ifndef __node_to_memblk
+#define __node_to_memblk(node) (0)
+#endif
+
+#ifndef __cpu_to_node_mask
+#define __cpu_to_node_mask(cpu) __node_to_cpu_mask(__cpu_to_node(cpu))
+#endif
+
+/*
+ * Generic defintions:
+ */
+#define numa_node_id() (__cpu_to_node(smp_processor_id()))
+
+#endif /* _LINUX_TOPOLOGY_H */
diff -urpN -X /home/fletch/.diff.exclude virgin/mm/page_alloc.c ingo-A/mm/page_alloc.c
--- virgin/mm/page_alloc.c Fri Jan 17 09:18:32 2003
+++ ingo-A/mm/page_alloc.c Fri Jan 17 09:19:25 2003
@@ -28,8 +28,7 @@
#include <linux/blkdev.h>
#include <linux/slab.h>
#include <linux/notifier.h>
-
-#include <asm/topology.h>
+#include <linux/topology.h>
DECLARE_BITMAP(node_online_map, MAX_NUMNODES);
DECLARE_BITMAP(memblk_online_map, MAX_NR_MEMBLKS);
diff -urpN -X /home/fletch/.diff.exclude virgin/mm/vmscan.c ingo-A/mm/vmscan.c
--- virgin/mm/vmscan.c Mon Dec 23 23:01:58 2002
+++ ingo-A/mm/vmscan.c Fri Jan 17 09:19:25 2003
@@ -27,10 +27,10 @@
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/rmap-locking.h>
+#include <linux/topology.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>
-#include <asm/topology.h>
#include <asm/div64.h>
#include <linux/swapops.h>
Ingo,
I repeated the tests with your B0 version and it's still not
satisfying. Maybe IDLE_NODE_REBALANCE_TICK is too aggressive; maybe the
difference is that the other calls of load_balance() never have the
chance to balance across nodes.
Here are the results:
kernbench (average of 5 kernel compiles) (standard error in brackets)
---------
Elapsed UserTime SysTime
orig 134.43(1.79) 944.79(0.43) 21.41(0.28)
ingo 136.74(1.58) 951.55(0.73) 21.16(0.32)
ingofix 135.22(0.59) 952.17(0.78) 21.16(0.19)
ingoB0 134.69(0.51) 951.63(0.81) 21.12(0.15)
hackbench (chat benchmark alike) (elapsed time for N groups of 20
--------- senders & receivers, stats from 10 measurements)
N=10 N=25 N=50 N=100
orig 0.77(0.03) 1.91(0.06) 3.77(0.06) 7.78(0.21)
ingo 1.70(0.35) 3.11(0.47) 4.85(0.55) 8.80(0.98)
ingofix 1.16(0.14) 2.67(0.53) 5.05(0.26) 9.99(0.13)
ingoB0 0.84(0.03) 2.12(0.12) 4.20(0.22) 8.04(0.16)
numabench (N memory intensive tasks running in parallel, disturbed for
--------- a short time by a "hackbench 10" call)
numa_test N=4 ElapsedTime TotalUserTime TotalSysTime
orig: 26.13(2.54) 86.10(4.47) 0.09(0.01)
ingo: 27.60(2.16) 88.06(4.58) 0.11(0.01)
ingofix: 25.51(3.05) 83.55(2.78) 0.10(0.01)
ingoB0: 27.58(0.08) 90.86(4.42) 0.09(0.01)
numa_test N=8 ElapsedTime TotalUserTime TotalSysTime
orig: 24.81(2.71) 164.94(4.82) 0.17(0.01)
ingo: 27.38(3.01) 170.06(5.60) 0.30(0.03)
ingofix: 29.08(2.79) 172.10(4.48) 0.32(0.03)
ingoB0: 26.05(3.28) 171.61(7.76) 0.18(0.01)
numa_test N=16 ElapsedTime TotalUserTime TotalSysTime
orig: 45.19(3.42) 332.07(5.89) 0.32(0.01)
ingo: 50.18(0.38) 359.46(9.31) 0.46(0.04)
ingofix: 50.30(0.42) 357.38(9.12) 0.46(0.01)
ingoB0: 50.96(1.33) 371.72(18.58) 0.34(0.01)
numa_test N=32 ElapsedTime TotalUserTime TotalSysTime
orig: 86.84(1.83) 671.99(9.98) 0.65(0.02)
ingo: 93.44(2.13) 704.90(16.91) 0.82(0.06)
ingofix: 93.92(1.28) 727.58(9.26) 0.77(0.03)
ingoB0: 99.72(4.13) 759.03(29.41) 0.69(0.01)
The kernbench user time is still too large.
Hackbench improved a lot (understandable, as idle CPUs steal earlier
from remote nodes).
Numa_test didn't improve; on average we have the same results.
Hmmm, now I really tend towards leaving it the way it is in
2.5.59, except for the topology cleanup and renaming, of course. I have
no more time to test a more conservative setting of
IDLE_NODE_REBALANCE_TICK today, but that could help...
Regards,
Erich
On Fri, 2003-01-17 at 07:11, Ingo Molnar wrote:
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).
>
> Ingo
Ran this patch on a 4 node (16 CPU, 16 GB memory) NUMAQ. Results don't
look encouraging. I would suggest not applying this patch until the
degradation is worked out.
stock59 = linux 2.5.59
ingo59 = linux 2.5.59 with Ingo's B0 patch
Kernbench:
Elapsed User System CPU
stock59 29.668s 283.762s 82.286s 1233%
ingo59 37.736s 338.162s 153.486s 1302.6%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 24.44 68.07 0.78
ingo59 0.00 62.14 163.32 1.93
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 48.26 246.75 1.64
ingo59 0.00 68.17 376.85 6.42
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 56.51 845.26 2.98
ingo59 0.00 114.38 1337.65 18.89
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 116.95 1806.33 6.23
ingo59 0.00 243.46 3515.09 43.92
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
stock59 0.00 237.29 3634.59 15.71
ingo59 0.00 688.31 10605.40 102.71
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
> I repeated the tests with your B0 version and it's still not
> satisfying. Maybe too aggressive NODE_REBALANCE_IDLE_TICK, maybe the
> difference is that the other calls of load_balance() never have the
> chance to balance across nodes.
Nope, I found the problem. The topo cleanups are broken - we end up
taking all mem accesses, etc to node 0.
Use the second half of the patch (the splitup I already posted),
and fix the obvious compile error. Works fine now ;-)
Matt, you know the topo stuff better than anyone. Can you take a look
at the cleanup Ingo did, and see if it's easily fixable?
M.
PS. Ingo - I love the restructuring of the scheduler bits.
I think we need a multiplier > 2 though ... I set it to 10 for now.
Tuning will tell ...
>> I repeated the tests with your B0 version and it's still not
>> satisfying. Maybe too aggressive NODE_REBALANCE_IDLE_TICK, maybe the
>> difference is that the other calls of load_balance() never have the
>> chance to balance across nodes.
>
> Nope, I found the problem. The topo cleanups are broken - we end up
> taking all mem accesses, etc to node 0.
Kernbench:
Elapsed User System CPU
2.5.59 20.032s 186.66s 47.73s 1170%
2.5.59-ingo-mjb 19.986s 187.044s 48.592s 1178.8%
NUMA schedbench 4:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 36.38 90.70 0.62
2.5.59-ingo-mjb 0.00 34.70 88.58 0.69
NUMA schedbench 8:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 42.78 249.77 1.85
2.5.59-ingo-mjb 0.00 49.33 256.59 1.69
NUMA schedbench 16:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 56.84 848.00 2.78
2.5.59-ingo-mjb 0.00 65.67 875.05 3.58
NUMA schedbench 32:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 116.36 1807.29 5.75
2.5.59-ingo-mjb 0.00 142.77 2039.47 8.42
NUMA schedbench 64:
AvgUser Elapsed TotalUser TotalSys
2.5.59 0.00 240.01 3634.20 14.57
2.5.59-ingo-mjb 0.00 293.48 4534.99 20.62
System times are a little higher (multipliers are set at busy = 10,
idle = 10) .... I'll try setting the idle multiplier to 100, but
the other thing to do would be to increase the cross-node migrate
resistance by setting some minimum imbalance offsets. That'll
probably have to be node-specific ... something like the number
of cpus per node ... but probably 0 for the simple HT systems.
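A minimal sketch of just that check, to make the idea concrete (not
from any posted patch; nr_cpus_on_node() is a hypothetical helper and
the numbers are only illustrative):

/*
 * Sketch: is the remote node "busier enough", with a node-specific
 * minimum imbalance on top of the existing 125% rule.  The offset
 * would be something like the number of cpus on the pulling node,
 * and 0 on simple HT systems, so that a difference of a task or two
 * never triggers a cross-node steal.
 */
static inline int node_busier_enough(int this_load, int remote_load,
				     int this_node)
{
	int offset = nr_cpus_on_node(this_node);	/* hypothetical */

	return 4 * remote_load > 5 * this_load + 4 * offset;
}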
M.
Martin J. Bligh wrote:
>>I repeated the tests with your B0 version and it's still not
>>satisfying. Maybe too aggressive NODE_REBALANCE_IDLE_TICK, maybe the
>>difference is that the other calls of load_balance() never have the
>>chance to balance across nodes.
>
>
> Nope, I found the problem. The topo cleanups are broken - we end up
> taking all mem accesses, etc to node 0.
>
> Use the second half of the patch (the splitup I already posted),
> and fix the obvious compile error. Works fine now ;-)
>
> Matt, you know the topo stuff better than anyone. Can you take a look
> at the cleanup Ingo did, and see if it's easily fixable?
Umm.. most of it looks clean. I'm not really sure what the
__cpu_to_node_mask(cpu) macro is supposed to do; it looks to be just an
alias for the __node_to_cpu_mask() macro, which makes little sense to
me. That's the only thing that immediately sticks out. I'm doubly
confused as to why it's defined twice in include/linux/topology.h?
-Matt
>
> M.
>
> PS. Ingo - I love the restructuring of the scheduler bits.
> I think we need a multiplier > 2 though ... I set it to 10 for now.
> Tuning will tell ...
>
On Fri, 2003-01-17 at 11:26, Martin J. Bligh wrote:
>
> System times are little higher (multipliers are set at busy = 10,
> idle = 10) .... I'll try setting the idle multipler to 100, but
> the other thing to do would be into increase the cross-node migrate
> resistance by setting some minimum imbalance offsets. That'll
> probably have to be node-specific ... something like the number
> of cpus per node ... but probably 0 for the simple HT systems.
>
> M.
I tried several values for the multipliers (IDLE_NODE_REBALANCE_TICK
and BUSY_NODE_REBALANCE_TICK) on a 4 node NUMAQ (16 700MHZ Pentium III
CPUs, 16 GB memory). The interesting results follow:
Kernels:
stock59 - linux 2.5.59 with cputime_stats patch
ingoI50B10 - stock59 with Ingo's B0 patch as modified by Martin
with a IDLE_NODE_REBALANCE_TICK multiplier of 50
and a BUSY_NODE_REBALANCE_TICK multiplier of 10
ingoI2B2 - stock59 with Ingo's B0 patch with the original multipliers (2)
Kernbench:
Elapsed User System CPU
ingoI50B10 29.574s 284.922s 84.542s 1249%
stock59 29.498s 283.976s 83.05s 1243.8%
ingoI2B2 30.212s 289.718s 84.926s 1240%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 22.49 37.04 90.00 0.86
stock59 22.25 35.94 89.06 0.81
ingoI2B2 24.98 40.71 99.98 1.03
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 31.16 49.63 249.32 1.86
stock59 28.40 42.25 227.26 1.67
ingoI2B2 36.81 59.38 294.57 2.02
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 52.73 56.61 843.91 3.56
stock59 52.97 57.19 847.70 3.29
ingoI2B2 57.88 71.32 926.29 5.51
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 58.85 139.76 1883.59 8.07
stock59 56.57 118.05 1810.53 5.97
ingoI2B2 61.48 137.38 1967.52 10.92
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
ingoI50B10 70.74 297.25 4528.25 19.09
stock59 56.75 234.12 3632.72 15.70
ingoI2B2 70.56 298.45 4516.67 21.41
Martin already posted numbers with 10/10 for the multipliers.
It looks like on the NUMAQ the 50/10 values give the best results,
at least on these tests. I suspect that on other architectures
the optimum numbers will vary. Most likely on machines with lower
latency to off node memory, lower multipliers will help. It will
be interesting to see numbers from Erich from the NEC IA64 boxes.
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
On Fri, Jan 17, 2003 at 04:11:59PM +0100, Ingo Molnar wrote:
> agreed, i've attached the -B0 patch that does this. The balancing rates
> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
> busy-global).
I suspect some of these results may be off on NUMA-Q (or any PAE box)
if CONFIG_MTRR was enabled. Michael, Martin, please doublecheck
/proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
your compile times aren't on the order of 5-10 minutes, you're unaffected.
The severity of the MTRR regression in 2.5.59 is apparent from:
$ cat /proc/mtrr
reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
$ time make -j bzImage > /dev/null
make -j bzImage > /dev/null 8338.52s user 245.73s system 1268% cpu 11:16.56 total
Fixing it up by hand (after dealing with various bits of pain) to:
$ cat /proc/mtrr
reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
reg02: base=0x100000000 (4096MB), size=4096MB: write-back, count=1
reg03: base=0x200000000 (8192MB), size=4096MB: write-back, count=1
reg04: base=0x300000000 (12288MB), size=4096MB: write-back, count=1
reg05: base=0x400000000 (16384MB), size=16384MB: write-back, count=1
reg06: base=0x800000000 (32768MB), size=16384MB: write-back, count=1
reg07: base=0xc00000000 (49152MB), size=16384MB: write-back, count=1
make -j bzImage > /dev/null 361.72s user 546.28s system 2208% cpu 41.109 total
make -j bzImage > /dev/null 364.00s user 575.73s system 2005% cpu 46.858 total
make -j bzImage > /dev/null 366.77s user 568.44s system 2239% cpu 41.765 total
I'll do some bisection search to figure out which patch broke the world.
-- wli
>> agreed, i've attached the -B0 patch that does this. The balancing rates
>> are 1 msec, 2 msec, 200 and 400 msec (idle-local, idle-global, busy-local,
>> busy-global).
>
> I suspect some of these results may be off on NUMA-Q (or any PAE box)
> if CONFIG_MTRR was enabled. Michael, Martin, please doublecheck
> /proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
> your compile times aren't on the order of 5-10 minutes, you're unaffected.
>
> The severity of the MTRR regression in 2.5.59 is apparent from:
> $ cat /proc/mtrr
> reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
> reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
Works for me, I have MTRR on.
larry:~# cat /proc/mtrr
reg00: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
reg01: base=0x00000000 ( 0MB), size=16384MB: write-back, count=1
At some point in the past, I wrote:
>> I suspect some of these results may be off on NUMA-Q (or any PAE box)
>> if CONFIG_MTRR was enabled. Michael, Martin, please doublecheck
>> /proc/mtrr and whether CONFIG_MTRR=y. If you didn't enable it, or if
>> your compile times aren't on the order of 5-10 minutes, you're unaffected.
>> The severity of the MTRR regression in 2.5.59 is apparent from:
>> $ cat /proc/mtrr
>> reg00: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1
>> reg01: base=0x00000000 ( 0MB), size=4096MB: write-back, count=1
On Sat, Jan 18, 2003 at 12:12:31AM -0800, Martin J. Bligh wrote:
> Works for me, I have MTRR on.
> larry:~# cat /proc/mtrr
> reg00: base=0xe0000000 (3584MB), size= 512MB: uncachable, count=1
> reg01: base=0x00000000 ( 0MB), size=16384MB: write-back, count=1
Okay, it sounds like the problem needs some extra RAM to trigger. We
can bounce quads back & forth if need be, but I'll at least take a shot
at finding where it happened before you probably need to look into it.
Thanks,
Bill
Hi,
I'm currently scanning the parameter space of IDLE_NODE_REBALANCE_TICK
and BUSY_NODE_REBALANCE_TICK with the help of tunable rebalance
rates. The patch basically does:
-#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 2)
-#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
+int idle_nodebalance_rate = 10;
+int busy_nodebalance_rate = 10;
+#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * idle_nodebalance_rate)
+#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * busy_nodebalance_rate)
and makes the variables accessible in /proc/sys/kernel
We might want to leave these tunable in case it turns out that
different platforms need significantly different values. Right now
it's just a tool for tuning.
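For reference, making them show up in /proc/sys/kernel amounts to
something like the following entries in kern_table[] in
kernel/sysctl.c (a sketch, not the actual patch; the KERN_* ctl_name
values are made up and would have to be added to <linux/sysctl.h>):

extern int idle_nodebalance_rate;
extern int busy_nodebalance_rate;

	/* sketch: two integer sysctls, writable by root */
	{KERN_IDLE_NODEBALANCE_RATE, "idle_nodebalance_rate",
	 &idle_nodebalance_rate, sizeof(int), 0644, NULL, &proc_dointvec},
	{KERN_BUSY_NODEBALANCE_RATE, "busy_nodebalance_rate",
	 &busy_nodebalance_rate, sizeof(int), 0644, NULL, &proc_dointvec},

after which e.g. "echo 20 > /proc/sys/kernel/idle_nodebalance_rate"
changes the idle multiplier at runtime.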
Regards,
Erich
Andrew, hopefully this'll give you a cleaner integration point to do
the HT scheduler stuff ... I basically did a rename of "node" to "pool"
on sched.c (OK, it was a little more complex than that), and provided
some hooks for you to attach to. There's a really hacky version of
the HT stuff in there that I doubt works at all. (sched.h will need
something other than CONFIG_SCHED_NUMA, for starters).
It's not really finished, but I have to go out ... I thought you or
someone else might like to have a play with it in the meantime.
Goes on top of the second half of Ingo's stuff from yesterday
(also attached).
I think this should result in a much cleaner integration between the HT
aware stuff and the NUMA stuff. Pools is a concept Erich had in his
scheduler a while back, but it got set aside in the paring down for
integration. We should be able to add multiple levels to this fairly
easily, at some point (eg HT + NUMA), but let's get the basics working
first ;-)
M.
Mmm... seems I may have got the ordering of the cpus wrong.
Something like this might work better in sched_topo_ht.h
(yeah, it's ugly. I don't care).
static inline int pool_to_cpu_mask (int pool)
{
	/* bitwise OR: the mask covers the cpu and its HT sibling */
	return ((1UL << pool) | (1UL << cpu_sibling_map[pool]));
}
static inline int cpu_to_pool (int cpu)
{
	return min(cpu, cpu_sibling_map[cpu]);
}
Thanks to Andi, Zwane, and Bill for the corrective baseball bat strike.
I changed the macros to inlines to avoid the risk of double eval.
M.
The scan through a piece of the parameter space delivered quite
inconclusive results. I used the IDLE_NODE_REBALANCE_TICK multipliers
2, 5, 10, 20, 50 and the BUSY_NODE_REBALANCE_TICK multipliers 2, 5,
10, 20.
The benchmarks I tried were kernbench (average and error of 5 kernel
compiles) and hackbench (5 runs for each number of chatter groups:
10, 25, 50, 100). The 2.5.59 scheduler result is printed first, then a
matrix with all combinations of idle and busy rebalance
multipliers. Each value is followed by its standard error (coming from
the 5 measurements). I didn't measure numa_bench, as those values depend
mostly on the initial load balancing and showed no clear
tendency/difference.
The machine is an NEC TX7 (small version, 8 Itanium2 CPUs in 4 nodes).
The results:
- kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
best value 627.33s for idle=20ms, busy=2000ms.
- hackbench: 2.5.59 scheduler is significantly better for all
measurements.
I suppose this comes from the fact that the 2.5.59 version has the
chance to load_balance across nodes when a cpu goes idle. No idea what
other reason it could be... Maybe anybody else?
Kernbench:
==========
2.5.59 : Elapsed = 86.29(1.24)
ingo B0 : Elapsed
idle 2 5 10 20 50
busy
2 86.25(0.45) 86.62(1.56) 86.29(0.99) 85.99(0.60) 86.91(1.09)
5 86.87(1.12) 86.38(0.82) 86.00(0.69) 86.14(0.39) 86.47(0.68)
10 86.06(0.18) 86.23(0.38) 86.63(0.57) 86.82(0.95) 86.06(0.15)
20 86.64(1.24) 86.43(0.74) 86.15(0.99) 86.76(1.34) 86.70(0.68)
2.5.59 : UserTime = 623.24(0.46)
ingo B0 : UserTime
idle 2 5 10 20 50
busy
2 629.05(0.32) 628.54(0.53) 628.51(0.32) 628.66(0.23) 628.72(0.20)
5 628.14(0.88) 628.10(0.76) 628.33(0.41) 628.45(0.48) 628.11(0.37)
10 627.97(0.30) 627.77(0.23) 627.75(0.21) 627.33(0.45) 627.63(0.52)
20 627.55(0.36) 627.67(0.58) 627.36(0.67) 627.84(0.28) 627.69(0.59)
2.5.59 : SysTime = 21.83(0.16)
ingo B0 : SysTime
idle 2 5 10 20 50
busy
2 21.99(0.26) 21.89(0.12) 22.12(0.16) 22.06(0.21) 22.44(0.51)
5 22.07(0.21) 22.29(0.54) 22.15(0.08) 22.09(0.26) 21.90(0.18)
10 22.01(0.20) 22.42(0.42) 22.28(0.23) 22.04(0.37) 22.41(0.26)
20 22.03(0.20) 22.08(0.30) 22.31(0.27) 22.03(0.19) 22.35(0.33)
Hackbench 10
=============
2.5.59 : 0.77(0.03)
ingo B0:
idle 2 5 10 20 50
busy
2 0.90(0.07) 0.88(0.05) 0.84(0.05) 0.82(0.04) 0.85(0.06)
5 0.87(0.05) 0.90(0.07) 0.88(0.08) 0.89(0.09) 0.86(0.07)
10 0.85(0.06) 0.83(0.05) 0.86(0.08) 0.84(0.06) 0.87(0.06)
20 0.85(0.05) 0.87(0.07) 0.83(0.05) 0.86(0.07) 0.87(0.05)
Hackbench 25
=============
2.5.59 : 1.96(0.05)
ingo B0:
idle 2 5 10 20 50
busy
2 2.20(0.13) 2.21(0.12) 2.23(0.10) 2.20(0.10) 2.16(0.07)
5 2.13(0.12) 2.17(0.13) 2.18(0.08) 2.10(0.11) 2.16(0.10)
10 2.19(0.08) 2.21(0.12) 2.22(0.09) 2.11(0.10) 2.15(0.10)
20 2.11(0.17) 2.13(0.08) 2.18(0.06) 2.13(0.11) 2.13(0.14)
Hackbench 50
=============
2.5.59 : 3.78(0.10)
ingo B0:
idle 2 5 10 20 50
busy
2 4.31(0.13) 4.30(0.29) 4.29(0.15) 4.23(0.20) 4.14(0.10)
5 4.35(0.16) 4.34(0.24) 4.24(0.24) 4.09(0.18) 4.12(0.14)
10 4.35(0.23) 4.21(0.14) 4.36(0.24) 4.18(0.12) 4.36(0.21)
20 4.34(0.14) 4.27(0.17) 4.18(0.18) 4.29(0.24) 4.08(0.09)
Hackbench 100
==============
2.5.59 : 7.85(0.37)
ingo B0:
idle 2 5 10 20 50
busy
2 8.21(0.42) 8.07(0.25) 8.32(0.30) 8.06(0.26) 8.10(0.13)
5 8.13(0.25) 8.06(0.33) 8.14(0.49) 8.24(0.24) 8.04(0.20)
10 8.05(0.17) 8.16(0.13) 8.13(0.16) 8.05(0.24) 8.01(0.30)
20 8.21(0.25) 8.23(0.24) 8.36(0.41) 8.30(0.37) 8.10(0.30)
Regards,
Erich
On Fri, Jan 17, 2003 at 11:08:08PM -0800, William Lee Irwin III wrote:
> The severity of the MTRR regression in 2.5.59 is apparent from:
This is not a result of userland initscripts botching the MTRR; not
only are printk's in MTRR-setting routines not visible but it's also
apparent from the fact that highmem mem_map initialization suffers a
similar degradation adding almost a full 20 minutes to boot times.
-- wli
On Sun, 19 Jan 2003, Erich Focht wrote:
> The results:
> - kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
> best value 627.33s for idle=20ms, busy=2000ms.
> - hackbench: 2.5.59 scheduler is significantly better for all
> measurements.
>
> I suppose this comes from the fact that the 2.5.59 version has the
> chance to load_balance across nodes when a cpu goes idle. No idea what
> other reason it could be... Maybe anybody else?
this shows that agressive idle-rebalancing is the most important factor. I
think this means that the unification of the balancing code should go into
the other direction: ie. applying the ->nr_balanced logic to the SMP
balancer as well.
kernelbench is the kind of benchmark that is most sensitive to over-eager
global balancing, and since the 2.5.59 ->nr_balanced logic produced the
best results, it clearly shows it's not over-eager. hackbench is one that
is quite sensitive to under-balancing. Ie. trying to maximize both will
lead us to a good balance.
Ingo
On Monday 20 January 2003 10:28, Ingo Molnar wrote:
> On Sun, 19 Jan 2003, Erich Focht wrote:
> > The results:
> > - kernbench UserTime is best for the 2.5.59 scheduler (623s). IngoB0
> > best value 627.33s for idle=20ms, busy=2000ms.
> > - hackbench: 2.5.59 scheduler is significantly better for all
> > measurements.
> >
> > I suppose this comes from the fact that the 2.5.59 version has the
> > chance to load_balance across nodes when a cpu goes idle. No idea what
> > other reason it could be... Maybe anybody else?
>
> this shows that agressive idle-rebalancing is the most important factor. I
> think this means that the unification of the balancing code should go into
> the other direction: ie. applying the ->nr_balanced logic to the SMP
> balancer as well.
Could you please explain your idea? As far as I understand, the SMP
balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
mean something different...
> kernelbench is the kind of benchmark that is most sensitive to over-eager
> global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> best results, it clearly shows it's not over-eager. hackbench is one that
> is quite sensitive to under-balancing. Ie. trying to maximize both will
> lead us to a good balance.
Yes! Actually the currently implemented nr_balanced logic is pretty
dumb: the counter reaches the cross-node balance threshold after a
certain number of calls to intra-node lb, no matter whether these were
successful or not. I'd like to try incrementing the counter only on
unsuccessful load balances; this would give a clear priority to
intra-node balancing and a clear and controllable delay for cross-node
balancing. A tiny patch for this (for 2.5.59) is attached. As the name
nr_balanced would be misleading for this kind of usage, I renamed it
to nr_lb_failed.
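The description above translates into something like the following
(a sketch of the logic only, not the attached patch itself; the
reset-on-success behaviour and the threshold value are assumptions,
with the threshold playing the role of the old NODE_BALANCE_RATE):

/*
 * Sketch: only intra-node balance attempts that found nothing to
 * steal advance the counter; a successful one resets it.  Cross-node
 * balancing is tried once enough attempts in a row have failed.
 */
#define LB_FAIL_THRESHOLD 10	/* illustrative value */

static inline int should_balance_across_nodes(runqueue_t *this_rq,
					      int intra_node_failed)
{
	if (!intra_node_failed) {
		this_rq->nr_lb_failed = 0;
		return 0;
	}
	if (++this_rq->nr_lb_failed < LB_FAIL_THRESHOLD)
		return 0;
	this_rq->nr_lb_failed = 0;
	return 1;
}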
Regards,
Erich
> kernelbench is the kind of benchmark that is most sensitive to over-eager
> global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> best results, it clearly shows it's not over-eager.
Careful ... what shows well on one machine may not on another - this depends
heavily on the NUMA ratio - for our machine the nr_balanced logic in 59 is
still over-agressive (20:1 NUMA ratio). For low-ratio machines it may work
fine. It actually works best for us when it's switched off altogether I
think (which is obviously not a good solution).
But there's more than one dimension to tune here - we can tune both the
frequency and the level of imbalance required. I had good results specifying
a minimum imbalance of > 4 between the current and busiest nodes before
balancing. Reason (2 nodes, 4 CPUs each): If I have 4 tasks on one node,
and 8 on another, that's still one or two per cpu in all cases whatever
I do (well, provided I'm not stupid enough to make anything idle). So
at that point, I just want lowest task thrash.
Moving tasks between nodes is really expensive, and shouldn't be done
lightly - the only thing the background busy rebalancer should be fixing
is significant long-term imbalances. It would be nice if we also chose
the task with the smallest RSS to migrate; I think that's a worthwhile
optimisation (we'll need to make sure we use realistic benchmarks with
a mix of different task sizes). Working out which ones have the smallest
"on-node RSS - off-node RSS" is another step after that ...
> hackbench is one that is quite sensitive to under-balancing.
> Ie. trying to maximize both will lead us to a good balance.
I'll try to do some hackbench runs on NUMA-Q as well.
Just to add something else to the mix, there's another important factor
as well as the NUMA ratio - the size of the interconnect cache vs the
size of the task migrated. The interconnect cache on the NUMA-Q is 32MB;
our newer machine has a much lower NUMA ratio, but effectively a much
smaller cache as well. NUMA ratios are often expressed in terms of
latency, but there's a bandwidth consideration too. Hyperthreading will
want something different again.
I think we definitely need to tune this on a per-arch basis. There's no
way that one-size-fits-all is going to fit a situation as complex as this
(though we can definitely learn from each other's analysis).
M.
On Mon, 20 Jan 2003, Martin J. Bligh wrote:
> I think we definitely need to tune this on a per-arch basis. There's no
> way that one-size-fits-all is going to fit a situation as complex as
> this (though we can definitely learn from each other's analysis).
agreed - although the tunable should be constant (if possible, or
boot-established but not /proc exported), and there should be as few
tunables as possible. We already tune some of our scheduler behavior to
the SMP cachesize. (the cache_decay_ticks SMP logic.)
Ingo
On Mon, 20 Jan 2003, Erich Focht wrote:
> Could you please explain your idea? As far as I understand, the SMP
> balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
> mean something different...
yes, but eg. in the idle-rebalance case we are more agressive at moving
tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
to do this 'agressive' balancing even if not triggered from the
CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
more agressive.
ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
level. (lets hope we dont have to deal with a third caching level anytime
soon - although that could as well happen once SMT CPUs start doing NUMA.)
There's no real reason to do balancing in a different way on each level -
the weight might be different, but the core logic should be synced up.
(one thing that is indeed different for the NUMA step is locality of
uncached memory.)
> > kernelbench is the kind of benchmark that is most sensitive to over-eager
> > global balancing, and since the 2.5.59 ->nr_balanced logic produced the
> > best results, it clearly shows it's not over-eager. hackbench is one that
> > is quite sensitive to under-balancing. Ie. trying to maximize both will
> > lead us to a good balance.
>
> Yes! Actually the currently implemented nr_balanced logic is pretty
> dumb: the counter reaches the cross-node balance threshold after a
> certain number of calls to intra-node lb, no matter whether these were
> successful or not. I'd like to try incrementing the counter only on
> unsuccessful load balances; this would give a clear priority to
> intra-node balancing and a clear and controllable delay for cross-node
> balancing. A tiny patch for this (for 2.5.59) is attached. As the name
> nr_balanced would be misleading for this kind of usage, I renamed it to
> nr_lb_failed.
indeed this approach makes much more sense than the simple ->nr_balanced
counter. A similar approach makes sense on the SMP level as well: if the
current 'busy' rebalancer fails to get a new task, we can try the current
'idle' rebalancer. Ie. a CPU going idle would do the less intrusive
rebalancing first.
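Something like this in the about-to-go-idle path, then (a sketch of
the idea only; it assumes a load_balance() variant that reports
whether it pulled anything, which the current one does not):

/*
 * Sketch: a CPU about to go idle first tries the conservative
 * busy-style balance and only falls back to the aggressive
 * idle-style pass if the first one moved nothing.
 */
static void balance_before_idle(runqueue_t *this_rq, int this_cpu)
{
	if (load_balance_moved(this_rq, this_cpu, 0))	/* hypothetical */
		return;
	load_balance_moved(this_rq, this_cpu, 1);
}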
have you experimented with making the counter limit == 1 actually? Ie.
immediately trying to do a global balancing once the less intrusive
balancing fails?
Ingo
> yes, but eg. in the idle-rebalance case we are more agressive at moving
> tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
> to do this 'agressive' balancing even if not triggered from the
> CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
> more agressive.
Do you think it's worth looking at the initial load-balance code for
standard SMP?
> ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
> level. (lets hope we dont have to deal with a third caching level anytime
> soon - although that could as well happen once SMT CPUs start doing NUMA.)
We have those already (IBM x440) ;-) That's one of the reasons why I prefer
the pools concept I posted at the weekend over just "nodes". Also, there
are NUMA machines where nodes are not all equidistant ... that can be
thought of as multi-level too.
> There's no real reason to do balancing in a different way on each level -
> the weight might be different, but the core logic should be synced up.
> (one thing that is indeed different for the NUMA step is locality of
> uncached memory.)
Right, the current model should work fine, it just needs generalising out
a bit.
M
On Mon, 20 Jan 2003, Ingo Molnar wrote:
> ie. SMP is just the first level in the cache-hierarchy, NUMA is the
> second level. (lets hope we dont have to deal with a third caching level
> anytime soon - although that could as well happen once SMT CPUs start
> doing NUMA.)
or as Arjan points out, like the IBM x440 boxes ...
i think we want to handle SMT on a different level, ie. via the
shared-runqueue approach, so it's not a genuine new level of caching, it's
rather a new concept of multiple logical cores per physical core. (which
needs its own code in the scheduler.)
Ingo
> or as Arjan points out, like the IBM x440 boxes ...
;-)
> i think we want to handle SMT on a different level, ie. via the
> shared-runqueue approach, so it's not a genuine new level of caching, it's
> rather a new concept of multiple logical cores per physical core. (which
> needs its own code in the scheduler.)
Do you have that code working already (presumably needs locking changes)?
I seem to recall something like that existing already, but I don't recall
if it was ever fully working or not ...
I think the large PPC64 boxes have multilevel NUMA as well - two real
phys cores on one die, sharing some cache (L2 but not L1? Anton?).
And SGI have multilevel nodes too I think ... so we'll still need
multilevel NUMA at some point ... but maybe not right now.
M.
On Mon, 20 Jan 2003, Martin J. Bligh wrote:
> Do you have that code working already (presumably needs locking
> changes)? I seem to recall something like that existing already, but I
> don't recall if it was ever fully working or not ...
yes, i have a HT testbox and working code:
http://lwn.net/Articles/8553/
the patch is rather old, i'll update it to 2.5.59.
> I think the large PPC64 boxes have multilevel NUMA as well - two real
> phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
> SGI have multilevel nodes too I think ... so we'll still need multilevel
> NUMA at some point ... but maybe not right now.
Intel's HT is the cleanest case: pure logical cores, which clearly need
special handling. Whether the other SMT solutions want to be handled via
the logical-cores code or via another level of NUMA-balancing code,
depends on benchmarking results i suspect. It will be one more flexibility
that system maintainers will have, it's all set up via the
sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
runqueue into another CPU's runqueue. It's basically the 0th level of
balancing, which will be fundamentally different. The other levels of
balancing are (or should be) similar to each other - only differing in
weight of balancing, not differing in the actual algorithm.
Ingo
> > I think the large PPC64 boxes have multilevel NUMA as well - two real
> > phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
> > SGI have multilevel nodes too I think ... so we'll still need multilevel
> > NUMA at some point ... but maybe not right now.
>
> Intel's HT is the cleanest case: pure logical cores, which clearly need
> special handling. Whether the other SMT solutions want to be handled via
> the logical-cores code or via another level of NUMA-balancing code,
> depends on benchmarking results i suspect. It will be one more flexibility
> that system maintainers will have, it's all set up via the
> sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
> runqueue into another CPU's runqueue. It's basically the 0th level of
> balancing, which will be fundamentally different. The other levels of
> balancing are (or should be) similar to each other - only differing in
> weight of balancing, not differing in the actual algorithm.
I have included a very rough patch to do ht-numa topology. It requires you to
manually define CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses num_cpunodes
instead of numnodes and defines MAX_NUM_NODES to 8 if CONFIG_NUMA is defined.
I had to remove the first check in sched_best_cpu() to get decent low load
performance out of this. I am still sorting through some things, but I
thought it would be best if I just post what I have now.
-Andrew Theurer
>> > I think the large PPC64 boxes have multilevel NUMA as well - two real
>> > phys cores on one die, sharing some cache (L2 but not L1? Anton?). And
>> > SGI have multilevel nodes too I think ... so we'll still need multilevel
>> > NUMA at some point ... but maybe not right now.
>>
>> Intel's HT is the cleanest case: pure logical cores, which clearly need
>> special handling. Whether the other SMT solutions want to be handled via
>> the logical-cores code or via another level of NUMA-balancing code,
>> depends on benchmarking results i suspect. It will be one more flexibility
>> that system maintainers will have, it's all set up via the
>> sched_map_runqueue(cpu1, cpu2) boot-time call that 'merges' a CPU's
>> runqueue into another CPU's runqueue. It's basically the 0th level of
>> balancing, which will be fundamentally different. The other levels of
>> balancing are (or should be) similar to each other - only differing in
>> weight of balancing, not differing in the actual algorithm.
>
> I have included a very rough patch to do ht-numa topology. It requires you to
> manually define CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses num_cpunodes
> instead of numnodes and defines MAX_NUM_NODES to 8 if CONFIG_NUMA is defined.
Whilst it's fine for benchmarking, I think this kind of overlap is a
very bad idea long-term - the confusion introduced is just asking for
trouble. And think what's going to happen when you mix HT and NUMA.
If we're going to use this for HT, it needs abstracting out.
M.
>> > I have included a very rough patch to do ht-numa topology. It requires you to
>> > manually define CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses
>> > num_cpunodes instead of numnodes and defines MAX_NUM_NODES to 8 if
>> > CONFIG_NUMA is defined.
>>
>> Whilst it's fine for benchmarking, I think this kind of overlap is a
>> very bad idea long-term - the confusion introduced is just asking for
>> trouble. And think what's going to happen when you mix HT and NUMA.
>> If we're going to use this for HT, it needs abstracting out.
>
> I have no issues with using HT specific bits instead of NUMA. Design wise it
> would be nice if it could all be happy together, but if not, then so be it.
That's not what I meant - we can share the code, we just need to abstract
it out so you don't have to turn on CONFIG_NUMA. That was the point of
the pooling patch I posted at the weekend. Anyway, let's decide on the
best approach first, we can clean up the code for merging later.
M.
> > I have included a very rough patch to do ht-numa topology. It requires you to
> > manually define CONFIG_NUMA and CONFIG_NUMA_SCHED. It also uses
> > num_cpunodes instead of numnodes and defines MAX_NUM_NODES to 8 if
> > CONFIG_NUMA is defined.
>
> Whilst it's fine for benchmarking, I think this kind of overlap is a
> very bad idea long-term - the confusion introduced is just asking for
> trouble. And think what's going to happen when you mix HT and NUMA.
> If we're going to use this for HT, it needs abstracting out.
I have no issues with using HT specific bits instead of NUMA. Design wise it
would be nice if it could all be happy together, but if not, then so be it.
-Andrew Theurer
the attached patch (against 2.5.59) is my current scheduler tree, it
includes two main areas of changes:
- interactivity improvements, mostly reworked bits from Andrea's tree and
various tunings.
- HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
passive load balancing, active-balancing, HT-aware task pickup,
HT-aware affinity and HT-aware wakeup.
the sched-2.5.59-D7 patch can also be downloaded from:
http://redhat.com/~mingo/O(1)-scheduler/
the NUMA code compiles, but the boot code needs to be updated to make use
of sched_map_runqueue(). Usage is very simple, the following call:
sched_map_runqueue(0, 1);
merges CPU#1's runqueue into CPU#0's runqueue, so both CPU#0 and CPU#1
maps to runqueue#0 - runqueue#1 is inactive from that point on. The NUMA
boot code needs to interpret the topology and merge CPUs that are on the
same physical core, into one runqueue. The scheduler code will take care
of all the rest.
there's a 'test_ht' boot option available on x86, that can be used to
trigger a shared runqueue for the first two CPUs, for those who have no
genuine HT boxes but want to give it a go.
(it complicates things that the interactivity changes to the scheduler are
included here as well, but this was my base tree and i didnt want to go
back.)
Ingo
--- linux/arch/i386/kernel/cpu/proc.c.orig 2003-01-20 22:25:23.000000000 +0100
+++ linux/arch/i386/kernel/cpu/proc.c 2003-01-20 22:58:35.000000000 +0100
@@ -1,4 +1,5 @@
#include <linux/smp.h>
+#include <linux/sched.h>
#include <linux/timex.h>
#include <linux/string.h>
#include <asm/semaphore.h>
@@ -101,6 +102,13 @@
fpu_exception ? "yes" : "no",
c->cpuid_level,
c->wp_works_ok ? "yes" : "no");
+#if CONFIG_SHARE_RUNQUEUE
+{
+ extern long __rq_idx[NR_CPUS];
+
+ seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]);
+}
+#endif
for ( i = 0 ; i < 32*NCAPINTS ; i++ )
if ( test_bit(i, c->x86_capability) &&
--- linux/arch/i386/kernel/smpboot.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/arch/i386/kernel/smpboot.c 2003-01-20 22:58:35.000000000 +0100
@@ -38,6 +38,7 @@
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/kernel_stat.h>
#include <linux/smp_lock.h>
#include <linux/irq.h>
@@ -945,6 +946,16 @@
int cpu_sibling_map[NR_CPUS] __cacheline_aligned;
+static int test_ht;
+
+static int __init ht_setup(char *str)
+{
+ test_ht = 1;
+ return 1;
+}
+
+__setup("test_ht", ht_setup);
+
static void __init smp_boot_cpus(unsigned int max_cpus)
{
int apicid, cpu, bit;
@@ -1087,16 +1098,31 @@
Dprintk("Boot done.\n");
/*
- * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
+ * Here we can be sure that there is an IO-APIC in the system. Let's
+ * go and set it up:
+ */
+ smpboot_setup_io_apic();
+
+ setup_boot_APIC_clock();
+
+ /*
+ * Synchronize the TSC with the AP
+ */
+ if (cpu_has_tsc && cpucount)
+ synchronize_tsc_bp();
+ /*
+ * If Hyper-Threading is available, construct cpu_sibling_map[], so
* that we can tell the sibling CPU efficiently.
*/
+printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus());
if (cpu_has_ht && smp_num_siblings > 1) {
for (cpu = 0; cpu < NR_CPUS; cpu++)
cpu_sibling_map[cpu] = NO_PROC_ID;
for (cpu = 0; cpu < NR_CPUS; cpu++) {
int i;
- if (!test_bit(cpu, &cpu_callout_map)) continue;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;
for (i = 0; i < NR_CPUS; i++) {
if (i == cpu || !test_bit(i, &cpu_callout_map))
@@ -1112,17 +1138,41 @@
printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
}
}
- }
-
- smpboot_setup_io_apic();
-
- setup_boot_APIC_clock();
+#if CONFIG_SHARE_RUNQUEUE
+ /*
+ * At this point APs would have synchronised TSC and
+ * waiting for smp_commenced, with their APIC timer
+ * disabled. So BP can go ahead do some initialization
+ * for Hyper-Threading (if needed).
+ */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ int i;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (i <= cpu)
+ continue;
+ if (!test_bit(i, &cpu_callout_map))
+ continue;
- /*
- * Synchronize the TSC with the AP
- */
- if (cpu_has_tsc && cpucount)
- synchronize_tsc_bp();
+ if (phys_proc_id[cpu] != phys_proc_id[i])
+ continue;
+ /*
+ * merge runqueues, resulting in one
+ * runqueue per package:
+ */
+ sched_map_runqueue(cpu, i);
+ break;
+ }
+ }
+#endif
+ }
+#if CONFIG_SHARE_RUNQUEUE
+ if (smp_num_siblings == 1 && test_ht) {
+ printk("Simulating a 2-sibling 1-phys-CPU HT setup!\n");
+ sched_map_runqueue(0, 1);
+ }
+#endif
}
/* These are wrappers to interface to the new boot process. Someone
--- linux/arch/i386/Kconfig.orig 2003-01-20 20:19:23.000000000 +0100
+++ linux/arch/i386/Kconfig 2003-01-20 22:58:35.000000000 +0100
@@ -408,6 +408,24 @@
If you don't know what to do here, say N.
+choice
+
+ prompt "Hyperthreading Support"
+ depends on SMP
+ default NR_SIBLINGS_0
+
+config NR_SIBLINGS_0
+ bool "off"
+
+config NR_SIBLINGS_2
+ bool "2 siblings"
+
+config NR_SIBLINGS_4
+ bool "4 siblings"
+
+endchoice
+
+
config PREEMPT
bool "Preemptible Kernel"
help
--- linux/fs/pipe.c.orig 2003-01-20 19:28:43.000000000 +0100
+++ linux/fs/pipe.c 2003-01-20 22:58:35.000000000 +0100
@@ -117,7 +117,7 @@
up(PIPE_SEM(*inode));
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
if (ret > 0)
@@ -205,7 +205,7 @@
}
up(PIPE_SEM(*inode));
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
}
if (ret > 0) {
@@ -275,7 +275,7 @@
free_page((unsigned long) info->base);
kfree(info);
} else {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
--- linux/include/linux/sched.h.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/include/linux/sched.h 2003-01-20 22:58:35.000000000 +0100
@@ -147,6 +147,24 @@
extern void sched_init(void);
extern void init_idle(task_t *idle, int cpu);
+/*
+ * Is there a way to do this via Kconfig?
+ */
+#define CONFIG_NR_SIBLINGS 0
+
+#if CONFIG_NR_SIBLINGS_2
+# define CONFIG_NR_SIBLINGS 2
+#elif CONFIG_NR_SIBLINGS_4
+# define CONFIG_NR_SIBLINGS 4
+#endif
+
+#if CONFIG_NR_SIBLINGS
+# define CONFIG_SHARE_RUNQUEUE 1
+#else
+# define CONFIG_SHARE_RUNQUEUE 0
+#endif
+extern void sched_map_runqueue(int cpu1, int cpu2);
+
extern void show_state(void);
extern void show_trace(unsigned long *stack);
extern void show_stack(unsigned long *stack);
@@ -293,7 +311,7 @@
prio_array_t *array;
unsigned long sleep_avg;
- unsigned long sleep_timestamp;
+ unsigned long last_run;
unsigned long policy;
unsigned long cpus_allowed;
@@ -605,6 +623,8 @@
#define remove_parent(p) list_del_init(&(p)->sibling)
#define add_parent(p, parent) list_add_tail(&(p)->sibling,&(parent)->children)
+#if 1
+
#define REMOVE_LINKS(p) do { \
if (thread_group_leader(p)) \
list_del_init(&(p)->tasks); \
@@ -633,6 +653,31 @@
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)
+#else
+
+#define REMOVE_LINKS(p) do { \
+ list_del_init(&(p)->tasks); \
+ remove_parent(p); \
+ } while (0)
+
+#define SET_LINKS(p) do { \
+ list_add_tail(&(p)->tasks,&init_task.tasks); \
+ add_parent(p, (p)->parent); \
+ } while (0)
+
+#define next_task(p) list_entry((p)->tasks.next, struct task_struct, tasks)
+#define prev_task(p) list_entry((p)->tasks.prev, struct task_struct, tasks)
+
+#define for_each_process(p) \
+ for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+
+#define do_each_thread(g, t) \
+ for (t = &init_task ; (t = next_task(t)) != &init_task ; )
+
+#define while_each_thread(g, t)
+
+#endif
+
extern task_t * FASTCALL(next_thread(task_t *p));
#define thread_group_leader(p) (p->pid == p->tgid)
--- linux/include/asm-i386/apic.h.orig 2003-01-20 19:28:31.000000000 +0100
+++ linux/include/asm-i386/apic.h 2003-01-20 22:58:35.000000000 +0100
@@ -98,4 +98,6 @@
#endif /* CONFIG_X86_LOCAL_APIC */
+extern int phys_proc_id[NR_CPUS];
+
#endif /* __ASM_APIC_H */
--- linux/kernel/fork.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/kernel/fork.c 2003-01-20 22:58:35.000000000 +0100
@@ -876,7 +876,7 @@
*/
p->first_time_slice = 1;
current->time_slice >>= 1;
- p->sleep_timestamp = jiffies;
+ p->last_run = jiffies;
if (!current->time_slice) {
/*
* This case is rare, it happens when the parent has only
--- linux/kernel/sys.c.orig 2003-01-20 19:28:52.000000000 +0100
+++ linux/kernel/sys.c 2003-01-20 22:58:36.000000000 +0100
@@ -220,7 +220,7 @@
if (error == -ESRCH)
error = 0;
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+ if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))
error = -EACCES;
else
set_user_nice(p, niceval);
--- linux/kernel/sched.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/kernel/sched.c 2003-01-20 22:58:36.000000000 +0100
@@ -54,20 +54,21 @@
/*
* These are the 'tuning knobs' of the scheduler:
*
- * Minimum timeslice is 10 msecs, default timeslice is 150 msecs,
- * maximum timeslice is 300 msecs. Timeslices get refilled after
+ * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
+ * maximum timeslice is 200 msecs. Timeslices get refilled after
* they expire.
*/
#define MIN_TIMESLICE ( 10 * HZ / 1000)
-#define MAX_TIMESLICE (300 * HZ / 1000)
-#define CHILD_PENALTY 95
+#define MAX_TIMESLICE (200 * HZ / 1000)
+#define CHILD_PENALTY 50
#define PARENT_PENALTY 100
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (2*HZ)
-#define STARVATION_LIMIT (2*HZ)
-#define NODE_THRESHOLD 125
+#define MAX_SLEEP_AVG (10*HZ)
+#define STARVATION_LIMIT (30*HZ)
+#define SYNC_WAKEUPS 1
+#define SMART_WAKE_CHILD 1
/*
* If a task is 'interactive' then we reinsert it in the active
@@ -141,6 +142,48 @@
};
/*
+ * It's possible for two CPUs to share the same runqueue.
+ * This makes sense if they eg. share caches.
+ *
+ * We take the common 1:1 (SMP, UP) case and optimize it,
+ * the rest goes via remapping: rq_idx(cpu) gives the
+ * runqueue on which a particular cpu is on, cpu_idx(cpu)
+ * gives the rq-specific index of the cpu.
+ *
+ * (Note that the generic scheduler code does not impose any
+ * restrictions on the mappings - there can be 4 CPUs per
+ * runqueue or even assymetric mappings.)
+ */
+#if CONFIG_SHARE_RUNQUEUE
+# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS
+ long __rq_idx[NR_CPUS] __cacheline_aligned;
+ static long __cpu_idx[NR_CPUS] __cacheline_aligned;
+# define rq_idx(cpu) (__rq_idx[(cpu)])
+# define cpu_idx(cpu) (__cpu_idx[(cpu)])
+# define for_each_sibling(idx, rq) \
+ for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++)
+# define rq_nr_cpus(rq) ((rq)->nr_cpus)
+# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance)
+#else
+# define MAX_NR_SIBLINGS 1
+# define rq_idx(cpu) (cpu)
+# define cpu_idx(cpu) 0
+# define for_each_sibling(idx, rq) while (0)
+# define cpu_active_balance(c) 0
+# define do_active_balance(rq, cpu) do { } while (0)
+# define rq_nr_cpus(rq) 1
+ static inline void active_load_balance(runqueue_t *rq, int this_cpu) { }
+#endif
+
+typedef struct cpu_s {
+ task_t *curr, *idle;
+ task_t *migration_thread;
+ struct list_head migration_queue;
+ int active_balance;
+ int cpu;
+} cpu_t;
+
+/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
@@ -151,7 +194,7 @@
spinlock_t lock;
unsigned long nr_running, nr_switches, expired_timestamp,
nr_uninterruptible;
- task_t *curr, *idle;
+ struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
#ifdef CONFIG_NUMA
@@ -159,27 +202,39 @@
unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
- task_t *migration_thread;
- struct list_head migration_queue;
+ int nr_cpus;
+ cpu_t cpu[MAX_NR_SIBLINGS];
atomic_t nr_iowait;
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
-#define cpu_rq(cpu) (runqueues + (cpu))
+#define cpu_rq(cpu) (runqueues + (rq_idx(cpu)))
+#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c))
+#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr)
+#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle)
+
#define this_rq() cpu_rq(smp_processor_id())
#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
+#define migration_thread(cpu) (cpu_int(cpu)->migration_thread)
+#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue)
+
+#if NR_CPUS > 1
+# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu)))
+#else
+# define task_allowed(p, cpu) 1
+#endif
+
/*
* Default context-switch locking:
*/
#ifndef prepare_arch_switch
# define prepare_arch_switch(rq, next) do { } while(0)
# define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p) ((rq)->curr == (p))
+# define task_running(p) (cpu_curr_ptr(task_cpu(p)) == (p))
#endif
#ifdef CONFIG_NUMA
@@ -322,16 +377,21 @@
* Also update all the scheduling statistics stuff. (sleep average
* calculation, priority modifiers, etc.)
*/
+static inline void __activate_task(task_t *p, runqueue_t *rq)
+{
+ enqueue_task(p, rq->active);
+ nr_running_inc(rq);
+}
+
static inline void activate_task(task_t *p, runqueue_t *rq)
{
- unsigned long sleep_time = jiffies - p->sleep_timestamp;
- prio_array_t *array = rq->active;
+ unsigned long sleep_time = jiffies - p->last_run;
if (!rt_task(p) && sleep_time) {
/*
* This code gives a bonus to interactive tasks. We update
* an 'average sleep time' value here, based on
- * sleep_timestamp. The more time a task spends sleeping,
+ * ->last_run. The more time a task spends sleeping,
* the higher the average gets - and the higher the priority
* boost gets as well.
*/
@@ -340,8 +400,7 @@
p->sleep_avg = MAX_SLEEP_AVG;
p->prio = effective_prio(p);
}
- enqueue_task(p, array);
- nr_running_inc(rq);
+ __activate_task(p, rq);
}
/*
@@ -382,6 +441,11 @@
#endif
}
+static inline void resched_cpu(int cpu)
+{
+ resched_task(cpu_curr_ptr(cpu));
+}
+
#ifdef CONFIG_SMP
/*
@@ -398,7 +462,7 @@
repeat:
preempt_disable();
rq = task_rq(p);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
cpu_relax();
/*
* enable/disable preemption just to make this
@@ -409,7 +473,7 @@
goto repeat;
}
rq = task_rq_lock(p, &flags);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
task_rq_unlock(rq, &flags);
preempt_enable();
goto repeat;
@@ -431,10 +495,39 @@
*/
void kick_if_running(task_t * p)
{
- if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
+ if ((task_running(p)) && (task_cpu(p) != smp_processor_id()))
resched_task(p);
}
+static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p)
+{
+ cpu_t *curr_cpu;
+ task_t *curr;
+ int idx;
+
+ if (idle_cpu(cpu))
+ return resched_cpu(cpu);
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ if (curr_cpu->idle == curr_cpu->curr)
+ return resched_cpu(curr_cpu->cpu);
+ }
+
+ if (p->prio < cpu_curr_ptr(cpu)->prio)
+ return resched_task(cpu_curr_ptr(cpu));
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ curr = curr_cpu->curr;
+ if (p->prio < curr->prio)
+ return resched_task(curr);
+ }
+}
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -455,6 +548,7 @@
long old_state;
runqueue_t *rq;
+ sync &= SYNC_WAKEUPS;
repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
@@ -463,7 +557,7 @@
* Fast-migrate the task if it's not running or runnable
* currently. Do not violate hard affinity.
*/
- if (unlikely(sync && !task_running(rq, p) &&
+ if (unlikely(sync && !task_running(p) &&
(task_cpu(p) != smp_processor_id()) &&
(p->cpus_allowed & (1UL << smp_processor_id())))) {
@@ -473,10 +567,12 @@
}
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
- activate_task(p, rq);
-
- if (p->prio < rq->curr->prio)
- resched_task(rq->curr);
+ if (sync)
+ __activate_task(p, rq);
+ else {
+ activate_task(p, rq);
+ wake_up_cpu(rq, task_cpu(p), p);
+ }
success = 1;
}
p->state = TASK_RUNNING;
@@ -512,8 +608,19 @@
p->prio = effective_prio(p);
}
set_task_cpu(p, smp_processor_id());
- activate_task(p, rq);
+ if (SMART_WAKE_CHILD) {
+ if (unlikely(!current->array))
+ __activate_task(p, rq);
+ else {
+ p->prio = current->prio;
+ list_add_tail(&p->run_list, ¤t->run_list);
+ p->array = current->array;
+ p->array->nr_active++;
+ nr_running_inc(rq);
+ }
+ } else
+ activate_task(p, rq);
rq_unlock(rq);
}
@@ -561,7 +668,7 @@
* context_switch - switch to the new MM and the new
* thread's register state.
*/
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;
@@ -575,7 +682,7 @@
if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
- mmdrop(oldmm);
+ rq->prev_mm = oldmm;
}
/* Here we just switch the register state and the stack. */
@@ -596,8 +703,9 @@
unsigned long i, sum = 0;
for (i = 0; i < NR_CPUS; i++)
- sum += cpu_rq(i)->nr_running;
-
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_running;
return sum;
}
@@ -608,7 +716,9 @@
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
- sum += cpu_rq(i)->nr_uninterruptible;
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_uninterruptible;
}
return sum;
}
@@ -790,7 +900,23 @@
#endif /* CONFIG_NUMA */
-#if CONFIG_SMP
+/*
+ * One of the idle_cpu_tick() and busy_cpu_tick() functions will
+ * get called every timer tick, on every CPU. Our balancing action
+ * frequency and balancing agressivity depends on whether the CPU is
+ * idle or not.
+ *
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * systems with HZ=100, every 10 msecs.)
+ */
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+
+#if !CONFIG_SMP
+
+static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { }
+
+#else
/*
* double_lock_balance - lock the busiest runqueue
@@ -906,12 +1032,7 @@
set_task_cpu(p, this_cpu);
nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
- /*
- * Note that idle threads have a prio of MAX_PRIO, for this test
- * to be always true for them.
- */
- if (p->prio < this_rq->curr->prio)
- set_need_resched();
+ wake_up_cpu(this_rq, this_cpu, p);
}
/*
@@ -922,9 +1043,9 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int this_cpu, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, idx;
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
@@ -972,12 +1093,14 @@
* 1) running (obviously), or
* 2) cannot be migrated to this CPU due to cpus_allowed, or
* 3) are cache-hot on their current CPU.
+ *
+ * (except if we are in idle mode which is a more agressive
+ * form of rebalancing.)
*/
-#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
- ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
- !task_running(rq, p) && \
- ((p)->cpus_allowed & (1UL << (this_cpu))))
+#define CAN_MIGRATE_TASK(p,rq,cpu) \
+ ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
+ !task_running(p) && task_allowed(p, cpu))
curr = curr->prev;
@@ -1000,31 +1123,136 @@
;
}
+#if CONFIG_SHARE_RUNQUEUE
+static void active_load_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+ /*
+ * Any SMT-specific imbalance?
+ */
+ for_each_sibling(idx, rq)
+ if (rq->cpu[idx].idle == rq->cpu[idx].curr)
+ goto next_cpu;
+
+ /*
+ * At this point it's sure that we have a SMT
+ * imbalance: this (physical) CPU is idle but
+ * another CPU has two tasks running.
+ *
+ * We wake up one of the migration threads (it
+ * doesnt matter which one) and let it fix things up:
+ */
+ if (!cpu_active_balance(this_cpu)) {
+ cpu_active_balance(this_cpu) = 1;
+ spin_unlock(&this_rq->lock);
+ wake_up_process(rq->cpu[0].migration_thread);
+ spin_lock(&this_rq->lock);
+ }
+next_cpu:
+ }
+}
+
+static void do_active_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ spin_unlock(&this_rq->lock);
+
+ cpu_active_balance(this_cpu) = 0;
+
+ /*
+ * Is the imbalance still present?
+ */
+ for_each_sibling(idx, this_rq)
+ if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr)
+ goto out;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+
+ /* completely idle CPU? */
+ if (rq->nr_running)
+ continue;
+
+ /*
+ * At this point it's reasonably sure that we have an
+ * imbalance. Since we are the migration thread, try to
+ * balance a thread over to the target queue.
+ */
+ spin_lock(&rq->lock);
+ load_balance(rq, i, 1);
+ spin_unlock(&rq->lock);
+ goto out;
+ }
+out:
+ spin_lock(&this_rq->lock);
+}
+
/*
- * One of the idle_cpu_tick() and busy_cpu_tick() functions will
- * get called every timer tick, on every CPU. Our balancing action
- * frequency and balancing agressivity depends on whether the CPU is
- * idle or not.
+ * This routine is called to map a CPU into another CPU's runqueue.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
- * systems with HZ=100, every 10 msecs.)
+ * This must be called during bootup with the merged runqueue having
+ * no tasks.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
-#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+void sched_map_runqueue(int cpu1, int cpu2)
+{
+ runqueue_t *rq1 = cpu_rq(cpu1);
+ runqueue_t *rq2 = cpu_rq(cpu2);
+ int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx;
+
+ printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id());
+ if (rq1 == rq2)
+ BUG();
+ if (rq2->nr_running)
+ BUG();
+ /*
+ * At this point, we dont have anything in the runqueue yet. So,
+ * there is no need to move processes between the runqueues.
+ * Only, the idle processes should be combined and accessed
+ * properly.
+ */
+ cpu2_idx = rq1->nr_cpus++;
-static inline void idle_tick(runqueue_t *rq)
+ if (rq_idx(cpu1) != cpu1)
+ BUG();
+ rq_idx(cpu2) = cpu1;
+ cpu_idx(cpu2) = cpu2_idx;
+ rq1->cpu[cpu2_idx].cpu = cpu2;
+ rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle;
+ rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr;
+ INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue);
+
+ /* just to be safe: */
+ rq2->cpu[cpu2_idx_orig].idle = NULL;
+ rq2->cpu[cpu2_idx_orig].curr = NULL;
+}
+#endif
+#endif
+
+DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
+
+static inline void idle_tick(runqueue_t *rq, unsigned long j)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ if (j % IDLE_REBALANCE_TICK)
return;
spin_lock(&rq->lock);
- load_balance(rq, 1);
+ load_balance(rq, smp_processor_id(), 1);
spin_unlock(&rq->lock);
}
-#endif
-
-DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
-
/*
* We place interactive tasks back into the active array, if possible.
*
@@ -1035,9 +1263,9 @@
* increasing number of running tasks:
*/
#define EXPIRED_STARVING(rq) \
- ((rq)->expired_timestamp && \
+ (STARVATION_LIMIT && ((rq)->expired_timestamp && \
(jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1))
+ STARVATION_LIMIT * ((rq)->nr_running) + 1)))
/*
* This function gets called by the timer code, with HZ frequency.
@@ -1050,12 +1278,13 @@
{
int cpu = smp_processor_id();
runqueue_t *rq = this_rq();
+ unsigned long j = jiffies;
task_t *p = current;
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);
- if (p == rq->idle) {
+ if (p == cpu_idle_ptr(cpu)) {
/* note: this timer irq context must be accounted for as well */
if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1063,9 +1292,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ idle_tick(rq, j);
return;
}
if (TASK_NICE(p) > 0)
@@ -1074,12 +1301,13 @@
kstat_cpu(cpu).cpustat.user += user_ticks;
kstat_cpu(cpu).cpustat.system += sys_ticks;
+ spin_lock(&rq->lock);
/* Task might have expired already, but not scheduled off yet */
if (p->array != rq->active) {
set_tsk_need_resched(p);
+ spin_unlock(&rq->lock);
return;
}
- spin_lock(&rq->lock);
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
@@ -1115,16 +1343,14 @@
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
+ rq->expired_timestamp = j;
enqueue_task(p, rq->expired);
} else
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
+ if (!(j % BUSY_REBALANCE_TICK))
+ load_balance(rq, smp_processor_id(), 0);
spin_unlock(&rq->lock);
}
@@ -1135,11 +1361,11 @@
*/
asmlinkage void schedule(void)
{
+ int idx, this_cpu, retry = 0;
+ struct list_head *queue;
task_t *prev, *next;
- runqueue_t *rq;
prio_array_t *array;
- struct list_head *queue;
- int idx;
+ runqueue_t *rq;
/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -1152,15 +1378,15 @@
dump_stack();
}
}
-
- check_highmem_ptes();
need_resched:
+ check_highmem_ptes();
+ this_cpu = smp_processor_id();
preempt_disable();
prev = current;
rq = this_rq();
release_kernel_lock(prev);
- prev->sleep_timestamp = jiffies;
+ prev->last_run = jiffies;
spin_lock_irq(&rq->lock);
/*
@@ -1183,12 +1409,14 @@
}
pick_next_task:
if (unlikely(!rq->nr_running)) {
-#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, this_cpu, 1);
if (rq->nr_running)
goto pick_next_task;
-#endif
- next = rq->idle;
+ active_load_balance(rq, this_cpu);
+ if (rq->nr_running)
+ goto pick_next_task;
+pick_idle:
+ next = cpu_idle_ptr(this_cpu);
rq->expired_timestamp = 0;
goto switch_tasks;
}
@@ -1204,24 +1432,59 @@
rq->expired_timestamp = 0;
}
+new_array:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
+ if ((next != prev) && (rq_nr_cpus(rq) > 1)) {
+ struct list_head *tmp = queue->next;
+
+ while (task_running(next) || !task_allowed(next, this_cpu)) {
+ tmp = tmp->next;
+ if (tmp != queue) {
+ next = list_entry(tmp, task_t, run_list);
+ continue;
+ }
+ idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+ if (idx == MAX_PRIO) {
+ if (retry || !rq->expired->nr_active)
+ goto pick_idle;
+ /*
+ * To avoid infinite changing of arrays,
+ * when we have only tasks runnable by
+ * sibling.
+ */
+ retry = 1;
+
+ array = rq->expired;
+ goto new_array;
+ }
+ queue = array->queue + idx;
+ tmp = queue->next;
+ next = list_entry(tmp, task_t, run_list);
+ }
+ }
switch_tasks:
prefetch(next);
clear_tsk_need_resched(prev);
- RCU_qsctr(prev->thread_info->cpu)++;
+ RCU_qsctr(task_cpu(prev))++;
if (likely(prev != next)) {
+ struct mm_struct *prev_mm;
rq->nr_switches++;
- rq->curr = next;
+ cpu_curr_ptr(this_cpu) = next;
+ set_task_cpu(next, this_cpu);
prepare_arch_switch(rq, next);
- prev = context_switch(prev, next);
+ prev = context_switch(rq, prev, next);
barrier();
rq = this_rq();
+ prev_mm = rq->prev_mm;
+ rq->prev_mm = NULL;
finish_arch_switch(rq, prev);
+ if (prev_mm)
+ mmdrop(prev_mm);
} else
spin_unlock_irq(&rq->lock);
@@ -1481,9 +1744,8 @@
* If the task is running and lowered its priority,
* or increased its priority then reschedule its CPU:
*/
- if ((NICE_TO_PRIO(nice) < p->static_prio) ||
- task_running(rq, p))
- resched_task(rq->curr);
+ if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p))
+ resched_task(cpu_curr_ptr(task_cpu(p)));
}
out_unlock:
task_rq_unlock(rq, &flags);
@@ -1561,7 +1823,7 @@
*/
int task_curr(task_t *p)
{
- return cpu_curr(task_cpu(p)) == p;
+ return cpu_curr_ptr(task_cpu(p)) == p;
}
/**
@@ -1570,7 +1832,7 @@
*/
int idle_cpu(int cpu)
{
- return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+ return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu);
}
/**
@@ -1660,7 +1922,7 @@
else
p->prio = p->static_prio;
if (array)
- activate_task(p, task_rq(p));
+ __activate_task(p, task_rq(p));
out_unlock:
task_rq_unlock(rq, &flags);
@@ -2135,7 +2397,7 @@
local_irq_save(flags);
double_rq_lock(idle_rq, rq);
- idle_rq->curr = idle_rq->idle = idle;
+ cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle;
deactivate_task(idle, rq);
idle->array = NULL;
idle->prio = MAX_PRIO;
@@ -2190,6 +2452,7 @@
unsigned long flags;
migration_req_t req;
runqueue_t *rq;
+ int cpu;
#if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */
new_mask &= cpu_online_map;
@@ -2211,31 +2474,31 @@
* If the task is not on a runqueue (and not running), then
* it is sufficient to simply update the task's cpu field.
*/
- if (!p->array && !task_running(rq, p)) {
+ if (!p->array && !task_running(p)) {
set_task_cpu(p, __ffs(p->cpus_allowed));
task_rq_unlock(rq, &flags);
return;
}
init_completion(&req.done);
req.task = p;
- list_add(&req.list, &rq->migration_queue);
+ cpu = task_cpu(p);
+ list_add(&req.list, migration_queue(cpu));
task_rq_unlock(rq, &flags);
-
- wake_up_process(rq->migration_thread);
+ wake_up_process(migration_thread(cpu));
wait_for_completion(&req.done);
}
/*
- * migration_thread - this is a highprio system thread that performs
+ * migration_task - this is a highprio system thread that performs
* thread migration by 'pulling' threads into the target runqueue.
*/
-static int migration_thread(void * data)
+static int migration_task(void * data)
{
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
int cpu = (long) data;
runqueue_t *rq;
- int ret;
+ int ret, idx;
daemonize();
sigfillset(&current->blocked);
@@ -2250,7 +2513,8 @@
ret = setscheduler(0, SCHED_FIFO, &param);
rq = this_rq();
- rq->migration_thread = current;
+ migration_thread(cpu) = current;
+ idx = cpu_idx(cpu);
sprintf(current->comm, "migration/%d", smp_processor_id());
@@ -2263,7 +2527,9 @@
task_t *p;
spin_lock_irqsave(&rq->lock, flags);
- head = &rq->migration_queue;
+ if (cpu_active_balance(cpu))
+ do_active_balance(rq, cpu);
+ head = migration_queue(cpu);
current->state = TASK_INTERRUPTIBLE;
if (list_empty(head)) {
spin_unlock_irqrestore(&rq->lock, flags);
@@ -2292,9 +2558,8 @@
set_task_cpu(p, cpu_dest);
if (p->array) {
deactivate_task(p, rq_src);
- activate_task(p, rq_dest);
- if (p->prio < rq_dest->curr->prio)
- resched_task(rq_dest->curr);
+ __activate_task(p, rq_dest);
+ wake_up_cpu(rq_dest, cpu_dest, p);
}
}
double_rq_unlock(rq_src, rq_dest);
@@ -2312,12 +2577,13 @@
unsigned long action,
void *hcpu)
{
+ long cpu = (long) hcpu;
+
switch (action) {
case CPU_ONLINE:
- printk("Starting migration thread for cpu %li\n",
- (long)hcpu);
- kernel_thread(migration_thread, hcpu, CLONE_KERNEL);
- while (!cpu_rq((long)hcpu)->migration_thread)
+ printk("Starting migration thread for cpu %li\n", cpu);
+ kernel_thread(migration_task, hcpu, CLONE_KERNEL);
+ while (!migration_thread(cpu))
yield();
break;
}
@@ -2392,11 +2658,20 @@
for (i = 0; i < NR_CPUS; i++) {
prio_array_t *array;
+ /*
+ * Start with a 1:1 mapping between CPUs and runqueues:
+ */
+#if CONFIG_SHARE_RUNQUEUE
+ rq_idx(i) = i;
+ cpu_idx(i) = 0;
+#endif
rq = cpu_rq(i);
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- INIT_LIST_HEAD(&rq->migration_queue);
+ INIT_LIST_HEAD(migration_queue(i));
+ rq->nr_cpus = 1;
+ rq->cpu[cpu_idx(i)].cpu = i;
atomic_set(&rq->nr_iowait, 0);
nr_running_init(rq);
@@ -2414,9 +2689,13 @@
* We have to do a little magic to get the first
* thread right in SMP mode.
*/
- rq = this_rq();
- rq->curr = current;
- rq->idle = current;
+ cpu_curr_ptr(smp_processor_id()) = current;
+ cpu_idle_ptr(smp_processor_id()) = current;
+ printk("sched_init().\n");
+ printk("smp_processor_id(): %d.\n", smp_processor_id());
+ printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));
+ printk("this_rq(): %p.\n", this_rq());
+
set_task_cpu(current, smp_processor_id());
wake_up_process(current);
--- linux/init/main.c.orig 2003-01-20 19:29:09.000000000 +0100
+++ linux/init/main.c 2003-01-20 22:58:36.000000000 +0100
@@ -354,7 +354,14 @@
static void rest_init(void)
{
+ /*
+ * We count on the initial thread going ok
+ * Like idlers init is an unlocked kernel thread, which will
+ * make syscalls (and thus be locked).
+ */
+ init_idle(current, smp_processor_id());
kernel_thread(init, NULL, CLONE_KERNEL);
+
unlock_kernel();
cpu_idle();
}
@@ -438,13 +445,6 @@
check_bugs();
printk("POSIX conformance testing by UNIFIX\n");
- /*
- * We count on the initial thread going ok
- * Like idlers init is an unlocked kernel thread, which will
- * make syscalls (and thus be locked).
- */
- init_idle(current, smp_processor_id());
-
/* Do the rest non-__init'ed, we're now alive */
rest_init();
}
Ingo wrote:
>--- linux/fs/pipe.c.orig 2003-01-20 19:28:43.000000000 +0100
>+++ linux/fs/pipe.c 2003-01-20 22:58:35.000000000 +0100
>@@ -117,7 +117,7 @@
> up(PIPE_SEM(*inode));
> /* Signal writers asynchronously that there is more room. */
> if (do_wakeup) {
>- wake_up_interruptible(PIPE_WAIT(*inode));
>+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
> kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
> }
> if (ret > 0)
>
What's the purpose of this change?
I thought that the _sync functions should be called if it's guaranteed
that schedule() will be called immediately, i.e. if the scheduler should
not rebalance.
You've added _sync() to the codepaths that lead to the end of the syscall.
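For reference, a minimal sketch of that rule, with generic names rather
than anything from pipe.c: the _sync variant is only appropriate when the
waker itself is about to block, e.g.

	static void send_and_wait(wait_queue_head_t *req_wq,
				  wait_queue_head_t *reply_wq)
	{
		DECLARE_WAITQUEUE(wait, current);

		add_wait_queue(reply_wq, &wait);
		set_current_state(TASK_INTERRUPTIBLE);
		/*
		 * We call schedule() right below, so the sync hint is
		 * honest: no preemption or rebalancing is needed for
		 * the task we just woke.
		 */
		wake_up_interruptible_sync(req_wq);
		schedule();
		__set_current_state(TASK_RUNNING);
		remove_wait_queue(reply_wq, &wait);
	}

whereas a waker that keeps running to the end of the syscall would stick
with plain wake_up_interruptible().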
--
Manfred
On Mon, 20 Jan 2003, Manfred Spraul wrote:
> > if (do_wakeup) {
> >- wake_up_interruptible(PIPE_WAIT(*inode));
> >+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
> What's the purpose of this change?
It was a quick experiment i forgot to remove from my sources, will fix it
in the next patch.
Ingo
Ingo Molnar <[email protected]> wrote:
>
>
> the attached patch (against 2.5.59) is my current scheduler tree, it
> includes two main areas of changes:
>
> - interactivity improvements, mostly reworked bits from Andrea's tree and
> various tunings.
>
Thanks for doing this. Initial testing with one workload which is extremely
bad with 2.5.59: huge improvement.
The workload is:
workstation> ssh laptop
laptop> setenv DISPLAY workstation:0
laptop> make -j0 bzImage&
laptop> some-x-application &
For some reason, X-across-ethernet performs terribly when there's a kernel
compile on the client machine - lots of half-second lags.
All gone now.
wrt this:
if (SMART_WAKE_CHILD) {
if (unlikely(!current->array))
__activate_task(p, rq);
else {
p->prio = current->prio;
list_add_tail(&p->run_list, &current->run_list);
p->array = current->array;
p->array->nr_active++;
nr_running_inc(rq);
}
for some reason I decided that RT tasks need special handling here. I
forget why though ;) Please double-check that.
--- 25/kernel/sched.c~rcf-simple Thu Dec 26 02:34:11 2002
+++ 25-akpm/kernel/sched.c Thu Dec 26 02:34:40 2002
@@ -452,6 +452,8 @@ int wake_up_process(task_t * p)
void wake_up_forked_process(task_t * p)
{
runqueue_t *rq = this_rq_lock();
+ struct task_struct *this_task = current;
+ prio_array_t *array = this_task->array;
p->state = TASK_RUNNING;
if (!rt_task(p)) {
@@ -467,6 +469,19 @@ void wake_up_forked_process(task_t * p)
set_task_cpu(p, smp_processor_id());
activate_task(p, rq);
+ /*
+ * Take caller off the head of the runqueue, so child will run first.
+ */
+ if (array) {
+ if (!rt_task(current)) {
+ dequeue_task(this_task, array);
+ enqueue_task(this_task, array);
+ } else {
+ list_del(&this_task->run_list);
+ list_add_tail(&this_task->run_list,
+ array->queue + this_task->prio);
+ }
+ }
rq_unlock(rq);
}
On Mon, 2003-01-20 at 14:28, Andrew Morton wrote:
> Ingo Molnar <[email protected]> wrote:
> >
> >
> > the attached patch (against 2.5.59) is my current scheduler tree, it
> > includes two main areas of changes:
> >
> > - interactivity improvements, mostly reworked bits from Andrea's tree and
> > various tunings.
> >
>
> Thanks for doing this. Initial testing with one workload which is extremely
> bad with 2.5.59: huge improvement.
>
Initial testing on my NUMA box looks good. So far, I only have
kernbench numbers, but the system time shows a nice decrease,
and the CPU % busy time has gone up.
Kernbench:
Elapsed User System CPU
ingoD7-59 28.944s 285.008s 79.998s 1260.8%
stock59 29.668s 283.762s 82.286s 1233%
I'll try to get more testing in over the next few days.
There was one minor build error introduced by the patch:
#define NODE_THRESHOLD 125
was removed, but the NUMA code still uses it.
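The trivial fix (just a sketch of the obvious way out, not necessarily what
the next revision does) is to put the constant back next to the other
tunables in kernel/sched.c:

	/* a remote node must carry at least 125% of our load before we steal */
	#define NODE_THRESHOLD		125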
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
On Monday 20 January 2003 17:56, Ingo Molnar wrote:
> On Mon, 20 Jan 2003, Erich Focht wrote:
> > Could you please explain your idea? As far as I understand, the SMP
> > balancer (pre-NUMA) tries a global rebalance at each call. Maybe you
> > mean something different...
>
> yes, but eg. in the idle-rebalance case we are more agressive at moving
> tasks across SMP CPUs. We could perhaps do a similar ->nr_balanced logic
> to do this 'agressive' balancing even if not triggered from the
> CPU-will-be-idle path. Ie. _perhaps_ the SMP balancer could become a bit
> more agressive.
Do you mean: make the SMP balancer more aggressive by lowering the
125% threshold?
> ie. SMP is just the first level in the cache-hierarchy, NUMA is the second
> level. (lets hope we dont have to deal with a third caching level anytime
> soon - although that could as well happen once SMT CPUs start doing NUMA.)
> There's no real reason to do balancing in a different way on each level -
> the weight might be different, but the core logic should be synced up.
> (one thing that is indeed different for the NUMA step is locality of
> uncached memory.)
We have an IA64 2-level node hierarchy machine with 32 CPUs (NEC
TX7). In the "old" node affine scheduler patch the multilevel feature
was implemented by using different cross-node steal delays (longer if
the node is further
away). In the current approach we could just add another counter, such
that we call the cross-supernode balancer only if the intra-supernode
balancer failed a few times. No idea whether this helps...
> > Yes! Actually the currently implemented nr_balanced logic is pretty
> > dumb: the counter reaches the cross-node balance threshold after a
> > certain number of calls to intra-node lb, no matter whether these were
> > successfull or not. I'd like to try incrementing the counter only on
> > unsuccessfull load balances, this would give a clear priority to
> > intra-node balancing and a clear and controllable delay for cross-node
> > balancing. A tiny patch for this (for 2.5.59) is attached. As the name
> > nr_balanced would be misleading for this kind of usage, I renamed it to
> > nr_lb_failed.
>
> indeed this approach makes much more sense than the simple ->nr_balanced
> counter. A similar approach makes sense on the SMP level as well: if the
> current 'busy' rebalancer fails to get a new task, we can try the current
> 'idle' rebalancer. Ie. a CPU going idle would do the less intrusive
> rebalancing first.
>
> have you experimented with making the counter limit == 1 actually? Ie.
> immediately trying to do a global balancing once the less intrusive
> balancing fails?
Didn't have time to try and probably won't be able to check this
before the beginning of next week :-( .
Regards,
Erich
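To make the failure-counter idea above concrete, a rough sketch (the helper
names and their return convention are assumptions, not the attached patch;
nr_lb_failed replaces the per-runqueue nr_balanced counter mentioned above):

	#define NODE_REBALANCE_FAIL_MAX	4	/* 1 == escalate as soon as the gentle pass fails */

	static void balance_tick(runqueue_t *this_rq, int this_cpu, int idle)
	{
		/* hypothetical helpers, both return nonzero if a task was pulled */
		if (load_balance_within_node(this_rq, this_cpu, idle)) {
			this_rq->nr_lb_failed = 0;
			return;
		}
		/* count only the attempts that came up empty */
		if (++this_rq->nr_lb_failed >= NODE_REBALANCE_FAIL_MAX) {
			this_rq->nr_lb_failed = 0;
			load_balance_across_nodes(this_rq, this_cpu, idle);
		}
	}

Setting the limit to 1 gives exactly the behaviour Ingo asks about: try the
more intrusive balancer as soon as the less intrusive one has failed.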
On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
>
> the attached patch (against 2.5.59) is my current scheduler tree, it
> includes two main areas of changes:
>
> - interactivity improvements, mostly reworked bits from Andrea's tree and
> various tunings.
>
> - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> passive load balancing, active-balancing, HT-aware task pickup,
> HT-aware affinity and HT-aware wakeup.
I ran Erich's numatest on a system with this patch, plus the
cputime_stats patch (so that we would get meaningful numbers),
and found a problem. It appears that on the lightly loaded system
sched_best_cpu is now loading up one node before moving on to the
next. Once the system is loaded (i.e., a process per cpu) things
even out. Before applying the D7 patch, processes were distributed
evenly across nodes, even in low load situations.
The numbers below on schedbench show that on 4 and 8 process runs,
the D7 patch gets the same average user time as on the 16 and greater
runs. However, without the D7 patch, the 4 and 8 process runs tend
to have significantly decreased average user time.
Below I'm including the summarized output, and then the detailed
output for the relevant runs on both D7 patched systems and stock.
Overall performance is improved with the D7 patch, so I would like
to find out and fix what went wrong on the light load cases and
encourage the adoption of the D7 patch (or at least the parts
that make the NUMA scheduler even faster). I'm not likely to have
time to chase this down for the next few days, so am posting
results to see if anyone else can find the cause.
kernels:
* stock59-stats = stock 2.5.59 with the cputime_stats patch
* ingoD7-59.stats = testD7-59 = stock59-stats + Ingo's D7 patch
Kernbench:
Elapsed User System CPU
testD7-59 28.96s 285.314s 79.616s 1260.6%
ingoD7-59.stats 28.824s 284.834s 79.164s 1263.6%
stock59-stats 29.498s 283.976s 83.05s 1243.8%
Schedbench 4:
AvgUser Elapsed TotalUser TotalSys
testD7-59 53.19 53.43 212.81 0.59
ingoD7-59.stats 44.77 46.52 179.10 0.78
stock59-stats 22.25 35.94 89.06 0.81
Schedbench 8:
AvgUser Elapsed TotalUser TotalSys
testD7-59 53.22 53.66 425.81 1.40
ingoD7-59.stats 39.44 47.15 315.62 1.62
stock59-stats 28.40 42.25 227.26 1.67
Schedbench 16:
AvgUser Elapsed TotalUser TotalSys
testD7-59 52.84 58.26 845.49 2.78
ingoD7-59.stats 52.85 57.31 845.68 3.29
stock59-stats 52.97 57.19 847.70 3.29
Schedbench 32:
AvgUser Elapsed TotalUser TotalSys
testD7-59 56.77 122.51 1816.80 7.58
ingoD7-59.stats 56.54 125.79 1809.67 6.97
stock59-stats 56.57 118.05 1810.53 5.97
Schedbench 64:
AvgUser Elapsed TotalUser TotalSys
testD7-59 57.52 234.27 3681.86 18.18
ingoD7-59.stats 58.25 242.61 3728.46 17.40
stock59-stats 56.75 234.12 3632.72 15.70
Detailed stats from running numatest with 4 processes on the D7 patch.
Note how most of the load is put on node 0.
Executing 4 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 5.039
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 46.27
2 58.4 0.0 0.0 41.6 | 0 0 | 41.18
3 100.0 0.0 0.0 0.0 | 0 0 | 45.72
4 100.0 0.0 0.0 0.0 | 0 0 | 45.89
AverageUserTime 44.77 seconds
ElapsedTime 46.52
TotalUserTime 179.10
TotalSysTime 0.78
Detailed stats from running numatest with 8 processes on the D7 patch.
In this one it appears that node 0 was loaded, then node 1.
Executing 8 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 11.185
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 100.0 0.0 0.0 0.0 | 0 0 | 46.89
2 100.0 0.0 0.0 0.0 | 0 0 | 46.20
3 100.0 0.0 0.0 0.0 | 0 0 | 46.31
4 0.0 100.0 0.0 0.0 | 1 1 | 39.44
5 0.0 0.0 99.9 0.0 | 2 2 | 16.00
6 62.6 0.0 0.0 37.4 | 0 0 | 42.23
7 0.0 100.0 0.0 0.0 | 1 1 | 39.12
8 0.0 100.0 0.0 0.0 | 1 1 | 39.35
AverageUserTime 39.44 seconds
ElapsedTime 47.15
TotalUserTime 315.62
TotalSysTime 1.62
Control run - detailed stats running numatest with 4 processes
on a stock59 kernel.
Executing 4 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 8.297
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 99.8 0.0 0.0 | 1 1 | 16.63
2 100.0 0.0 0.0 0.0 | 0 0 | 27.83
3 0.0 0.0 99.9 0.0 | 2 2 | 16.27
4 100.0 0.0 0.0 0.0 | 0 0 | 28.29
AverageUserTime 22.25 seconds
ElapsedTime 35.94
TotalUserTime 89.06
TotalSysTime 0.81
Control run - detailed stats running numatest with 8 processes
on a stock59 kernel.
Executing 8 times: ./randupdt 1000000
Running 'hackbench 20' in the background: Time: 9.458
Job node00 node01 node02 node03 | iSched MSched | UserTime(s)
1 0.0 99.9 0.0 0.0 | 1 1 | 27.77
2 100.0 0.0 0.0 0.0 | 0 0 | 29.34
3 0.0 0.0 100.0 0.0 | 2 2 | 28.03
4 0.0 0.0 0.0 100.0 | 3 3 | 24.15
5 13.1 0.0 0.0 86.9 | 0 3 *| 33.36
6 0.0 100.0 0.0 0.0 | 1 1 | 27.94
7 0.0 0.0 100.0 0.0 | 2 2 | 28.02
8 100.0 0.0 0.0 0.0 | 0 0 | 28.58
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
> On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
> >
> > the attached patch (against 2.5.59) is my current scheduler tree, it
> > includes two main areas of changes:
> >
> > - interactivity improvements, mostly reworked bits from Andrea's tree and
> > various tunings.
> >
> > - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> > passive load balancing, active-balancing, HT-aware task pickup,
> > HT-aware affinity and HT-aware wakeup.
>
> I ran Erich's numatest on a system with this patch, plus the
> cputime_stats patch (so that we would get meaningful numbers),
> and found a problem. It appears that on the lightly loaded system
> sched_best_cpu is now loading up one node before moving on to the
> next. Once the system is loaded (i.e., a process per cpu) things
> even out. Before applying the D7 patch, processes were distributed
> evenly across nodes, even in low load situations.
Michael, my experience has been that 2.5.59 loaded up the first node before
distributing out tasks (at least on kernbench). The first check in
sched_best_cpu would almost always place the new task on the same cpu, and
intra-node balance on an idle cpu in the same node would almost always steal
it before an inter-node balance could steal it. Also, sched_best_cpu does
not appear to be changed in D7. Actually, I expected D7 to have the
opposite effect you describe (although I have not tried it yet), since
load_balance will now steal a running task if called by an idle cpu.
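For reference, the check being referred to is roughly of this shape
(reconstructed from memory, so treat the exact threshold and surrounding
code as an assumption rather than the literal 2.5.59 source):

	static int sched_best_cpu(struct task_struct *p)
	{
		int best_cpu = task_cpu(p);

		/* a freshly forked task stays on the forking CPU unless
		 * that CPU already carries a couple of runnable tasks */
		if (cpu_rq(best_cpu)->nr_running <= 2)
			return best_cpu;

		/* ... otherwise fall back to the least-loaded node/CPU search ... */
		return best_cpu;
	}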
I'll try to get some of these tests on x440 asap to compare.
-Andrew Theurer
> Michael, my experience has been that 2.5.59 loaded up the first node before
> distributing out tasks (at least on kernbench).
Could you throw a printk in map_cpu_to_node and check which cpus come out
on which nodes?
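Something along these lines would do (the variable names are placeholders,
depending on how that arch helper is actually written):

	printk(KERN_DEBUG "map_cpu_to_node: cpu %d -> node %d\n", cpu, node);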
> I'll try to get some of these tests on x440 asap to compare.
So what are you running on now? the HT stuff?
M.
On Wednesday 22 January 2003 10:17, Martin J. Bligh wrote:
> > Michael, my experience has been that 2.5.59 loaded up the first node
> > before distributing out tasks (at least on kernbench).
>
> Could you throw a printk in map_cpu_to_node and check which cpus come out
> on which nodes?
>
> > I'll try to get some of these tests on x440 asap to compare.
>
> So what are you running on now? the HT stuff?
I am running nothing right now, mainly because I don't have access to that HT
system anymore. HT-numa may work better than it did with the new load_balance
idle policy, but I'm not sure it's worth pursuing with Ingo's HT patch. Let
me know if you think it's worth investigating.
-Andrew Theurer
On Wed, 2003-01-22 at 08:41, Andrew Theurer wrote:
> > On Mon, 2003-01-20 at 13:18, Ingo Molnar wrote:
> > >
> > > the attached patch (against 2.5.59) is my current scheduler tree, it
> > > includes two main areas of changes:
> > >
> > > - interactivity improvements, mostly reworked bits from Andrea's tree and
> > > various tunings.
> > >
> > > - HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
> > > passive load balancing, active-balancing, HT-aware task pickup,
> > > HT-aware affinity and HT-aware wakeup.
> >
> > I ran Erich's numatest on a system with this patch, plus the
> > cputime_stats patch (so that we would get meaningful numbers),
> > and found a problem. It appears that on the lightly loaded system
> > sched_best_cpu is now loading up one node before moving on to the
> > next. Once the system is loaded (i.e., a process per cpu) things
> > even out. Before applying the D7 patch, processes were distributed
> > evenly across nodes, even in low load situations.
>
> Michael, my experience has been that 2.5.59 loaded up the first node before
> distributing out tasks (at least on kernbench).
Well, the data I posted doesn't support that conclusion - it showed at
most two processes on the first node before moving to the next node
for 2.5.59, but for the D7 patched system, the current node was fully
loaded before putting processes on other nodes. I've repeated this on
multiple runs and obtained similar results.
> The first check in
> sched_best_cpu would almost always place the new task on the same cpu, and
> intra-node balance on an idle cpu in the same node would almost always steal
> it before an inter-node balance could steal it. Also, sched_best_cpu does
> not appear to be changed in D7.
That is true, and is the only thing I've had a chance to look at.
sched_best_cpu depends on data collected elsewhere, so my suspicion
is that it is working with bad data. I'll try to find time this week
to look further at it.
> Actually, I expected D7 to have the
> opposite effect you describe (although I have not tried it yet), since
> load_balance will now steal a running task if called by an idle cpu.
>
> I'll try to get some of these tests on x440 asap to compare.
I'm interested in seeing these results. Any chance of getting time on
a 4-node x440?
>
> -Andrew Theurer
>
>
--
Michael Hohnbaum 503-578-5486
[email protected] T/L 775-5486
the attached patch (against 2.5.59 or BK-curr) has two HT-scheduler fixes
over the -D7 patch:
- HT-aware task pickup did not do the right thing in an important
special-case: when there are 2 tasks running in a 2-CPU runqueue and
one of the tasks reschedules but still stays runnable. In this case,
instead of running the previous task again, the scheduler incorrectly
picked the idle task, which in turn reduced performance.
- active load-balancing only worked in a limited way, due to an indexing
bug; now it properly active-balances on all CPUs.
changes in -D7 over 2.5.59:
- interactivity improvements, mostly reworked bits from Andrea's tree
and various tunings.
- HT scheduler: 'shared runqueue' concept plus related logic: HT-aware
passive load balancing, active-balancing, HT-aware task pickup,
HT-aware affinity and HT-aware wakeup.
the sched-2.5.59-E2 patch can also be downloaded from:
http://redhat.com/~mingo/O(1)-scheduler/
Ingo
--- linux/arch/i386/kernel/cpu/proc.c.orig
+++ linux/arch/i386/kernel/cpu/proc.c
@@ -1,4 +1,5 @@
#include <linux/smp.h>
+#include <linux/sched.h>
#include <linux/timex.h>
#include <linux/string.h>
#include <asm/semaphore.h>
@@ -101,6 +102,13 @@
fpu_exception ? "yes" : "no",
c->cpuid_level,
c->wp_works_ok ? "yes" : "no");
+#if CONFIG_SHARE_RUNQUEUE
+{
+ extern long __rq_idx[NR_CPUS];
+
+ seq_printf(m, "\nrunqueue\t: %d\n", __rq_idx[n]);
+}
+#endif
for ( i = 0 ; i < 32*NCAPINTS ; i++ )
if ( test_bit(i, c->x86_capability) &&
--- linux/arch/i386/kernel/smpboot.c.orig
+++ linux/arch/i386/kernel/smpboot.c
@@ -38,6 +38,7 @@
#include <linux/kernel.h>
#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/kernel_stat.h>
#include <linux/smp_lock.h>
#include <linux/irq.h>
@@ -945,6 +946,16 @@
int cpu_sibling_map[NR_CPUS] __cacheline_aligned;
+static int test_ht;
+
+static int __init ht_setup(char *str)
+{
+ test_ht = 1;
+ return 1;
+}
+
+__setup("test_ht", ht_setup);
+
static void __init smp_boot_cpus(unsigned int max_cpus)
{
int apicid, cpu, bit;
@@ -1087,16 +1098,31 @@
Dprintk("Boot done.\n");
/*
- * If Hyper-Threading is avaialble, construct cpu_sibling_map[], so
+ * Here we can be sure that there is an IO-APIC in the system. Let's
+ * go and set it up:
+ */
+ smpboot_setup_io_apic();
+
+ setup_boot_APIC_clock();
+
+ /*
+ * Synchronize the TSC with the AP
+ */
+ if (cpu_has_tsc && cpucount)
+ synchronize_tsc_bp();
+ /*
+ * If Hyper-Threading is available, construct cpu_sibling_map[], so
* that we can tell the sibling CPU efficiently.
*/
+printk("cpu_has_ht: %d, smp_num_siblings: %d, num_online_cpus(): %d.\n", cpu_has_ht, smp_num_siblings, num_online_cpus());
if (cpu_has_ht && smp_num_siblings > 1) {
for (cpu = 0; cpu < NR_CPUS; cpu++)
cpu_sibling_map[cpu] = NO_PROC_ID;
for (cpu = 0; cpu < NR_CPUS; cpu++) {
int i;
- if (!test_bit(cpu, &cpu_callout_map)) continue;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;
for (i = 0; i < NR_CPUS; i++) {
if (i == cpu || !test_bit(i, &cpu_callout_map))
@@ -1112,17 +1138,41 @@
printk(KERN_WARNING "WARNING: No sibling found for CPU %d.\n", cpu);
}
}
- }
-
- smpboot_setup_io_apic();
-
- setup_boot_APIC_clock();
+#if CONFIG_SHARE_RUNQUEUE
+ /*
+ * At this point APs would have synchronised TSC and
+ * waiting for smp_commenced, with their APIC timer
+ * disabled. So BP can go ahead do some initialization
+ * for Hyper-Threading (if needed).
+ */
+ for (cpu = 0; cpu < NR_CPUS; cpu++) {
+ int i;
+ if (!test_bit(cpu, &cpu_callout_map))
+ continue;
+ for (i = 0; i < NR_CPUS; i++) {
+ if (i <= cpu)
+ continue;
+ if (!test_bit(i, &cpu_callout_map))
+ continue;
- /*
- * Synchronize the TSC with the AP
- */
- if (cpu_has_tsc && cpucount)
- synchronize_tsc_bp();
+ if (phys_proc_id[cpu] != phys_proc_id[i])
+ continue;
+ /*
+ * merge runqueues, resulting in one
+ * runqueue per package:
+ */
+ sched_map_runqueue(cpu, i);
+ break;
+ }
+ }
+#endif
+ }
+#if CONFIG_SHARE_RUNQUEUE
+ if (smp_num_siblings == 1 && test_ht) {
+ printk("Simulating a 2-sibling 1-phys-CPU HT setup!\n");
+ sched_map_runqueue(0, 1);
+ }
+#endif
}
/* These are wrappers to interface to the new boot process. Someone
--- linux/arch/i386/Kconfig.orig
+++ linux/arch/i386/Kconfig
@@ -408,6 +408,24 @@
If you don't know what to do here, say N.
+choice
+
+ prompt "Hyperthreading Support"
+ depends on SMP
+ default NR_SIBLINGS_0
+
+config NR_SIBLINGS_0
+ bool "off"
+
+config NR_SIBLINGS_2
+ bool "2 siblings"
+
+config NR_SIBLINGS_4
+ bool "4 siblings"
+
+endchoice
+
+
config PREEMPT
bool "Preemptible Kernel"
help
--- linux/fs/pipe.c.orig
+++ linux/fs/pipe.c
@@ -117,7 +117,7 @@
up(PIPE_SEM(*inode));
/* Signal writers asynchronously that there is more room. */
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
if (ret > 0)
@@ -205,7 +205,7 @@
}
up(PIPE_SEM(*inode));
if (do_wakeup) {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
}
if (ret > 0) {
@@ -275,7 +275,7 @@
free_page((unsigned long) info->base);
kfree(info);
} else {
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
}
--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -147,6 +147,24 @@
extern void sched_init(void);
extern void init_idle(task_t *idle, int cpu);
+/*
+ * Is there a way to do this via Kconfig?
+ */
+#define CONFIG_NR_SIBLINGS 0
+
+#if CONFIG_NR_SIBLINGS_2
+# define CONFIG_NR_SIBLINGS 2
+#elif CONFIG_NR_SIBLINGS_4
+# define CONFIG_NR_SIBLINGS 4
+#endif
+
+#if CONFIG_NR_SIBLINGS
+# define CONFIG_SHARE_RUNQUEUE 1
+#else
+# define CONFIG_SHARE_RUNQUEUE 0
+#endif
+extern void sched_map_runqueue(int cpu1, int cpu2);
+
extern void show_state(void);
extern void show_trace(unsigned long *stack);
extern void show_stack(unsigned long *stack);
@@ -293,7 +311,7 @@
prio_array_t *array;
unsigned long sleep_avg;
- unsigned long sleep_timestamp;
+ unsigned long last_run;
unsigned long policy;
unsigned long cpus_allowed;
@@ -605,6 +623,8 @@
#define remove_parent(p) list_del_init(&(p)->sibling)
#define add_parent(p, parent) list_add_tail(&(p)->sibling,&(parent)->children)
+#if 1
+
#define REMOVE_LINKS(p) do { \
if (thread_group_leader(p)) \
list_del_init(&(p)->tasks); \
@@ -633,6 +653,31 @@
#define while_each_thread(g, t) \
while ((t = next_thread(t)) != g)
+#else
+
+#define REMOVE_LINKS(p) do { \
+ list_del_init(&(p)->tasks); \
+ remove_parent(p); \
+ } while (0)
+
+#define SET_LINKS(p) do { \
+ list_add_tail(&(p)->tasks,&init_task.tasks); \
+ add_parent(p, (p)->parent); \
+ } while (0)
+
+#define next_task(p) list_entry((p)->tasks.next, struct task_struct, tasks)
+#define prev_task(p) list_entry((p)->tasks.prev, struct task_struct, tasks)
+
+#define for_each_process(p) \
+ for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+
+#define do_each_thread(g, t) \
+ for (t = &init_task ; (t = next_task(t)) != &init_task ; )
+
+#define while_each_thread(g, t)
+
+#endif
+
extern task_t * FASTCALL(next_thread(task_t *p));
#define thread_group_leader(p) (p->pid == p->tgid)
--- linux/include/asm-i386/apic.h.orig
+++ linux/include/asm-i386/apic.h
@@ -98,4 +98,6 @@
#endif /* CONFIG_X86_LOCAL_APIC */
+extern int phys_proc_id[NR_CPUS];
+
#endif /* __ASM_APIC_H */
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -876,7 +876,7 @@
*/
p->first_time_slice = 1;
current->time_slice >>= 1;
- p->sleep_timestamp = jiffies;
+ p->last_run = jiffies;
if (!current->time_slice) {
/*
* This case is rare, it happens when the parent has only
--- linux/kernel/sys.c.orig
+++ linux/kernel/sys.c
@@ -220,7 +220,7 @@
if (error == -ESRCH)
error = 0;
- if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
+ if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))
error = -EACCES;
else
set_user_nice(p, niceval);
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -60,13 +60,15 @@
*/
#define MIN_TIMESLICE ( 10 * HZ / 1000)
#define MAX_TIMESLICE (300 * HZ / 1000)
-#define CHILD_PENALTY 95
+#define CHILD_PENALTY 50
#define PARENT_PENALTY 100
#define EXIT_WEIGHT 3
#define PRIO_BONUS_RATIO 25
#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (2*HZ)
-#define STARVATION_LIMIT (2*HZ)
+#define MAX_SLEEP_AVG (10*HZ)
+#define STARVATION_LIMIT (30*HZ)
+#define SYNC_WAKEUPS 1
+#define SMART_WAKE_CHILD 1
#define NODE_THRESHOLD 125
/*
@@ -141,6 +143,48 @@
};
/*
+ * It's possible for two CPUs to share the same runqueue.
+ * This makes sense if they eg. share caches.
+ *
+ * We take the common 1:1 (SMP, UP) case and optimize it,
+ * the rest goes via remapping: rq_idx(cpu) gives the
+ * runqueue on which a particular cpu is on, cpu_idx(cpu)
+ * gives the rq-specific index of the cpu.
+ *
+ * (Note that the generic scheduler code does not impose any
+ * restrictions on the mappings - there can be 4 CPUs per
+ * runqueue or even assymetric mappings.)
+ */
+#if CONFIG_SHARE_RUNQUEUE
+# define MAX_NR_SIBLINGS CONFIG_NR_SIBLINGS
+ long __rq_idx[NR_CPUS] __cacheline_aligned;
+ static long __cpu_idx[NR_CPUS] __cacheline_aligned;
+# define rq_idx(cpu) (__rq_idx[(cpu)])
+# define cpu_idx(cpu) (__cpu_idx[(cpu)])
+# define for_each_sibling(idx, rq) \
+ for ((idx) = 0; (idx) < (rq)->nr_cpus; (idx)++)
+# define rq_nr_cpus(rq) ((rq)->nr_cpus)
+# define cpu_active_balance(c) (cpu_rq(c)->cpu[0].active_balance)
+#else
+# define MAX_NR_SIBLINGS 1
+# define rq_idx(cpu) (cpu)
+# define cpu_idx(cpu) 0
+# define for_each_sibling(idx, rq) while (0)
+# define cpu_active_balance(c) 0
+# define do_active_balance(rq, cpu) do { } while (0)
+# define rq_nr_cpus(rq) 1
+ static inline void active_load_balance(runqueue_t *rq, int this_cpu) { }
+#endif
+
+typedef struct cpu_s {
+ task_t *curr, *idle;
+ task_t *migration_thread;
+ struct list_head migration_queue;
+ int active_balance;
+ int cpu;
+} cpu_t;
+
+/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
@@ -151,7 +195,7 @@
spinlock_t lock;
unsigned long nr_running, nr_switches, expired_timestamp,
nr_uninterruptible;
- task_t *curr, *idle;
+ struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
int prev_nr_running[NR_CPUS];
#ifdef CONFIG_NUMA
@@ -159,27 +203,39 @@
unsigned int nr_balanced;
int prev_node_load[MAX_NUMNODES];
#endif
- task_t *migration_thread;
- struct list_head migration_queue;
+ int nr_cpus;
+ cpu_t cpu[MAX_NR_SIBLINGS];
atomic_t nr_iowait;
} ____cacheline_aligned;
static struct runqueue runqueues[NR_CPUS] __cacheline_aligned;
-#define cpu_rq(cpu) (runqueues + (cpu))
+#define cpu_rq(cpu) (runqueues + (rq_idx(cpu)))
+#define cpu_int(c) ((cpu_rq(c))->cpu + cpu_idx(c))
+#define cpu_curr_ptr(cpu) (cpu_int(cpu)->curr)
+#define cpu_idle_ptr(cpu) (cpu_int(cpu)->idle)
+
#define this_rq() cpu_rq(smp_processor_id())
#define task_rq(p) cpu_rq(task_cpu(p))
-#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define rt_task(p) ((p)->prio < MAX_RT_PRIO)
+#define migration_thread(cpu) (cpu_int(cpu)->migration_thread)
+#define migration_queue(cpu) (&cpu_int(cpu)->migration_queue)
+
+#if NR_CPUS > 1
+# define task_allowed(p, cpu) ((p)->cpus_allowed & (1UL << (cpu)))
+#else
+# define task_allowed(p, cpu) 1
+#endif
+
/*
* Default context-switch locking:
*/
#ifndef prepare_arch_switch
# define prepare_arch_switch(rq, next) do { } while(0)
# define finish_arch_switch(rq, next) spin_unlock_irq(&(rq)->lock)
-# define task_running(rq, p) ((rq)->curr == (p))
+# define task_running(p) (cpu_curr_ptr(task_cpu(p)) == (p))
#endif
#ifdef CONFIG_NUMA
@@ -322,16 +378,21 @@
* Also update all the scheduling statistics stuff. (sleep average
* calculation, priority modifiers, etc.)
*/
+static inline void __activate_task(task_t *p, runqueue_t *rq)
+{
+ enqueue_task(p, rq->active);
+ nr_running_inc(rq);
+}
+
static inline void activate_task(task_t *p, runqueue_t *rq)
{
- unsigned long sleep_time = jiffies - p->sleep_timestamp;
- prio_array_t *array = rq->active;
+ unsigned long sleep_time = jiffies - p->last_run;
if (!rt_task(p) && sleep_time) {
/*
* This code gives a bonus to interactive tasks. We update
* an 'average sleep time' value here, based on
- * sleep_timestamp. The more time a task spends sleeping,
+ * ->last_run. The more time a task spends sleeping,
* the higher the average gets - and the higher the priority
* boost gets as well.
*/
@@ -340,8 +401,7 @@
p->sleep_avg = MAX_SLEEP_AVG;
p->prio = effective_prio(p);
}
- enqueue_task(p, array);
- nr_running_inc(rq);
+ __activate_task(p, rq);
}
/*
@@ -382,6 +442,11 @@
#endif
}
+static inline void resched_cpu(int cpu)
+{
+ resched_task(cpu_curr_ptr(cpu));
+}
+
#ifdef CONFIG_SMP
/*
@@ -398,7 +463,7 @@
repeat:
preempt_disable();
rq = task_rq(p);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
cpu_relax();
/*
* enable/disable preemption just to make this
@@ -409,7 +474,7 @@
goto repeat;
}
rq = task_rq_lock(p, &flags);
- if (unlikely(task_running(rq, p))) {
+ if (unlikely(task_running(p))) {
task_rq_unlock(rq, &flags);
preempt_enable();
goto repeat;
@@ -431,10 +496,39 @@
*/
void kick_if_running(task_t * p)
{
- if ((task_running(task_rq(p), p)) && (task_cpu(p) != smp_processor_id()))
+ if ((task_running(p)) && (task_cpu(p) != smp_processor_id()))
resched_task(p);
}
+static void wake_up_cpu(runqueue_t *rq, int cpu, task_t *p)
+{
+ cpu_t *curr_cpu;
+ task_t *curr;
+ int idx;
+
+ if (idle_cpu(cpu))
+ return resched_cpu(cpu);
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ if (curr_cpu->idle == curr_cpu->curr)
+ return resched_cpu(curr_cpu->cpu);
+ }
+
+ if (p->prio < cpu_curr_ptr(cpu)->prio)
+ return resched_task(cpu_curr_ptr(cpu));
+
+ for_each_sibling(idx, rq) {
+ curr_cpu = rq->cpu + idx;
+ if (!task_allowed(p, curr_cpu->cpu))
+ continue;
+ curr = curr_cpu->curr;
+ if (p->prio < curr->prio)
+ return resched_task(curr);
+ }
+}
/***
* try_to_wake_up - wake up a thread
* @p: the to-be-woken-up thread
@@ -455,6 +549,7 @@
long old_state;
runqueue_t *rq;
+ sync &= SYNC_WAKEUPS;
repeat_lock_task:
rq = task_rq_lock(p, &flags);
old_state = p->state;
@@ -463,7 +558,7 @@
* Fast-migrate the task if it's not running or runnable
* currently. Do not violate hard affinity.
*/
- if (unlikely(sync && !task_running(rq, p) &&
+ if (unlikely(sync && !task_running(p) &&
(task_cpu(p) != smp_processor_id()) &&
(p->cpus_allowed & (1UL << smp_processor_id())))) {
@@ -473,10 +568,12 @@
}
if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
- activate_task(p, rq);
-
- if (p->prio < rq->curr->prio)
- resched_task(rq->curr);
+ if (sync)
+ __activate_task(p, rq);
+ else {
+ activate_task(p, rq);
+ wake_up_cpu(rq, task_cpu(p), p);
+ }
success = 1;
}
p->state = TASK_RUNNING;
@@ -512,8 +609,19 @@
p->prio = effective_prio(p);
}
set_task_cpu(p, smp_processor_id());
- activate_task(p, rq);
+ if (SMART_WAKE_CHILD) {
+ if (unlikely(!current->array))
+ __activate_task(p, rq);
+ else {
+ p->prio = current->prio;
+ list_add_tail(&p->run_list, &current->run_list);
+ p->array = current->array;
+ p->array->nr_active++;
+ nr_running_inc(rq);
+ }
+ } else
+ activate_task(p, rq);
rq_unlock(rq);
}
@@ -561,7 +669,7 @@
* context_switch - switch to the new MM and the new
* thread's register state.
*/
-static inline task_t * context_switch(task_t *prev, task_t *next)
+static inline task_t * context_switch(runqueue_t *rq, task_t *prev, task_t *next)
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;
@@ -575,7 +683,7 @@
if (unlikely(!prev->mm)) {
prev->active_mm = NULL;
- mmdrop(oldmm);
+ rq->prev_mm = oldmm;
}
/* Here we just switch the register state and the stack. */
@@ -596,8 +704,9 @@
unsigned long i, sum = 0;
for (i = 0; i < NR_CPUS; i++)
- sum += cpu_rq(i)->nr_running;
-
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_running;
return sum;
}
@@ -608,7 +717,9 @@
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
- sum += cpu_rq(i)->nr_uninterruptible;
+ /* Shared runqueues are counted only once. */
+ if (!cpu_idx(i))
+ sum += cpu_rq(i)->nr_uninterruptible;
}
return sum;
}
@@ -790,7 +901,23 @@
#endif /* CONFIG_NUMA */
-#if CONFIG_SMP
+/*
+ * One of the idle_cpu_tick() and busy_cpu_tick() functions will
+ * get called every timer tick, on every CPU. Our balancing action
+ * frequency and balancing agressivity depends on whether the CPU is
+ * idle or not.
+ *
+ * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
+ * systems with HZ=100, every 10 msecs.)
+ */
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+
+#if !CONFIG_SMP
+
+static inline void load_balance(runqueue_t *rq, int this_cpu, int idle) { }
+
+#else
/*
* double_lock_balance - lock the busiest runqueue
@@ -906,12 +1033,7 @@
set_task_cpu(p, this_cpu);
nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
- /*
- * Note that idle threads have a prio of MAX_PRIO, for this test
- * to be always true for them.
- */
- if (p->prio < this_rq->curr->prio)
- set_need_resched();
+ wake_up_cpu(this_rq, this_cpu, p);
}
/*
@@ -922,9 +1044,9 @@
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle)
+static void load_balance(runqueue_t *this_rq, int this_cpu, int idle)
{
- int imbalance, idx, this_cpu = smp_processor_id();
+ int imbalance, idx;
runqueue_t *busiest;
prio_array_t *array;
struct list_head *head, *curr;
@@ -972,12 +1094,14 @@
* 1) running (obviously), or
* 2) cannot be migrated to this CPU due to cpus_allowed, or
* 3) are cache-hot on their current CPU.
+ *
+ * (except if we are in idle mode which is a more agressive
+ * form of rebalancing.)
*/
-#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
- ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
- !task_running(rq, p) && \
- ((p)->cpus_allowed & (1UL << (this_cpu))))
+#define CAN_MIGRATE_TASK(p,rq,cpu) \
+ ((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
+ !task_running(p) && task_allowed(p, cpu))
curr = curr->prev;
@@ -1000,31 +1124,136 @@
;
}
+#if CONFIG_SHARE_RUNQUEUE
+static void active_load_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+ /*
+ * Any SMT-specific imbalance?
+ */
+ for_each_sibling(idx, rq)
+ if (rq->cpu[idx].idle == rq->cpu[idx].curr)
+ goto next_cpu;
+
+ /*
+ * At this point it's sure that we have a SMT
+ * imbalance: this (physical) CPU is idle but
+ * another CPU has two (or more) tasks running.
+ *
+ * We wake up one of the migration threads (it
+ * doesnt matter which one) and let it fix things up:
+ */
+ if (!cpu_active_balance(i)) {
+ cpu_active_balance(i) = 1;
+ spin_unlock(&this_rq->lock);
+ wake_up_process(rq->cpu[0].migration_thread);
+ spin_lock(&this_rq->lock);
+ }
+next_cpu:
+ }
+}
+
+static void do_active_balance(runqueue_t *this_rq, int this_cpu)
+{
+ runqueue_t *rq;
+ int i, idx;
+
+ spin_unlock(&this_rq->lock);
+
+ cpu_active_balance(this_cpu) = 0;
+
+ /*
+ * Is the imbalance still present?
+ */
+ for_each_sibling(idx, this_rq)
+ if (this_rq->cpu[idx].idle == this_rq->cpu[idx].curr)
+ goto out;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (!cpu_online(i))
+ continue;
+ rq = cpu_rq(i);
+ if (rq == this_rq)
+ continue;
+
+ /* completely idle CPU? */
+ if (rq->nr_running)
+ continue;
+
+ /*
+ * At this point it's reasonably sure that we have an
+ * imbalance. Since we are the migration thread, try to
+ * balance a thread over to the target queue.
+ */
+ spin_lock(&rq->lock);
+ load_balance(rq, i, 1);
+ spin_unlock(&rq->lock);
+ goto out;
+ }
+out:
+ spin_lock(&this_rq->lock);
+}
+
/*
- * One of the idle_cpu_tick() and busy_cpu_tick() functions will
- * get called every timer tick, on every CPU. Our balancing action
- * frequency and balancing agressivity depends on whether the CPU is
- * idle or not.
+ * This routine is called to map a CPU into another CPU's runqueue.
*
- * busy-rebalance every 250 msecs. idle-rebalance every 1 msec. (or on
- * systems with HZ=100, every 10 msecs.)
+ * This must be called during bootup with the merged runqueue having
+ * no tasks.
*/
-#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
-#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+void sched_map_runqueue(int cpu1, int cpu2)
+{
+ runqueue_t *rq1 = cpu_rq(cpu1);
+ runqueue_t *rq2 = cpu_rq(cpu2);
+ int cpu2_idx_orig = cpu_idx(cpu2), cpu2_idx;
+
+ printk("sched_merge_runqueues: CPU#%d <=> CPU#%d, on CPU#%d.\n", cpu1, cpu2, smp_processor_id());
+ if (rq1 == rq2)
+ BUG();
+ if (rq2->nr_running)
+ BUG();
+ /*
+ * At this point, we dont have anything in the runqueue yet. So,
+ * there is no need to move processes between the runqueues.
+ * Only, the idle processes should be combined and accessed
+ * properly.
+ */
+ cpu2_idx = rq1->nr_cpus++;
-static inline void idle_tick(runqueue_t *rq)
+ if (rq_idx(cpu1) != cpu1)
+ BUG();
+ rq_idx(cpu2) = cpu1;
+ cpu_idx(cpu2) = cpu2_idx;
+ rq1->cpu[cpu2_idx].cpu = cpu2;
+ rq1->cpu[cpu2_idx].idle = rq2->cpu[cpu2_idx_orig].idle;
+ rq1->cpu[cpu2_idx].curr = rq2->cpu[cpu2_idx_orig].curr;
+ INIT_LIST_HEAD(&rq1->cpu[cpu2_idx].migration_queue);
+
+ /* just to be safe: */
+ rq2->cpu[cpu2_idx_orig].idle = NULL;
+ rq2->cpu[cpu2_idx_orig].curr = NULL;
+}
+#endif
+#endif
+
+DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
+
+static inline void idle_tick(runqueue_t *rq, unsigned long j)
{
- if (jiffies % IDLE_REBALANCE_TICK)
+ if (j % IDLE_REBALANCE_TICK)
return;
spin_lock(&rq->lock);
- load_balance(rq, 1);
+ load_balance(rq, smp_processor_id(), 1);
spin_unlock(&rq->lock);
}
-#endif
-
-DEFINE_PER_CPU(struct kernel_stat, kstat) = { { 0 } };
-
/*
* We place interactive tasks back into the active array, if possible.
*
@@ -1035,9 +1264,9 @@
* increasing number of running tasks:
*/
#define EXPIRED_STARVING(rq) \
- ((rq)->expired_timestamp && \
+ (STARVATION_LIMIT && ((rq)->expired_timestamp && \
(jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1))
+ STARVATION_LIMIT * ((rq)->nr_running) + 1)))
/*
* This function gets called by the timer code, with HZ frequency.
@@ -1050,12 +1279,13 @@
{
int cpu = smp_processor_id();
runqueue_t *rq = this_rq();
+ unsigned long j = jiffies;
task_t *p = current;
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);
- if (p == rq->idle) {
+ if (p == cpu_idle_ptr(cpu)) {
/* note: this timer irq context must be accounted for as well */
if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
kstat_cpu(cpu).cpustat.system += sys_ticks;
@@ -1063,9 +1293,7 @@
kstat_cpu(cpu).cpustat.iowait += sys_ticks;
else
kstat_cpu(cpu).cpustat.idle += sys_ticks;
-#if CONFIG_SMP
- idle_tick(rq);
-#endif
+ idle_tick(rq, j);
return;
}
if (TASK_NICE(p) > 0)
@@ -1074,12 +1302,13 @@
kstat_cpu(cpu).cpustat.user += user_ticks;
kstat_cpu(cpu).cpustat.system += sys_ticks;
+ spin_lock(&rq->lock);
/* Task might have expired already, but not scheduled off yet */
if (p->array != rq->active) {
set_tsk_need_resched(p);
+ spin_unlock(&rq->lock);
return;
}
- spin_lock(&rq->lock);
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
@@ -1115,16 +1344,14 @@
if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
+ rq->expired_timestamp = j;
enqueue_task(p, rq->expired);
} else
enqueue_task(p, rq->active);
}
out:
-#if CONFIG_SMP
- if (!(jiffies % BUSY_REBALANCE_TICK))
- load_balance(rq, 0);
-#endif
+ if (!(j % BUSY_REBALANCE_TICK))
+ load_balance(rq, smp_processor_id(), 0);
spin_unlock(&rq->lock);
}
@@ -1135,11 +1362,11 @@
*/
asmlinkage void schedule(void)
{
+ int idx, this_cpu, retry = 0;
+ struct list_head *queue;
task_t *prev, *next;
- runqueue_t *rq;
prio_array_t *array;
- struct list_head *queue;
- int idx;
+ runqueue_t *rq;
/*
* Test if we are atomic. Since do_exit() needs to call into
@@ -1152,15 +1379,15 @@
dump_stack();
}
}
-
- check_highmem_ptes();
need_resched:
+ check_highmem_ptes();
+ this_cpu = smp_processor_id();
preempt_disable();
prev = current;
rq = this_rq();
release_kernel_lock(prev);
- prev->sleep_timestamp = jiffies;
+ prev->last_run = jiffies;
spin_lock_irq(&rq->lock);
/*
@@ -1183,12 +1410,14 @@
}
pick_next_task:
if (unlikely(!rq->nr_running)) {
-#if CONFIG_SMP
- load_balance(rq, 1);
+ load_balance(rq, this_cpu, 1);
if (rq->nr_running)
goto pick_next_task;
-#endif
- next = rq->idle;
+ active_load_balance(rq, this_cpu);
+ if (rq->nr_running)
+ goto pick_next_task;
+pick_idle:
+ next = cpu_idle_ptr(this_cpu);
rq->expired_timestamp = 0;
goto switch_tasks;
}
@@ -1204,24 +1433,60 @@
rq->expired_timestamp = 0;
}
+new_array:
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);
+ if ((next != prev) && (rq_nr_cpus(rq) > 1)) {
+ struct list_head *tmp = queue->next;
+
+ while ((task_running(next) && (next != prev)) || !task_allowed(next, this_cpu)) {
+ tmp = tmp->next;
+ if (tmp != queue) {
+ next = list_entry(tmp, task_t, run_list);
+ continue;
+ }
+ idx = find_next_bit(array->bitmap, MAX_PRIO, ++idx);
+ if (idx == MAX_PRIO) {
+ if (retry || !rq->expired->nr_active) {
+ goto pick_idle;
+ }
+ /*
+ * To avoid infinite changing of arrays,
+ * when we have only tasks runnable by
+ * sibling.
+ */
+ retry = 1;
+
+ array = rq->expired;
+ goto new_array;
+ }
+ queue = array->queue + idx;
+ tmp = queue->next;
+ next = list_entry(tmp, task_t, run_list);
+ }
+ }
switch_tasks:
prefetch(next);
clear_tsk_need_resched(prev);
- RCU_qsctr(prev->thread_info->cpu)++;
+ RCU_qsctr(task_cpu(prev))++;
if (likely(prev != next)) {
+ struct mm_struct *prev_mm;
rq->nr_switches++;
- rq->curr = next;
+ cpu_curr_ptr(this_cpu) = next;
+ set_task_cpu(next, this_cpu);
prepare_arch_switch(rq, next);
- prev = context_switch(prev, next);
+ prev = context_switch(rq, prev, next);
barrier();
rq = this_rq();
+ prev_mm = rq->prev_mm;
+ rq->prev_mm = NULL;
finish_arch_switch(rq, prev);
+ if (prev_mm)
+ mmdrop(prev_mm);
} else
spin_unlock_irq(&rq->lock);
@@ -1481,9 +1746,8 @@
* If the task is running and lowered its priority,
* or increased its priority then reschedule its CPU:
*/
- if ((NICE_TO_PRIO(nice) < p->static_prio) ||
- task_running(rq, p))
- resched_task(rq->curr);
+ if ((NICE_TO_PRIO(nice) < p->static_prio) || task_running(p))
+ resched_task(cpu_curr_ptr(task_cpu(p)));
}
out_unlock:
task_rq_unlock(rq, &flags);
@@ -1561,7 +1825,7 @@
*/
int task_curr(task_t *p)
{
- return cpu_curr(task_cpu(p)) == p;
+ return cpu_curr_ptr(task_cpu(p)) == p;
}
/**
@@ -1570,7 +1834,7 @@
*/
int idle_cpu(int cpu)
{
- return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+ return cpu_curr_ptr(cpu) == cpu_idle_ptr(cpu);
}
/**
@@ -1660,7 +1924,7 @@
else
p->prio = p->static_prio;
if (array)
- activate_task(p, task_rq(p));
+ __activate_task(p, task_rq(p));
out_unlock:
task_rq_unlock(rq, &flags);
@@ -2135,7 +2399,7 @@
local_irq_save(flags);
double_rq_lock(idle_rq, rq);
- idle_rq->curr = idle_rq->idle = idle;
+ cpu_curr_ptr(cpu) = cpu_idle_ptr(cpu) = idle;
deactivate_task(idle, rq);
idle->array = NULL;
idle->prio = MAX_PRIO;
@@ -2190,6 +2454,7 @@
unsigned long flags;
migration_req_t req;
runqueue_t *rq;
+ int cpu;
#if 0 /* FIXME: Grab cpu_lock, return error on this case. --RR */
new_mask &= cpu_online_map;
@@ -2211,31 +2476,31 @@
* If the task is not on a runqueue (and not running), then
* it is sufficient to simply update the task's cpu field.
*/
- if (!p->array && !task_running(rq, p)) {
+ if (!p->array && !task_running(p)) {
set_task_cpu(p, __ffs(p->cpus_allowed));
task_rq_unlock(rq, &flags);
return;
}
init_completion(&req.done);
req.task = p;
- list_add(&req.list, &rq->migration_queue);
+ cpu = task_cpu(p);
+ list_add(&req.list, migration_queue(cpu));
task_rq_unlock(rq, &flags);
-
- wake_up_process(rq->migration_thread);
+ wake_up_process(migration_thread(cpu));
wait_for_completion(&req.done);
}
/*
- * migration_thread - this is a highprio system thread that performs
+ * migration_task - this is a highprio system thread that performs
* thread migration by 'pulling' threads into the target runqueue.
*/
-static int migration_thread(void * data)
+static int migration_task(void * data)
{
struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
int cpu = (long) data;
runqueue_t *rq;
- int ret;
+ int ret, idx;
daemonize();
sigfillset(&current->blocked);
@@ -2250,7 +2515,8 @@
ret = setscheduler(0, SCHED_FIFO, &param);
rq = this_rq();
- rq->migration_thread = current;
+ migration_thread(cpu) = current;
+ idx = cpu_idx(cpu);
sprintf(current->comm, "migration/%d", smp_processor_id());
@@ -2263,7 +2529,9 @@
task_t *p;
spin_lock_irqsave(&rq->lock, flags);
- head = &rq->migration_queue;
+ if (cpu_active_balance(cpu))
+ do_active_balance(rq, cpu);
+ head = migration_queue(cpu);
current->state = TASK_INTERRUPTIBLE;
if (list_empty(head)) {
spin_unlock_irqrestore(&rq->lock, flags);
@@ -2292,9 +2560,8 @@
set_task_cpu(p, cpu_dest);
if (p->array) {
deactivate_task(p, rq_src);
- activate_task(p, rq_dest);
- if (p->prio < rq_dest->curr->prio)
- resched_task(rq_dest->curr);
+ __activate_task(p, rq_dest);
+ wake_up_cpu(rq_dest, cpu_dest, p);
}
}
double_rq_unlock(rq_src, rq_dest);
@@ -2312,12 +2579,13 @@
unsigned long action,
void *hcpu)
{
+ long cpu = (long) hcpu;
+
switch (action) {
case CPU_ONLINE:
- printk("Starting migration thread for cpu %li\n",
- (long)hcpu);
- kernel_thread(migration_thread, hcpu, CLONE_KERNEL);
- while (!cpu_rq((long)hcpu)->migration_thread)
+ printk("Starting migration thread for cpu %li\n", cpu);
+ kernel_thread(migration_task, hcpu, CLONE_KERNEL);
+ while (!migration_thread(cpu))
yield();
break;
}
@@ -2392,11 +2660,20 @@
for (i = 0; i < NR_CPUS; i++) {
prio_array_t *array;
+ /*
+ * Start with a 1:1 mapping between CPUs and runqueues:
+ */
+#if CONFIG_SHARE_RUNQUEUE
+ rq_idx(i) = i;
+ cpu_idx(i) = 0;
+#endif
rq = cpu_rq(i);
rq->active = rq->arrays;
rq->expired = rq->arrays + 1;
spin_lock_init(&rq->lock);
- INIT_LIST_HEAD(&rq->migration_queue);
+ INIT_LIST_HEAD(migration_queue(i));
+ rq->nr_cpus = 1;
+ rq->cpu[cpu_idx(i)].cpu = i;
atomic_set(&rq->nr_iowait, 0);
nr_running_init(rq);
@@ -2414,9 +2691,13 @@
* We have to do a little magic to get the first
* thread right in SMP mode.
*/
- rq = this_rq();
- rq->curr = current;
- rq->idle = current;
+ cpu_curr_ptr(smp_processor_id()) = current;
+ cpu_idle_ptr(smp_processor_id()) = current;
+ printk("sched_init().\n");
+ printk("smp_processor_id(): %d.\n", smp_processor_id());
+ printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));
+ printk("this_rq(): %p.\n", this_rq());
+
set_task_cpu(current, smp_processor_id());
wake_up_process(current);
--- linux/init/main.c.orig
+++ linux/init/main.c
@@ -354,7 +354,14 @@
static void rest_init(void)
{
+ /*
+ * We count on the initial thread going ok
+ * Like idlers init is an unlocked kernel thread, which will
+ * make syscalls (and thus be locked).
+ */
+ init_idle(current, smp_processor_id());
kernel_thread(init, NULL, CLONE_KERNEL);
+
unlock_kernel();
cpu_idle();
}
@@ -438,13 +445,6 @@
check_bugs();
printk("POSIX conformance testing by UNIFIX\n");
- /*
- * We count on the initial thread going ok
- * Like idlers init is an unlocked kernel thread, which will
- * make syscalls (and thus be locked).
- */
- init_idle(current, smp_processor_id());
-
/* Do the rest non-__init'ed, we're now alive */
rest_init();
}
On Mon, 2003-02-03 at 13:23, Ingo Molnar wrote:
> the attached patch (against 2.5.59 or BK-curr) has two HT-scheduler fixes
> over the -D7 patch:
Hi, Ingo. I am running this now, with good results. Unfortunately I do
not have an HT-capable processor, so I am only testing the interactivity
improvements. They are looking very good - a step in the right
direction. Very nice.
A couple bits:
> - wake_up_interruptible(PIPE_WAIT(*inode));
> + wake_up_interruptible_sync(PIPE_WAIT(*inode));
> ...
> - wake_up_interruptible(PIPE_WAIT(*inode));
> + wake_up_interruptible_sync(PIPE_WAIT(*inode));
> ...
> - wake_up_interruptible(PIPE_WAIT(*inode));
> + wake_up_interruptible_sync(PIPE_WAIT(*inode));
These are not correct, right? I believe we want normal behavior here,
no?
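As far as I know the only difference is the "sync" hint: the waker promises to
reschedule soon, so the woken task is not allowed to preempt it immediately. A
minimal sketch of the distinction (the comments are mine, not fs/pipe.c):

	/* normal wakeup: the woken reader may preempt the writer right away */
	wake_up_interruptible(PIPE_WAIT(*inode));

	/* sync wakeup: hint that the waker is about to block or return,
	   so do not preempt it on behalf of the woken task */
	wake_up_interruptible_sync(PIPE_WAIT(*inode));

I can see the sync variant avoiding some ping-pong preemption in pipe-heavy
benchmarks, but for the general pipe path the normal wakeup still looks like
the right default to me.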
> --- linux/kernel/sys.c.orig
> +++ linux/kernel/sys.c
> @@ -220,7 +220,7 @@
>
> if (error == -ESRCH)
> error = 0;
> - if (niceval < task_nice(p) && !capable(CAP_SYS_NICE))
> + if (0 && niceval < task_nice(p) && !capable(CAP_SYS_NICE))
What is the point of this? Left in for debugging?
> -#define MAX_SLEEP_AVG (2*HZ)
> -#define STARVATION_LIMIT (2*HZ)
> +#define MAX_SLEEP_AVG (10*HZ)
> +#define STARVATION_LIMIT (30*HZ)
Do you really want a 30 second starvation limit? Ouch.
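If I remember the surrounding code right, STARVATION_LIMIT feeds a check along
these lines (paraphrased from memory, not the exact 2.5.59 text), which decides
when interactive tasks may no longer be requeued into the active array:

	/* the expired array is starving once its oldest task has waited
	   longer than STARVATION_LIMIT scaled by the number of runnable tasks */
	#define EXPIRED_STARVING(rq) \
		((rq)->expired_timestamp && \
		 (jiffies - (rq)->expired_timestamp >= \
			STARVATION_LIMIT * ((rq)->nr_running) + 1))

With 30*HZ, and the limit further scaled by nr_running, expired tasks on a busy
runqueue could wait for minutes before getting a turn.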
> + printk("rq_idx(smp_processor_id()): %ld.\n", rq_idx(smp_processor_id()));
This gives a compiler warning on UP:
kernel/sched.c: In function `sched_init':
kernel/sched.c:2722: warning: long int format, int arg (arg 2)
Since rq_idx(), on UP, merely returns its parameter, which is an int.
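A cast in the printk would silence it, e.g. something like:

	printk("rq_idx(smp_processor_id()): %ld.\n",
		(long) rq_idx(smp_processor_id()));

(or the debugging printks could simply be dropped before this gets merged).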
Otherwise, looking nice :)
Robert Love
Hi Ingo,
On Monday 03 February 2003 19:23, Ingo Molnar wrote:
> -#define CAN_MIGRATE_TASK(p,rq,this_cpu) \
> -	((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
> -	!task_running(rq, p) && \
> -	((p)->cpus_allowed & (1UL << (this_cpu))))
> +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> +	((idle || (jiffies - (p)->last_run > cache_decay_ticks)) && \
> +	!task_running(p) && task_allowed(p, cpu))
at least for NUMA systems this is too aggressive (though I believe
normal SMP systems could be hurt, too).
The problem: freshly forked tasks are stolen by idle CPUs on the same node
before they exec. This effectively disables the sched_balance_exec()
mechanism, because the tasks to be balanced already run alone on other
CPUs. Which means the whole benefit of having balanced nodes (maximizing
memory bandwidth) is gone.
The change below is less aggressive but in the same spirit: an idle CPU only
halves the required cache-cool-down (cache_decay_ticks >> 1) instead of
skipping it entirely, so freshly forked children still get a chance to reach
exec(). Could you please take it instead?
> +#define CAN_MIGRATE_TASK(p,rq,cpu) \
> +	((jiffies - (p)->last_run > (cache_decay_ticks >> idle)) && \
> +	!task_running(p) && task_allowed(p, cpu))
Regards,
Erich