As recently as 2.6.15-rc5-mm1 the combination of "migration cost matrix"
calculations and "dynamic sched domains" is broken. This patch fixes
the basic bug, even though I still have larger problems with the
"migration cost matrix" concept as implemented.
The essence of the bug is that the code and data in the CPU Scheduler
that perform the cost calculation are declared as "__initdata" or
"__init" or "__devinit", even though a runtime declaration of a
cpu-exclusive cpuset will invoke build_sched_domains() to rebuild
the sched domains, and that calls the now-released
calibrate_migration_costs(), et al. The attached patch changes the
declarations of the code and data to make them persistent.
To review, to create a cpu-exclusive cpuset, do something like this:
cd /dev/cpuset
mkdir cpu0
echo 0 > cpu0/cpus
echo 0 > cpu0/mems
echo 1 > cpu0/cpu_exclusive
Various other problems with "migration cost matrix" are not addressed
by this patch. They are:
(1) calibrate_migration_costs() printk's the matrix at boot time, and
for large NR_CPUS values this can be voluminous. We might consider
changing the naked printk()'s into printk(KERN_DEBUG ...) to hide
them from everyone but a sysadmin or developer who has a need to
view the values.
(2) calibrate_migration_costs() printk's the matrix for every call to
build_sched_domains(), which is called twice for each declaration
of a cpu-exclusive cpuset. The syslog output is voluminous for
large NR_CPUS values. It is unclear to me how interesting this
output is after the initial display at boot time. Various Job
Manager software will actively create and destroy cpu-exclusive
cpusets, and that will flood the syslog with matrix output.
(3) For a cpu-exclusive cpuset cpu, the domain_distance() calculation
produces a legal value, but not a useful value. That is to say,
if cpu0 is isolated into a cpu-exclusive cpuset, then the semantic
meaning of "domain distance" between cpu0 and another CPU outside of
this cpu-exclusive cpuset is essentially undefined. No load-balancing
will occur in or out of this cpu-exclusive cpuset. There is no
need to measure_migration_cost() of any migrations in or out of
a cpu-exclusive cpuset.
The reason why this is significant is that measure_migration_cost()
is expensive, and the creation of that cpu0 cpu-exclusive cpuset
stalls the
echo 1 > cpu0/cpu_exclusive
for about six seconds on a 64p ia64/sn2 platform, all to calculate
a cost value that is never used.
(4) The expense of the migration_cost() calculation can be sizeable.
It can be reduced somewhat by changing the 5% increments into 10%
or even higher. That reduces the boot-time "calibration delay"
from 50 seconds down to 20 seconds (for the 64p ia64/sn2) for
three levels of sched domains, and the declaration of a single
cpu0 cpu-exclusive dynamic sched domain from 6 seconds down to 5
(all to calculate that unnecessary new domain_distance).
(5) Besides, the migration cost between any two CPUs is something
that can be calculated once at boot time and remembered
thereafter. I suspect the reason why the algorithm doesn't do
this is that an exhaustive NxN calculation will be very slow for
large NR_CPUS counts, which explains why the calculations are
now done in the context of sched domains.
However, simply calculating the migration cost for each "domain
distance" presupposes that the sched domain groupings accurately
reflect true migration costs. For platforms like the ia64/sn2
this is not true. For example, a 64p ia64/sn2 will have three
sched domain levels: level zero is each 2-CPU pair within a
node; level one is an SD_NODES_PER_DOMAIN-size arbitrary clump
of 32 CPUs; level two is all 64 CPUs. Looking at that middle
level, if I understand the algorithm, there is one and only one
migration_cost[1] calculation made for an arbitrary choice of one
CPU in a 32-CPU clump and another CPU in that clump. Reality is
that there are various migration costs for different pairs of CPUs
in that set of 32. This is yet another reason why I believe we
don't need to do those 5% size increments to get really accurate
timings.
For now, here is the patch to avoid the basic bug:
Signed-off-by: John Hawkes <[email protected]>
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c 2005-12-09 09:26:04.000000000 -0800
+++ linux/kernel/sched.c 2005-12-09 10:53:28.000000000 -0800
@@ -5295,7 +5295,7 @@
#define MIGRATION_FACTOR_SCALE 128
-static __initdata unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
+static unsigned int migration_factor = MIGRATION_FACTOR_SCALE;
static int __init setup_migration_factor(char *str)
{
@@ -5310,7 +5310,7 @@
* Estimated distance of two CPUs, measured via the number of domains
* we have to pass for the two CPUs to be in the same span:
*/
-__devinit static unsigned long domain_distance(int cpu1, int cpu2)
+static unsigned long domain_distance(int cpu1, int cpu2)
{
unsigned long distance = 0;
struct sched_domain *sd;
@@ -5329,7 +5329,7 @@
return distance;
}
-static __initdata unsigned int migration_debug;
+static unsigned int migration_debug;
static int __init setup_migration_debug(char *str)
{
@@ -5345,7 +5345,7 @@
* bootup. Gets used in the domain-setup code (i.e. during SMP
* bootup).
*/
-__initdata unsigned int max_cache_size;
+unsigned int max_cache_size;
static int __init setup_max_cache_size(char *str)
{
@@ -5360,7 +5360,7 @@
* is the operation that is timed, so we try to generate unpredictable
* cachemisses that still end up filling the L2 cache:
*/
-__init static void touch_cache(void *__cache, unsigned long __size)
+static void touch_cache(void *__cache, unsigned long __size)
{
unsigned long size = __size/sizeof(long), chunk1 = size/3,
chunk2 = 2*size/3;
@@ -5382,8 +5382,8 @@
/*
* Measure the cache-cost of one task migration. Returns in units of nsec.
*/
-__init static unsigned long long measure_one(void *cache, unsigned long size,
- int source, int target)
+static unsigned long long measure_one(void *cache, unsigned long size,
+ int source, int target)
{
cpumask_t mask, saved_mask;
unsigned long long t0, t1, t2, t3, cost;
@@ -5449,7 +5449,7 @@
* Architectures can prime the upper limit of the search range via
* max_cache_size, otherwise the search range defaults to 20MB...64K.
*/
-__init static unsigned long long
+static unsigned long long
measure_cost(int cpu1, int cpu2, void *cache, unsigned int size)
{
unsigned long long cost1, cost2;
@@ -5501,7 +5501,7 @@
return cost1 - cost2;
}
-__devinit static unsigned long long measure_migration_cost(int cpu1, int cpu2)
+static unsigned long long measure_migration_cost(int cpu1, int cpu2)
{
unsigned long long max_cost = 0, fluct = 0, avg_fluct = 0;
unsigned int max_size, size, size_found = 0;
@@ -5606,7 +5606,7 @@
return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;
}
-void __devinit calibrate_migration_costs(void)
+void calibrate_migration_costs(void)
{
int cpu1 = -1, cpu2 = -1, cpu, orig_cpu = raw_smp_processor_id();
struct sched_domain *sd;
> (5) Besides, the migration cost between any two CPUs is something
> that can be calculated once at boot time and remembered
> thereafter. I suspect the reason why the algorithm doesn't do
> this is that an exhaustive NxN calculation will be very slow for
> large NR_CPUS counts, which explains why the calculations are
> now done in the context of sched domains.
Agreed - I too suspect that this is a form of compression, both of
computation costs and data size. We save space and time by not
calculating the full N * N matrix, where N is num_online_cpus(), but
just the sched domain sub-matrices.
In theory, I would think that we should -not- compress based on sched
domains, because:
1) these are (now becoming) dynamic, and
2) they don't reflect the "natural" basis for such compression,
which would be hardware topology based, not sched domain based.
Rather we should compress based on the topological symmetries of the
hardware system. Of course, this is an ARCH specific characteristic,
or even platform specific.
Perhaps we could provide an ARCH specific routine that would map any
ordered pair <cpu0, cpu1> of cpu numbers to a canonical pair, such that
the canonical pairs were "about as far apart, for that system
topology", but potentially much fewer in number than the entire N * N
space, with a smaller maximum value for the returned cpu numbers.
The default routine would be the identity function, which would work
fine for ordinary sized systems.
A second ARCH specific routine would return M, the largest canonical
cpu number that the above routine could return. The distance array
could be dynamically allocated to M**2 size. The default routine
would just return the highest online CPU number.
These 'canonical cpu pairs' would replace the sched domains as the
basis for compression.
Then one time at boot, for each possible pair of online cpus, map that
pair to its canonical pair, and if not already done, compute its
migration cost. For example, if on the current systems topology, cpu
pairs <3,5> and <67,69> are pretty much the same distances apart, the
"canonical" pair for both these might be <3,5>, and only that pair
would have to be actually computed and stored. Every time the software
using this wanted results for <67,69>, it would get mapped to <3,5> for
resolution.
In the extreme case of a big NUMA system with an essentially homogeneous
topology (all cpu-cpu distances the same), all <cpu0, cpu1> pairs where
cpu0 != cpu1 could be mapped to the same canonical pair <0, 1>.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401
* [email protected] <[email protected]> wrote:
> The essence of the bug is that the code and data in the CPU Scheduler
> that perform the cost calculation are declared as "__initdata" or
> "__init" or "__devinit", even though a runtime declaration of a
> cpu-exclusive cpuset will invoke build_sched_domains() to rebuild the
> sched domains, and that calls the now-released
> calibrate_migration_costs(), et al. The attached patch changes the
> declarations of the code and data to make them persistent.
ok.
Acked-by: Ingo Molnar <[email protected]>
> (1) calibrate_migration_costs() printk's the matrix at boot time, and
> for large NR_CPUS values this can be voluminous. We might consider
> changing the naked printk()'s into printk(KERN_DEBUG ...) to hide
> them from everyone but a sysadmin or developer who has a need to
> view the values.
yeah, agreed.
> (2) calibrate_migration_costs() printk's the matrix for every call to
> build_sched_domains(), which is called twice for each declaration
> of a cpu-exclusive cpuset. The syslog output is voluminous for
> large NR_CPUS values. It is unclear to me how interesting this
> output is after the initial display at boot time. Various Job
> Manager software will actively create and destroy cpu-exclusive
> cpusets, and that will flood the syslog with matrix output.
ok. I've changed the code to be less verbose: only the distance function
is printed, not the full matrix. I have also made some other printks
dependent on migration_debug (which is off by default). The boot-time
output is now a minimalistic:
migration_cost=269,777
which can also be pasted into the kernel's boot command line to turn off
calibration on the next bootup. Patch attached below, goes to the tail
of the current scheduler queue in -mm.
[i'll reply to your other points in the next mail.]
Ingo
----
be less verbose in the migration-cost printout.
Signed-off-by: Ingo Molnar <[email protected]>
kernel/sched.c | 52 +++++++++++++---------------------------------------
1 files changed, 13 insertions(+), 39 deletions(-)
Index: linux-mm.q/kernel/sched.c
===================================================================
--- linux-mm.q.orig/kernel/sched.c
+++ linux-mm.q/kernel/sched.c
@@ -5485,10 +5485,8 @@ static unsigned long long measure_migrat
void calibrate_migration_costs(void)
{
int cpu1 = -1, cpu2 = -1, cpu, orig_cpu = raw_smp_processor_id();
+ unsigned long j0, j1, distance, max_distance = 0;
struct sched_domain *sd;
- unsigned long distance, max_distance = 0;
- unsigned long long cost;
- unsigned long j0, j1;
j0 = jiffies;
@@ -5502,14 +5500,11 @@ void calibrate_migration_costs(void)
distance = domain_distance(cpu1, cpu2);
max_distance = max(max_distance, distance);
/*
- * Do we have the result cached already?
+ * No result cached yet?
*/
- if (migration_cost[distance] != -1LL)
- cost = migration_cost[distance];
- else {
- cost = measure_migration_cost(cpu1, cpu2);
- migration_cost[distance] = cost;
- }
+ if (migration_cost[distance] == -1LL)
+ migration_cost[distance] =
+ measure_migration_cost(cpu1, cpu2);
}
}
/*
@@ -5526,8 +5521,8 @@ void calibrate_migration_costs(void)
/*
* Print the matrix:
*/
- printk("---------------------\n");
- printk("| migration cost matrix (max_cache_size: %d, cpu: %d MHz):\n",
+ if (migration_debug)
+ printk("migration: max_cache_size: %d, cpu: %d MHz:\n",
max_cache_size,
#ifdef CONFIG_X86
cpu_khz/1000
@@ -5535,37 +5530,16 @@ void calibrate_migration_costs(void)
-1
#endif
);
- printk("---------------------\n");
- printk(" ");
- for_each_online_cpu(cpu1)
- printk(" [%02d]", cpu1);
- printk("\n");
- for_each_online_cpu(cpu1) {
- printk("[%02d]: ", cpu1);
- for_each_online_cpu(cpu2) {
- if (cpu1 == cpu2) {
- printk(" - ");
- continue;
- }
- distance = domain_distance(cpu1, cpu2);
- max_distance = max(max_distance, distance);
- cost = migration_cost[distance];
- printk(" %2ld.%ld(%ld)", (long)cost / 1000000,
- ((long)cost / 100000) % 10, distance);
- }
- printk("\n");
- }
- printk("--------------------------------\n");
- printk("| cacheflush times [%ld]:", max_distance+1);
+ printk("migration_cost=");
for (distance = 0; distance <= max_distance; distance++) {
- cost = migration_cost[distance];
- printk(" %ld.%ld (%Ld)", (long)cost / 1000000,
- ((long)cost / 100000) % 10, cost);
+ if (distance)
+ printk(",");
+ printk("%ld", (long)migration_cost[distance] / 1000);
}
printk("\n");
j1 = jiffies;
- printk("| calibration delay: %ld seconds\n", (j1-j0)/HZ);
- printk("--------------------------------\n");
+ if (migration_debug)
+ printk("migration: %ld seconds\n", (j1-j0)/HZ);
/*
* Move back to the original CPU. NUMA-Q gets confused
* [email protected] <[email protected]> wrote:
> (3) For a cpu-exclusive cpuset cpu, the domain_distance() calculation
> produces a legal value, but not a useful value. That is to say,
> if cpu0 is isolated into a cpu-exclusive cpuset, then the semantic
> meaning of "domain distance" between cpu0 and another CPU outside of
> this cpu-exclusive cpuset is essentially undefined. No load-balancing
> will occur in or out of this cpu-exclusive cpuset. There is no
> need to measure_migration_cost() of any migrations in or out of
> a cpu-exclusive cpuset.
this is a bug in calibrate_migration_costs() - it should only consider
the cpu_map for which the domains are being built. Does the patch below
fix things for you?
we _may_ want to do calibration for isolated domain trees too: if the
new domain tree exposes a domain-depth that was not covered yet. But in
any case, caching should work most of the time.
in any case, i'm curious why none of the WARN_ON()s in domain_distance()
triggered: trying to measure the distance between two CPUs which have no
common span should have triggered the first WARN_ON() in
domain_distance().
> (4) The expense of the migration_cost() calculation can be sizeable.
> It can be reduced somewhat by changing the 5% increments into 10%
> or even higher. That reduces the boot-time "calibration delay"
> from 50 seconds down to 20 seconds (for the 64p ia64/sn2) for
> three levels of sched domains, and the declaration of a single
> cpu0 cpu-exclusive dynamic sched domain from 6 seconds down to 5
> (all to calculate that unnecessary new domain_distance).
an architecture can reduce the calibration costs by tuning the
max_cache_size variable down from the default 64MB. You can also do this
from the command-line, via: max_cache_size=33000000.
but otherwise, the calibration measurement is already a bit noisy, so
i'd not want to make it even more imprecise, at least for now.
> (5) Besides, the migration cost between any two CPUs is something
> that can be calculated once at boot time and remembered
> thereafter. I suspect the reason why the algorithm doesn't do
> this is that an exhaustive NxN calculation will be very slow for
> large NR_CPUS counts, which explains why the calculations are
> now done in the context of sched domains.
we do cache migration costs. This code in calibrate_migration_costs()
does it:
/*
* No result cached yet?
*/
if (migration_cost[distance] == -1LL)
migration_cost[distance] =
measure_migration_cost(cpu1, cpu2);
i changed it from a NxN calculation to the domain-distance and added
caching, primarily because Altix guys complained about the excessive
output and excessive boot-time :-)
> However, simply calculating the migration cost for each "domain
> distance" presupposes that the sched domain groupings accurately
> reflect true migration costs. [...]
yes, that is true. But obviously the migration costs are attached to the
domain structure, so going beyond that in calibration makes no sense -
we couldnt use the information. OTOH the domain tree on the overwhelming
majority of platforms depicts the true cache-hierarchy of the hardware.
> [...] For platforms like the ia64/sn2 this is not true. For
> example, a 64p ia64/sn2 will have three sched domain levels:
> level zero is each 2-CPU pair within a node; level one is an
> SD_NODES_PER_DOMAIN-size arbitrary clump of 32 CPUs; level two is
> all 64 CPUs. Looking at that middle level, if I understand the
> algorithm, there is one and only one migration_cost[1] calculation
> made for an arbitrary choice of one CPU in a 32-CPU clump and
> another CPU in that clump. Reality is that there are various
> migration costs for different pairs of CPUs in that set of 32.
> This is yet another reason why I believe we don't need to do those
> 5% size increments to get really accurate timings.
the domain tree should match the true cache hierarchy as closely as
possible, to get optimal task migration _anyway_. It's not just about
cache-hot autodetection, this is a generic thing for other types of
migrations too.
one artificial simplification in the domain_distance() approach is the
assumption that the domain tree is fundamentally 'symmetric', i.e. a
domain 2 steps away from the bottom of the tree describes the same type
of cache, no matter from which CPU we do this. The moment someone adds
fundamental asymmetry to the hardware, this approach has to be changed.
in the SN2 case i'm not sure there's any true asymmetry on the hardware
side, is there? Couldnt the domain tree be changed to match the hardware
more - if yes, would there be any disadvantages?
Ingo
-----
limit scheduler migration calibration to the affected CPU map.
Signed-off-by: Ingo Molnar <[email protected]>
include/linux/sched.h | 2 --
kernel/sched.c | 10 +++++-----
2 files changed, 5 insertions(+), 7 deletions(-)
Index: linux-mm.q/include/linux/sched.h
===================================================================
--- linux-mm.q.orig/include/linux/sched.h
+++ linux-mm.q/include/linux/sched.h
@@ -650,8 +650,6 @@ extern void partition_sched_domains(cpum
*/
extern unsigned int max_cache_size;
-extern void calibrate_migration_costs(void);
-
#endif /* CONFIG_SMP */
Index: linux-mm.q/kernel/sched.c
===================================================================
--- linux-mm.q.orig/kernel/sched.c
+++ linux-mm.q/kernel/sched.c
@@ -5482,7 +5482,7 @@ static unsigned long long measure_migrat
return 2 * max_cost * migration_factor / MIGRATION_FACTOR_SCALE;
}
-void calibrate_migration_costs(void)
+static void calibrate_migration_costs(const cpumask_t *cpu_map)
{
int cpu1 = -1, cpu2 = -1, cpu, orig_cpu = raw_smp_processor_id();
unsigned long j0, j1, distance, max_distance = 0;
@@ -5493,8 +5493,8 @@ void calibrate_migration_costs(void)
/*
* First pass - calculate the cacheflush times:
*/
- for_each_online_cpu(cpu1) {
- for_each_online_cpu(cpu2) {
+ for_each_cpu_mask(cpu1, *cpu_map) {
+ for_each_cpu_mask(cpu2, *cpu_map) {
if (cpu1 == cpu2)
continue;
distance = domain_distance(cpu1, cpu2);
@@ -5511,7 +5511,7 @@ void calibrate_migration_costs(void)
* Second pass - update the sched domain hierarchy with
* the new cache-hot-time estimations:
*/
- for_each_online_cpu(cpu) {
+ for_each_cpu_mask(cpu, *cpu_map) {
distance = 0;
for_each_domain(cpu, sd) {
sd->cache_hot_time = migration_cost[distance];
@@ -5924,7 +5924,7 @@ next_sg:
/*
* Tune cache-hot values:
*/
- calibrate_migration_costs();
+ calibrate_migration_costs(cpu_map);
}
/*
* Set up scheduler domains and groups. Callers must hold the hotplug lock.
I'm looking at 2.6.15-rc5-mm2 and there is still a problem: the
migration_cost[] array is disappearing after boot, which leads to completely
bogus migration_cost[] values when dynamic sched domains are declared.
The fix:
Index: linux/kernel/sched.c
==========================================================
--- linux.orig/kernel/sched.c 2005-12-12 10:30:24.000000000 -0800
+++ linux/kernel/sched.c 2005-12-12 11:25:27.000000000 -0800
@@ -5279,7 +5279,7 @@
*/
#define MAX_DOMAIN_DISTANCE 32
-static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
+static unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
{ [ 0 ... MAX_DOMAIN_DISTANCE-1 ] = -1LL };
/*
* John Hawkes <[email protected]> wrote:
> I'm looking at 2.6.15-rc5-mm2 and there is still a problem: the
> migration_cost[] array is disappearing after boot, which leads to
> completely bogus migration_cost[] values when dynamic sched domains
> are declared.
oops, indeed!
> -static __initdata unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
> +static unsigned long long migration_cost[MAX_DOMAIN_DISTANCE] =
Acked-by: Ingo Molnar <[email protected]>
Ingo