The boot-time migration cost auto-tuning stuff seems to have
been merged to Linus' tree since 2.6.15. On little one- or
two-processor systems, the time required to measure the
migration costs isn't very noticeable, but by the time we
get to even a four-processor ia64 box, it adds about
30 seconds to the boot time, which seems like a lot.
Is that expected? Is the information we get really worth
that much? Could the measurement be done at run-time
instead? Is there a smaller hammer we could use, e.g.,
flushing just the buffer rather than the *entire* cache?
Did we just implement sched_cacheflush() incorrectly for
ia64?
Only ia64, x86, and x86_64 currently have a non-empty
sched_cacheflush(), and the x86* ones contain only "wbinvd()".
So I suspect that only ia64 sees this slowdown. But I would
guess that other arches will implement it in the future.
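For reference, the x86* flavor boils down to just this (the source only
contains the wbinvd() call):

static inline void sched_cacheflush(void)
{
        wbinvd();       /* write back and invalidate the entire cache */
}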
Bjorn
> The boot-time migration cost auto-tuning stuff seems to have
> been merged to Linus' tree since 2.6.15. On little one- or
> two-processor systems, the time required to measure the
> migration costs isn't very noticeable, but by the time we
> get to even a four-processor ia64 box, it adds about
> 30 seconds to the boot time, which seems like a lot.
I only see about 16 seconds for a 4-way tiger (not that 16 seconds
is good ... but it's not as bad as 30). This was with a build
from tiger_defconfig that sets CONFIG_NR_CPUS=4 ... so I wonder
what's causing the factor of two. I measured with a printk
each side of build_sched_domains() and booted with the "time"
command line arg to get:
[ 0.540718] Building sched domains
[ 16.124693] migration_cost=10091
[ 16.124789] Done
More importantly, how does this time scale as the number of
cpus increases? Linear, or worse? What happens on a 512 cpu
Altix? (If it's quadratic, they may still be waiting for the
boot to finish :-)
-Tony
Luck, Tony wrote:
>>The boot-time migration cost auto-tuning stuff seems to have
>>been merged to Linus' tree since 2.6.15. On little one- or
>>two-processor systems, the time required to measure the
>>migration costs isn't very noticeable, but by the time we
>>get to even a four-processor ia64 box, it adds about
>>30 seconds to the boot time, which seems like a lot.
>
>
> I only see about 16 seconds for a 4-way tiger (not that 16 seconds
> is good ... but it's not as bad as 30). This was with a build
> from tiger_defconfig that sets CONFIG_NR_CPUS=4 ... so I wonder
> what's causing the factor of two. I measured with a printk
> each side of build_sched_domains() and booted with the "time"
> command line arg to get:
I've noticed the delay on a 16p and 64p. At first I thought it was a
system hang but have since learned to live with the delay.
> What happens on a 512 cpu
> Altix? (If it's quadratic, they may still be waiting for the
> boot to finish :-)
Not quadratic. This is a 64p Altix ...
[ 9.942253] Brought up 64 CPUs
[ 9.942904] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.943995] build_sched_domains: start
[ 32.108439] migration_cost=0,32232,39021
[ 37.894391] build_sched_domains: end
P.
* Bjorn Helgaas <[email protected]> wrote:
> The boot-time migration cost auto-tuning stuff seems to have been
> merged to Linus' tree since 2.6.15. On little one- or two-processor
> systems, the time required to measure the migration costs isn't very
> noticeable, but by the time we get to even a four-processor ia64 box,
> it adds about 30 seconds to the boot time, which seems like a lot.
>
> Is that expected? Is the information we get really worth that much?
> Could the measurement be done at run-time instead? Is there a smaller
> hammer we could use, e.g., flushing just the buffer rather than the
> *entire* cache? Did we just implement sched_cacheflush() incorrectly
> for ia64?
>
> Only ia64, x86, and x86_64 currently have a non-empty
> sched_cacheflush(), and the x86* ones contain only "wbinvd()". So I
> suspect that only ia64 sees this slowdown. But I would guess that
> other arches will implement it in the future.
the main cost comes from accessing the test-buffer when the buffer size
gets above the real cachesize. There are a couple of ways to improve
that:
- double-check that max_cache_size gets set up correctly on your
architecture - the code searches from ~64K to 2*max_cache_size.
- take the values that are auto-detected and use the migration_cost=
boot parameter - see Documentation/kernel-parameters.txt:
migration_cost=
[KNL,SMP] debug: override scheduler migration costs
Format: <level-1-usecs>,<level-2-usecs>,...
This debugging option can be used to override the
default scheduler migration cost matrix. The numbers
are indexed by 'CPU domain distance'.
E.g. migration_cost=1000,2000,3000 on an SMT NUMA
box will set up an intra-core migration cost of
1 msec, an inter-core migration cost of 2 msecs,
and an inter-node migration cost of 3 msecs.
(a distribution could do this automatically in the installer as well;
i've constructed the bootup printout to be in the format that is
needed for migration_cost. I have not tested this too extensively
though, so double-check the result via an additional
migration_debug=2 printout as well! Let me know if you find any bugs
here.)
via this solution you will get zero overhead on subsequent bootups.
(there's a sample command line after this list.)
- in kernel/sched.c, decrease ITERATIONS from 2 to 1. This will make the
measurement more noisy though.
- in kernel/sched.c, change this line:
size = size * 20 / 19;
to:
size = size * 10 / 9;
this will probably halve the cost - again at the expense of
accuracy and statistical stability.
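for example, a box whose bootup printed migration_cost=10091 could
simply boot with:

        migration_cost=10091 migration_debug=2

on the kernel command line (migration_debug=2 only to verify the
override the first time around).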
Ingo
On Mon, Jan 30, 2006 at 06:21:40PM +0100, Ingo Molnar wrote:
> - double-check that max_cache_size gets set up correctly on your
> architecture - the code searches from ~64K to 2*max_cache_size.
ia64 gets that from PAL in get_max_cacheline_size() in arch/ia64/kernel/setup.c.
A quick printk() in there confirms that we get the right answer (9MB for
me), and that it happens before we compute the migration cost.
> - take the values that are auto-detected and use the migration_cost=
> boot parameter - see Documentation/kernel-parameters.txt:
> ...
> via this solution you will get zero overhead on subsequent bootups.
But if you are going to go this route, you could drop all this code from
the kernel and have a hard-wired constant, with a user-mode test program
to compute the more accurate value.
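Something along these lines, roughly (buffer size, CPU numbers, and the
timing loop below are just placeholders to sketch the shape of such a
tool, not a worked-out replacement for the in-kernel algorithm):

/* migcost.c - rough sketch of a user-mode migration cost probe.
 * Warm a cache-sized buffer on one CPU, migrate to another, and time
 * how much longer the first re-read takes.
 * Build: gcc -O2 migcost.c -o migcost -lrt
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE (9*1024*1024)  /* e.g. the 9MB cache reported by PAL here */
#define STRIDE 64               /* assumed cache line size */

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set)) {
                perror("sched_setaffinity");
                exit(1);
        }
}

/* read every cache line of the buffer, return elapsed nanoseconds */
static long long scan(volatile char *buf)
{
        struct timespec t0, t1;
        long long sum = 0;
        size_t i;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < BUF_SIZE; i += STRIDE)
                sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        (void)sum;
        return (t1.tv_sec - t0.tv_sec) * 1000000000LL
                + (t1.tv_nsec - t0.tv_nsec);
}

int main(void)
{
        char *buf = malloc(BUF_SIZE);
        long long warm, cold;

        if (!buf)
                return 1;
        memset(buf, 1, BUF_SIZE);

        pin_to_cpu(0);          /* source CPU: get the buffer cache-hot */
        scan(buf);
        warm = scan(buf);

        pin_to_cpu(1);          /* target CPU: pay for the cold cache */
        cold = scan(buf);

        printf("warm=%lldns cold=%lldns estimated cost=%lldns\n",
               warm, cold, cold - warm);
        free(buf);
        return 0;
}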
> - in kernel/sched.c, decrease ITERATIONS from 2 to 1. This will make the
> measurement more noisy though.
Doing this drops the time to compute the value from 15.58s to 10.39s, while
the value of migration_cost changes from 10112 to 9909.
> - in kernel/sched.c, change this line:
> size = size * 20 / 19;
> to:
> size = size * 10 / 9;
Doing this instead of changing ITERATIONS makes the computation take 7.79s
and the computed migration_cost is 9987.
Doing both gets the time down to 5.20s, and the migration_cost=9990.
So the variation in the computed value of migration_cost was at worst
2% with these modifications to the algorithm. Do you really need to know
the value to this accuracy? What 2nd order bad effects would occur from
using an off-by-2% value for scheduling decisions?
On the plus side Prarit's results show that this time isn't scaling with
NR_CPUS ... apparently just cache size and number of domains are significant
in the time to compute.
-Tony
* Luck, Tony <[email protected]> wrote:
> Doing this drops the time to compute the value from 15.58s to 10.39s,
> while the value of migration_cost changes from 10112 to 9909.
>
> > - in kernel/sched.c, change this line:
> > size = size * 20 / 19;
> > to:
> > size = size * 10 / 9;
>
> Doing this instead of changing ITERATIONS makes the computation take
> 7.79s and the computed migration_cost is 9987.
>
> Doing both gets the time down to 5.20s, and the migration_cost=9990.
ok, that's good enough i think - we could certainly do the patch below
in v2.6.16.
> On the plus side Prarit's results show that this time isn't scaling
> with NR_CPUS ... apparently just cache size and number of domains are
> significant in the time to compute.
yes, this comes from the algorithm: it only computes once per distance
(and uses the cached value from then on), independently of the number of
CPUs.
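in essence it behaves like a small per-distance cache (simplified
illustration only, not the actual sched.c code - the names below are
made up):

/*
 * the expensive measurement runs only the first time a given domain
 * distance is encountered; every later CPU pair at that distance just
 * reuses the stored value.
 */
static unsigned long long cost_at_distance[MAX_DOMAIN_DISTANCE];

static unsigned long long migration_cost_for(int distance)
{
        if (!cost_at_distance[distance])
                cost_at_distance[distance] =
                        measure_one_cpu_pair(distance); /* the slow part */
        return cost_at_distance[distance];
}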
Ingo
---
reduce the amount of time the migration cost calculations cost during
bootup.
Signed-off-by: Ingo Molnar <[email protected]>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -5141,7 +5141,7 @@ static void init_sched_build_groups(stru
#define SEARCH_SCOPE 2
#define MIN_CACHE_SIZE (64*1024U)
#define DEFAULT_CACHE_SIZE (5*1024*1024U)
-#define ITERATIONS 2
+#define ITERATIONS 1
#define SIZE_THRESH 130
#define COST_THRESH 130
@@ -5480,9 +5480,9 @@ static unsigned long long measure_migrat
break;
}
/*
- * Increase the cachesize in 5% steps:
+ * Increase the cachesize in 10% steps:
*/
- size = size * 20 / 19;
+ size = size * 10 / 9;
}
if (migration_debug)
Ingo Molnar wrote on Monday, January 30, 2006 9:22 AM
> - in kernel/sched.c, decrease ITERATIONS from 2 to 1. This will make the
> measurement more noisy though.
>
> - in kernel/sched.c, change this line:
>
> size = size * 20 / 19;
>
> to:
>
> size = size * 10 / 9;
>
> this will probably halve the cost - again at the expense of
> accuracy and statistical stability.
I think the kernel should keep the accuracy and stability. One option
would be not to measure the migration cost by default. People who
want an accurate scheduler parameter could then turn on a boot-time
option to do the measurement.
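Something along these lines perhaps (the parameter name and wiring
below are just made up to illustrate the idea, not tested):

/* keep the hard-wired default unless the user explicitly asks for the
 * full boot-time measurement, e.g. via "migration_measure" on the
 * kernel command line (hypothetical parameter name) */
static int migration_measure __initdata;

static int __init setup_migration_measure(char *str)
{
        migration_measure = 1;
        return 1;
}
__setup("migration_measure", setup_migration_measure);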
- Ken
On Mon, Jan 30, 2006 at 08:24:38PM +0100, Ingo Molnar wrote:
> > Doing both gets the time down to 5.20s, and the migration_cost=9990.
>
> ok, that's good enough i think - we could certainly do the patch below
> in v2.6.16.
Might it be wise to see whether the 2% variation that I saw can be
repeated on some other architecture? Bjorn's initial post was just
questioning whether we need to spend this much time during boot to acquire
this data. Now we have *one* data point showing that on an ia64 with four
cpus and a 9MB cache in a single domain we can speed the calculation up by
a factor of three with only a 2% loss of accuracy. Can someone else try
this patch and post the before/after values for migration_cost from dmesg?
-Tony
---
reduce the amount of time the migration cost calculations cost during
bootup.
Signed-off-by: Ingo Molnar <[email protected]>
--- linux/kernel/sched.c.orig
+++ linux/kernel/sched.c
@@ -5141,7 +5141,7 @@ static void init_sched_build_groups(stru
#define SEARCH_SCOPE 2
#define MIN_CACHE_SIZE (64*1024U)
#define DEFAULT_CACHE_SIZE (5*1024*1024U)
-#define ITERATIONS 2
+#define ITERATIONS 1
#define SIZE_THRESH 130
#define COST_THRESH 130
@@ -5480,9 +5480,9 @@ static unsigned long long measure_migrat
break;
}
/*
- * Increase the cachesize in 5% steps:
+ * Increase the cachesize in 10% steps:
*/
- size = size * 20 / 19;
+ size = size * 10 / 9;
}
if (migration_debug)
From: "Luck, Tony" <[email protected]>
...
> So the variation in the computed value of migration_cost was at worst
> 2% with these modifications to the algorithm. Do you really need to know
> the value to this accuracy? What 2nd order bad effects would occur from
> using an off-by-2% value for scheduling decisions?
>
> On the plus side Prarit's results show that this time isn't scaling with
> NR_CPUS ... apparently just cache size and number of domains are significant
> in the time to compute.
Yes, the calculation is done just once per domain level, and a desire to
achieve great accuracy for the calculation presupposes that the cpuM-to-cpuN
migration cost for a given domain level is identical (or very close) across
all the CPU pairs. That is, for a given domain level, only one CPU pair is
chosen for the calculation. For the ia64/sn2 NUMA Altix, and I suspect for
other NUMA platforms, this just isn't true for the middle domain level (i.e.,
the level that appears when the CPU count is >32p): some CPU pairs are
"closer" than other pairs. The variation for other CPU pairs in this domain
level is certainly much greater than 2%.
John Hawkes
Tony,
>
>
> Might it be wise to see whether the 2% variation that I saw can be
> repeated on some other architecture? Can someone else try
> this patch and post the before/after values for migration_cost from dmesg?
>
Ask and ye shall receive ... on the 64p/64G system.
Pristine:
[ 9.942253] Brought up 64 CPUs
[ 9.942904] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.943995] build_sched_domains: start
[ 32.108439] migration_cost=0,32232,39021
[ 37.894391] build_sched_domains: end
Patched:
[ 0.001307] Calibrating delay loop... 2244.60 BogoMIPS (lpj=4489216)
[ 9.942308] Brought up 64 CPUs
[ 9.942812] Total of 64 processors activated (143654.91 BogoMIPS).
[ 18.080441] migration_cost=0,31934,38750
[ 23.865993] checking if image is initramfs... it is
P.
Prarit Bhargava wrote:
>
> Tony,
>
>>
>>
>> Might it be wise to see whether the 2% variation that I saw can be
>> repeated on some other architecture? Can someone else try
>> this patch and post the before/after values for migration_cost from
>> dmesg?
>>
Whoops. Let's try that again:
Pristine (with build_sched_domains measurement):
[ 9.942253] Brought up 64 CPUs
[ 9.942904] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.943995] build_sched_domains: start
[ 32.108439] migration_cost=0,32232,39021
[ 37.894391] build_sched_domains: end
Patched (with build_sched_domains measurement):
[ 9.942267] Brought up 64 CPUs
[ 9.942930] Total of 64 processors activated (143654.91 BogoMIPS).
[ 9.944032] build_sched_domains: beginmigration_cost=0,31854,38739
[ 23.868304] build_sched_domains: end
P.
>
On Mon, 30 Jan 2006, Tony Luck wrote:
> Might it be wise to see whether the 2% variation that I saw can be
> repeated on some other architecture? Bjorn's initial post was just
> questioning whether we need to spend this much time during boot to acquire
> this data. Now we have *one* data point showing that on an ia64 with four
> cpus and a 9MB cache in a single domain we can speed the calculation up by
> a factor of three with only a 2% loss of accuracy. Can someone else try
> this patch and post the before/after values for migration_cost from dmesg?
Before:
messages.1:Jan 24 01:19:45 d2 kernel: [ 6.377117] migration_cost=9352
messages.1:Jan 27 21:07:55 d2 kernel: [ 6.384871] migration_cost=9329
messages.1:Jan 28 11:00:32 d2 kernel: [ 6.384215] migration_cost=9338
messages.1:Jan 28 12:55:03 d2 kernel: [ 6.389189] migration_cost=9364
After:
messages:Jan 31 07:55:07 d2 kernel: [ 1.859359] migration_cost=9274
This was on a dual PII Xeon with 2MB L2 cache. About 3.5x as fast and
only 1% change.
Maybe the default could be to run the quick test with an option to run the
more-accurate one?
--
Chuck