2010-12-15 01:59:02

by Robin

Subject: [RFC 2/2] Make x86 calibrate_delay run in parallel.


On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
Andi Kleen suggested we rework the calibrate_delay calls to run in
parallel. With that code in place, a test boot of the same machine took
61 seconds to bring the cpus up. I am not sure how we beat the lpj=
case, but it did outperform.

One thing to note is that the total BogoMIPS value is also consistently higher.
I am wondering if this is an effect of the cores being in performance
mode. I did notice that the parallel calibrate_delay calls did cause the
fans on the machine to ramp up to full speed where the normal sequential
calls did not cause them to budge at all.
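
For reference, the per-cpu work being parallelized here is the classic
delay-loop calibration. A condensed sketch of the idea (simplified from
init/calibrate.c; illustrative, not the kernel's exact code):

/*
 * __delay(n) spins for n loop iterations; jiffies advances HZ times a
 * second.  The result, lpj, is how many __delay() iterations fit in
 * one timer tick.  (jiffies/__delay come from the usual kernel headers.)
 */
extern unsigned long volatile jiffies;
extern void __delay(unsigned long loops);

static unsigned long calibrate_delay_sketch(void)
{
	unsigned long lpj = 1 << 12;
	unsigned long bit, ticks;

	/* Coarse stage: double lpj until __delay(lpj) spans a tick. */
	for (;;) {
		ticks = jiffies;
		while (ticks == jiffies)
			;			/* align to a tick edge */
		ticks = jiffies;
		__delay(lpj);
		if (ticks != jiffies)
			break;			/* crossed a tick boundary */
		lpj <<= 1;
	}

	/* Fine stage: binary-search the lower bits of lpj. */
	lpj >>= 1;
	for (bit = lpj >> 1; bit; bit >>= 1) {
		lpj |= bit;
		ticks = jiffies;
		while (ticks == jiffies)
			;
		ticks = jiffies;
		__delay(lpj);
		if (ticks != jiffies)
			lpj &= ~bit;		/* too long, drop this bit */
	}
	return lpj;				/* loops per jiffy */
}

Each calibration burns a few dozen timer ticks, so serialized across
4096 cpus the ticks alone are on the order of the 318 seconds measured
above.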

Signed-off-by: Robin Holt <[email protected]>
To: Andi Kleen <[email protected]>
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>

---

Some before and after logs:

2 socket, 8 cores per socket, no hyperthreads:
Before:
[ 0.816215] Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok.
[ 1.463913] Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok.
[ 2.202919] Brought up 16 CPUs
[ 2.206325] Total of 16 processors activated (72523.23 BogoMIPS).
# grep bogomips /proc/cpuinfo
bogomips : 4532.81
bogomips : 4532.65
bogomips : 4532.64
bogomips : 4532.64
bogomips : 4532.65
bogomips : 4532.64
bogomips : 4532.64
bogomips : 4532.64
bogomips : 4532.72
bogomips : 4532.74
bogomips : 4532.72
bogomips : 4532.73
bogomips : 4532.74
bogomips : 4532.74
bogomips : 4532.74
bogomips : 4532.73


After:
[ 0.747991] UV: Map MMR_HI 0xf7e00000000 - 0xf7e04000000
[ 0.753913] UV: Map MMIOH_HI 0xf8000000000 - 0xf8100000000
[ 0.760314] Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok.
[ 0.990706] Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok.
[ 1.253240] Brought up 16 CPUs
[ 1.315378] Total of 16 processors activated (127783.49 BogoMIPS).
# grep bogomips /proc/cpuinfo
bogomips : 4533.49
bogomips : 7890.05
bogomips : 9699.67
bogomips : 10047.13
bogomips : 8276.11
bogomips : 8236.85
bogomips : 10062.50
bogomips : 11421.44
bogomips : 7920.28
bogomips : 7883.65
bogomips : 9700.00
bogomips : 9949.31
bogomips : 6448.05
bogomips : 6443.88
bogomips : 4738.22
bogomips : 4532.79

2 socket, 8 cores per socket, hyperthreaded:
Before:
[ 0.538499] Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok.
[ 1.323403] Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok.
[ 2.221987] Booting Node 0, Processors #16 #17 #18 #19 #20 #21 #22 #23 Ok.
[ 3.120388] Booting Node 1, Processors #24 #25 #26 #27 #28 #29 #30 #31 Ok.
[ 4.018423] Brought up 32 CPUs
[ 4.021833] Total of 32 processors activated (145083.20 BogoMIPS).
After:
[ 0.771327] Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok.
[ 1.001745] Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok.
[ 1.264354] Booting Node 0, Processors #16 #17 #18 #19 #20 #21 #22 #23 Ok.
[ 1.528090] Booting Node 1, Processors #24 #25 #26 #27 #28 #29 #30 #31 Ok.
[ 1.790866] Brought up 32 CPUs
[ 1.852380] Total of 32 processors activated (279493.75 BogoMIPS).


2 socket, 6 cores per socket, no hyperthreads:
Before:
[ 0.773336] Booting Node 0, Processors #1 #2 #3 #4 #5 Ok.
[ 1.233990] Booting Node 1, Processors #6 #7 #8 #9 #10 #11 Ok.
[ 1.784768] Brought up 12 CPUs
[ 1.788170] Total of 12 processors activated (63991.86 BogoMIPS).

After:
[ 0.721474] Booting Node 0, Processors #1 #2 #3 #4 #5 Ok.
[ 0.885791] Booting Node 1, Processors #6 #7 #8 #9 #10 #11 Ok.
[ 1.082249] Brought up 12 CPUs
[ 1.144426] Total of 12 processors activated (104214.24 BogoMIPS).


256 socket, 8 cores per socket, hyperthreaded:
Before:
[ 95.105108] Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok.
[ 95.768866] Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok.
...
[ 410.597682] Booting Node 254, Processors #4080 #4081 #4082 #4083 #4084 #4085 #4086 #4087 Ok.
[ 411.231708] Booting Node 255, Processors #4088 #4089 #4090 #4091 #4092 #4093 #4094 #4095 Ok.
[ 411.859404] Brought up 4096 CPUs
[ 411.861354] Total of 4096 processors activated (18569762.97 BogoMIPS).

After:
[ 68.491186] Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok.
[ 68.724012] Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok.
...
[ 127.713750] Booting Node 254, Processors #4080 #4081 #4082 #4083 #4084 #4085 #4086 #4087 Ok.
[ 127.842004] Booting Node 255, Processors #4088 #4089 #4090 #4091 #4092 #4093 #4094 #4095 Ok.
[ 127.969171] Brought up 4096 CPUs
[ 128.030130] Total of 4096 processors activated (19160610.04 BogoMIPS).

arch/x86/include/asm/cpumask.h | 1
arch/x86/kernel/cpu/common.c | 2 +
arch/x86/kernel/smpboot.c | 33 ++++++++++++++++++++++---------
3 files changed, 27 insertions(+), 9 deletions(-)

Index: parallelize_calibrate_delay/arch/x86/include/asm/cpumask.h
===================================================================
--- parallelize_calibrate_delay.orig/arch/x86/include/asm/cpumask.h 2010-12-14 18:49:25.414805459 -0600
+++ parallelize_calibrate_delay/arch/x86/include/asm/cpumask.h 2010-12-14 18:50:53.558972740 -0600
@@ -6,6 +6,7 @@
extern cpumask_var_t cpu_callin_mask;
extern cpumask_var_t cpu_callout_mask;
extern cpumask_var_t cpu_initialized_mask;
+extern cpumask_var_t cpu_calibrating_jiffies_mask;
extern cpumask_var_t cpu_sibling_setup_mask;

extern void setup_cpu_local_masks(void);
Index: parallelize_calibrate_delay/arch/x86/kernel/cpu/common.c
===================================================================
--- parallelize_calibrate_delay.orig/arch/x86/kernel/cpu/common.c 2010-12-14 18:49:25.414805459 -0600
+++ parallelize_calibrate_delay/arch/x86/kernel/cpu/common.c 2010-12-14 18:50:53.575016358 -0600
@@ -45,6 +45,7 @@
cpumask_var_t cpu_initialized_mask;
cpumask_var_t cpu_callout_mask;
cpumask_var_t cpu_callin_mask;
+cpumask_var_t cpu_calibrating_jiffies_mask;

/* representing cpus for which sibling maps can be computed */
cpumask_var_t cpu_sibling_setup_mask;
@@ -55,6 +56,7 @@ void __init setup_cpu_local_masks(void)
alloc_bootmem_cpumask_var(&cpu_initialized_mask);
alloc_bootmem_cpumask_var(&cpu_callin_mask);
alloc_bootmem_cpumask_var(&cpu_callout_mask);
+ alloc_bootmem_cpumask_var(&cpu_calibrating_jiffies_mask);
alloc_bootmem_cpumask_var(&cpu_sibling_setup_mask);
}

Index: parallelize_calibrate_delay/arch/x86/kernel/smpboot.c
===================================================================
--- parallelize_calibrate_delay.orig/arch/x86/kernel/smpboot.c 2010-12-14 18:50:53.439014660 -0600
+++ parallelize_calibrate_delay/arch/x86/kernel/smpboot.c 2010-12-14 18:50:53.623015192 -0600
@@ -52,6 +52,7 @@
#include <linux/gfp.h>

#include <asm/acpi.h>
+#include <asm/cpumask.h>
#include <asm/desc.h>
#include <asm/nmi.h>
#include <asm/irq.h>
@@ -265,15 +266,7 @@ static void __cpuinit smp_callin(void)
* Need to setup vector mappings before we enable interrupts.
*/
setup_vector_irq(smp_processor_id());
- /*
- * Get our bogomips.
- *
- * Need to enable IRQs because it can take longer and then
- * the NMI watchdog might kill us.
- */
- local_irq_enable();
- loops_per_jiffy = calibrate_delay(loops_per_jiffy);
- local_irq_disable();
+
pr_debug("Stack at about %p\n", &cpuid);

/*
@@ -294,6 +287,8 @@ static void __cpuinit smp_callin(void)
*/
notrace static void __cpuinit start_secondary(void *unused)
{
+ struct cpuinfo_x86 *c;
+
/*
* Don't put *anything* before cpu_init(), SMP booting is too
* fragile that we want to limit the things done here to the
@@ -327,6 +322,12 @@ notrace static void __cpuinit start_seco
wmb();

/*
+ * Indicate we are still calibrating jiffies. Do not sum bogomips
+ * yet.
+ */
+ cpumask_set_cpu(smp_processor_id(), cpu_calibrating_jiffies_mask);
+
+ /*
* We need to hold call_lock, so there is no inconsistency
* between the time smp_call_function() determines number of
* IPI recipients, and the time when the determination is made
@@ -349,6 +350,15 @@ notrace static void __cpuinit start_seco
/* enable local interrupts */
local_irq_enable();

+ c = &cpu_data(smp_processor_id());
+ /*
+ * Get our bogomips.
+ */
+ local_irq_enable();
+ c->loops_per_jiffy = calibrate_delay(loops_per_jiffy);
+ cpumask_clear_cpu(smp_processor_id(), cpu_calibrating_jiffies_mask);
+ smp_mb__after_clear_bit();
+
/* to prevent fake stack check failure in clock setup */
boot_init_stack_canary();

@@ -1190,6 +1200,11 @@ void __init native_smp_prepare_boot_cpu(

void __init native_smp_cpus_done(unsigned int max_cpus)
{
+ while (cpumask_weight(cpu_calibrating_jiffies_mask)) {
+ cpu_relax();
+ touch_nmi_watchdog();
+ }
+
pr_debug("Boot done.\n");

impress_friends();
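
The synchronization pattern above is simple: each AP sets its bit in
cpu_calibrating_jiffies_mask before it calibrates and clears it when
done, and the boot cpu spins in native_smp_cpus_done() until the mask
drains. A runnable userspace analogue of that rendezvous (a sketch
only; NCPUS and the sleep stand-in are illustrative):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NCPUS 4

static atomic_ulong calibrating_mask;

static void *secondary(void *arg)
{
	long cpu = (long)arg;

	usleep(10000 * (cpu + 1));	/* stand-in for calibrate_delay() */
	atomic_fetch_and(&calibrating_mask, ~(1UL << cpu));
	return NULL;			/* bit cleared: calibration done */
}

int main(void)
{
	pthread_t t[NCPUS];
	long cpu;

	/*
	 * Pre-set every AP's bit so the wait below cannot pass early.
	 * The patch orders this differently: each AP sets its own bit
	 * in start_secondary() before the boot cpu can reach
	 * native_smp_cpus_done().
	 */
	atomic_store(&calibrating_mask, (1UL << NCPUS) - 1);

	for (cpu = 0; cpu < NCPUS; cpu++)
		pthread_create(&t[cpu], NULL, secondary, (void *)cpu);

	while (atomic_load(&calibrating_mask))
		;	/* the kernel adds cpu_relax()/touch_nmi_watchdog() */

	printf("all secondaries calibrated\n");

	for (cpu = 0; cpu < NCPUS; cpu++)
		pthread_join(t[cpu], NULL);
	return 0;
}

(Compile with -pthread.)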


2010-12-16 08:34:59

by Thomas Gleixner

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On Tue, 14 Dec 2010, [email protected] wrote:
> On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> Andi Kleen suggested we rework the calibrate_delay calls to run in
> parallel. With that code in place, a test boot of the same machine took
> 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> case, but it did outperform.

If you know that all cpus are running at the same speed, then you can
set lpj=firstcpu and just calibrate the first cpu and take the value
for all others.
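
(For reference, lpj= is just a preset that short-circuits the
measurement; roughly, from init/calibrate.c:)

static unsigned long preset_lpj;

static int __init lpj_setup(char *str)
{
	preset_lpj = simple_strtoul(str, NULL, 0);
	return 1;
}
__setup("lpj=", lpj_setup);

(So booting with e.g. lpj=9065280 - the value a 4532.64-BogoMIPS cpu
would print, assuming HZ=250 - skips the measurement on every cpu.)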

> One thing to note is that the total BogoMIPS value is also consistently higher.
> I am wondering if this is an effect of the cores being in performance
> mode. I did notice that the parallel calibrate_delay calls did cause the

We really need to know that. I mean the change from:

> bogomips : 4532.81
> bogomips : 4532.65
> bogomips : 4532.64
> bogomips : 4532.64

to

> bogomips : 4533.49
> bogomips : 7890.05
> bogomips : 9699.67
> bogomips : 10047.13

looks strange. The deviation is more than a factor of 2, and this is on the
same socket. So before we push that into the tree we had better know
what's going on.
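
(Scale check, assuming HZ=250: the kernel prints bogomips as
lpj/(500000/HZ), so the spread above is a real factor-2 spread in lpj,
and hence in every later __delay(). Back-of-envelope:)

#include <stdio.h>

/* bogomips = lpj / (500000 / HZ), so lpj = bogomips * 500000 / HZ. */
int main(void)
{
	const double hz = 250.0;	/* assumed; adjust for your config */

	printf("lpj @  4532.64 BogoMIPS: %.0f\n", 4532.64 * 500000.0 / hz);
	printf("lpj @ 10047.13 BogoMIPS: %.0f\n", 10047.13 * 500000.0 / hz);
	return 0;	/* ~9065280 vs ~20094260 loops per jiffy */
}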

Thanks,

tglx

2011-03-31 04:46:49

by Yinghai Lu

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On Tue, Dec 14, 2010 at 5:58 PM, <[email protected]> wrote:
>
> On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> Andi Kleen suggested we rework the calibrate_delay calls to run in
> parallel. With that code in place, a test boot of the same machine took
> 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> case, but it did outperform.
>
> One thing to note is that the total BogoMIPS value is also consistently higher.
> I am wondering if this is an effect of the cores being in performance
> mode. I did notice that the parallel calibrate_delay calls did cause the
> fans on the machine to ramp up to full speed where the normal sequential
> calls did not cause them to budge at all.

Please check the attached patch; it should calibrate correctly.

Thanks

Yinghai


Attachments:
Make-x86-calibrate_delay-run-in-parallel.patch (7.24 kB)

2011-03-31 06:50:53

by Ingo Molnar

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.


* Yinghai Lu <[email protected]> wrote:

> On Tue, Dec 14, 2010 at 5:58 PM, <[email protected]> wrote:
> >
> > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > parallel. With that code in place, a test boot of the same machine took
> > 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> > case, but it did outperform.
> >
> > One thing to note is that the total BogoMIPS value is also consistently higher.
> > I am wondering if this is an effect of the cores being in performance
> > mode. I did notice that the parallel calibrate_delay calls did cause the
> > fans on the machine to ramp up to full speed where the normal sequential
> > calls did not cause them to budge at all.
>
> Please check the attached patch; it should calibrate correctly.
>
> Thanks
>
> Yinghai

> [PATCH -v2] x86: Make calibrate_delay run in parallel.
>
> On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> Andi Kleen suggested we rework the calibrate_delay calls to run in
> parallel.

The risk wit tat suggestion is that it will spectacularly miscalibrate on
hyperthreading systems.

Thanks,

Ingo

2011-03-31 06:58:28

by Ingo Molnar

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.


* Ingo Molnar <[email protected]> wrote:

>
> * Yinghai Lu <[email protected]> wrote:
>
> > On Tue, Dec 14, 2010 at 5:58 PM, <[email protected]> wrote:
> > >
> > > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > > parallel. With that code in place, a test boot of the same machine took
> > > 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> > > case, but it did outperform.
> > >
> > > One thing to note is that the total BogoMIPS value is also consistently higher.
> > > I am wondering if this is an effect of the cores being in performance
> > > mode. I did notice that the parallel calibrate_delay calls did cause the
> > > fans on the machine to ramp up to full speed where the normal sequential
> > > calls did not cause them to budge at all.
> >
> > Please check the attached patch; it should calibrate correctly.
> >
> > Thanks
> >
> > Yinghai
>
> > [PATCH -v2] x86: Make calibrate_delay run in parallel.
> >
> > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > parallel.
>
> The risk wit tat suggestion is that it will spectacularly miscalibrate on
> hyperthreading systems.

s/wit tat/with that

Thanks,

Ingo

2011-03-31 09:29:49

by Robin Holt

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On Wed, Mar 30, 2011 at 09:46:46PM -0700, Yinghai Lu wrote:
> On Tue, Dec 14, 2010 at 5:58 PM, <[email protected]> wrote:
> >
> > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> Andi Kleen suggested we rework the calibrate_delay calls to run in
> parallel. With that code in place, a test boot of the same machine took
> 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> case, but it did outperform.
>
> One thing to note is that the total BogoMIPS value is also consistently higher.
> I am wondering if this is an effect of the cores being in performance
> mode. I did notice that the parallel calibrate_delay calls did cause the
> > fans on the machine to ramp up to full speed where the normal sequential
> > calls did not cause them to budge at all.
>
> Please check the attached patch; it should calibrate correctly.
>
> Thanks
>
> Yinghai

> [PATCH -v2] x86: Make calibrate_delay run in parallel.
>
> On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> Andi Kleen suggested we rework the calibrate_delay calls to run in
> parallel.
>
> -v2: from Yinghai
> Two paths: one for initial boot cpus, and one for hotplug cpus.
> Initial path:
> after all cpus boot up and enter idle, use smp_call_function_many
> to let every AP call __calibrate_delay.
> We can not put that calibrate_delay after local_irq_enable
> in start_secondary(); at that time the cpu could be involved
> with perf_event with nmi_watchdog enabled, and that will cause
> strange calibration results.

If I understand your description above, that would cause the cpu's lpj
value to be too low if it did take an NMI, correct? The problem I was
seeing was additional cores on the socket got a value much higher than
the first core. I don't recall exact values. It would be something
like the second through fifth cores all got larger than the first, then
the sixth stayed the same as the fifth, and the seventh was slightly less
than the sixth, and finally the eighth was lower than the seventh.

I don't see how this patch would affect that. Has this been tested on
a multi-core intel cpu? I will try to test it today when I get to the
office.

Additionally, it takes the bogomips value from being part of an output
line and makes it a separate line. On a 4096 cpu system, that will mean
many additional lines of output. In the past, we have seen that will
cause a considerable slowdown as time is spent printing. Fortunately,
that is likely not going to slow things down as a secondary cpu will
likely be doing that work while the boot cpu is allowed to continue with
the boot. Is there really a value for a normal boot to have this output?
Can we remove the individual lines of output and just print the system
BogoMips value?

Thanks,
Robin

2011-03-31 09:37:27

by Robin Holt

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On Thu, Mar 31, 2011 at 08:58:05AM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Yinghai Lu <[email protected]> wrote:
> >
> > > On Tue, Dec 14, 2010 at 5:58 PM, <[email protected]> wrote:
> > > >
> > > > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > > > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > > > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > > > parallel. With that code in place, a test boot of the same machine took
> > > > 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> > > > case, but it did outperform.
> > > >
> > > > One thing to note is that the total BogoMIPS value is also consistently higher.
> > > > I am wondering if this is an effect of the cores being in performance
> > > > mode. I did notice that the parallel calibrate_delay calls did cause the
> > > > fans on the machine to ramp up to full speed where the normal sequential
> > > > calls did not cause them to budge at all.
> > >
> > > Please check the attached patch; it should calibrate correctly.
> > >
> > > Thanks
> > >
> > > Yinghai
> >
> > > [PATCH -v2] x86: Make calibrate_delay run in parallel.
> > >
> > > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > > parallel.
> >
> > The risk with that suggestion is that it will spectacularly miscalibrate on
> > hyperthreading systems.

I am not trying to be argumentative. I never got an understanding of
what was going wrong with that earlier patch and am hoping for some
understanding now.

Why does it spectacularly miscalibrate? Can anything be done to correct
that miscalibration? Doesn't this patch indicate another problem with
the calibration for hotplug cpus? Isn't there already a problem if
you boot a cpu normally, then hot-remove a hyperthread of a cpu, run a
userland task which fully loads up all the cores on that socket, then
hot-add that hyperthread back in? If the lpj value is that volatile,
what value does it really have?

Thank you for your insights,
Robin

2011-03-31 09:57:23

by Ingo Molnar

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.


* Robin Holt <[email protected]> wrote:

> On Thu, Mar 31, 2011 at 08:58:05AM +0200, Ingo Molnar wrote:
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > >
> > > * Yinghai Lu <[email protected]> wrote:
> > >
> > > > On Tue, Dec 14, 2010 at 5:58 PM, <[email protected]> wrote:
> > > > >
> > > > > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > > > > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > > > > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > > > > parallel. With that code in place, a test boot of the same machine took
> > > > > 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> > > > > case, but it did outperform.
> > > > >
> > > > > One thing to note is that the total BogoMIPS value is also consistently higher.
> > > > > I am wondering if this is an effect of the cores being in performance
> > > > > mode. I did notice that the parallel calibrate_delay calls did cause the
> > > > > fans on the machine to ramp up to full speed where the normal sequential
> > > > > calls did not cause them to budge at all.
> > > >
> > > > Please check the attached patch; it should calibrate correctly.
> > > >
> > > > Thanks
> > > >
> > > > Yinghai
> > >
> > > > [PATCH -v2] x86: Make calibrate_delay run in parallel.
> > > >
> > > > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > > > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > > > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > > > parallel.
> > >
> > > The risk with that suggestion is that it will spectacularly miscalibrate on
> > > hyperthreading systems.
>
> I am not trying to be argumentative. I never got an understanding of
> what was going wrong with that earlier patch and am hoping for some
> understanding now.

Well, if calibrate_delay() calls run in parallel then different hyperthreads
will impact each other.

> Why does it spectacularly miscalibrate? Can anything be done to correct
> that miscalibration? Doesn't this patch indicate another problem with
> the calibration for hotplug cpus? Isn't there already a problem if
> you boot a cpu normally, then hot-remove a hyperthread of a cpu, run a
> userland task which fully loads up all the cores on that socket, then
> hot-add that hyperthread back in? If the lpj value is that volatile,
> what value does it really have?

The typical CPU hotplug usecase is suspend/resume, where we bring down the CPUs
in a more or less controlled manner.

Yes, you could achieve something similar by frobbing /sys/*/*/online but that's
a big difference to *always* running the calibration loops in parallel.

I'd argue for the opposite direction: only calibrate a physical CPU once per
CPU per bootup - this would also make CPU hotplug faster btw.

( Virtual CPUs (KVM, etc.) need a recalibration per bringup, because the new
CPU could be running on different hardware - but that's a detail: 4096 UV
CPUs are not in this category. )

Really, there's no good reason why every CPU should be calibrated on a system
running identical CPUs, right? Mixed-frequency systems are rather elusive on
x86.
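
A minimal sketch of that once-per-package direction (illustrative and
untested; reuse_package_lpj is a made-up name):

/*
 * Reuse a previously calibrated core's loops_per_jiffy for any later
 * cpu in the same physical package; return 0 to request a real
 * calibration when no sibling has been calibrated yet.
 */
static unsigned long reuse_package_lpj(int cpu)
{
	int sibling;

	for_each_online_cpu(sibling) {
		if (sibling == cpu)
			continue;
		if (topology_physical_package_id(sibling) ==
		    topology_physical_package_id(cpu))
			return cpu_data(sibling).loops_per_jiffy;
	}
	return 0;	/* no calibrated sibling: run calibrate_delay() */
}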

Thanks,

Ingo

2011-03-31 10:31:18

by Avi Kivity

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On 03/31/2011 11:57 AM, Ingo Molnar wrote:
> >
> > I am not trying to be argumentative. I never got an understanding of
> > what was going wrong with that earlier patch and am hoping for some
> > understanding now.
>
> Well, if calibrate_delay() calls run in parallel then different hyperthreads
> will impact each other.

It's different but not more wrong. If delay() later runs on a thread
whose sibling is busy, it will in fact give more accurate results.

> > Why does it spectacularly miscalibrate? Can anything be done to correct
> > that miscalibration? Doesn't this patch indicate another problem with
> > the calibration for hotplug cpus? Isn't there already a problem if
> > you boot a cpu normally, then hot-remove a hyperthread of a cpu, run a
> > userland task which fully loads up all the cores on that socket, then
> > hot-add that hyperthread back in? If the lpj value is that volatile,
> > what value does it really have?
>
> The typical CPU hotplug usecase is suspend/resume, where we bring down the CPUs
> in a more or less controlled manner.
>
> Yes, you could achieve something similar by frobbing /sys/*/*/online but that's
> a big difference to *always* running the calibration loops in parallel.
>
> I'd argue for the opposite direction: only calibrate a physical CPU once per
> CPU per bootup - this would also make CPU hotplug faster btw.
>
> ( Virtual CPUs (KVM, etc.) need a recalibration per bringup, because the new
> CPU could be running on different hardware - but that's a detail: 4096 UV
> CPUs are not in this category. )

Virtual cpus change their performance dynamically due to overcommit,
live migration, the host scheduler rearranging them, etc.

> Really, there's no good reason why every CPU should be calibrated on a system
> running identical CPUs, right? Mixed-frequency systems are rather elusive on
> x86.

Good point. And udelay() users are probably not sensitive to accuracy
anyway (which changes with load and thermal conditions).

--
error compiling committee.c: too many arguments to function

2011-03-31 10:46:22

by Ingo Molnar

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.


* Avi Kivity <[email protected]> wrote:

> On 03/31/2011 11:57 AM, Ingo Molnar wrote:
> >>
> >> I am not trying to be argumentative. I never got an understanding of
> >> what was going wrong with that earlier patch and am hoping for some
> >> understanding now.
> >
> > Well, if calibrate_delay() calls run in parallel then different
> > hyperthreads will impact each other.
>
> It's different but not more wrong. If delay() later runs on a thread whose
> sibling is busy, it will in fact give more accurate results.

No, it's actively wrong: because it makes the delay loop *run faster* when
other siblings go idle.

I.e. this potentially shortens udelay(X) delays, which is far more dangerous
than the current conservative approach of potentially *lengthening* them.

> > Really, there's no good reason why every CPU should be calibrated on a
> > system running identical CPUs, right? Mixed-frequency systems are rather
> > elusive on x86.
>
> Good point. And udelay() users are probably not sensitive to accuracy anyway
> (which changes with load and thermal conditions).

True with one important distinction: they are only sensitive to one fact, that
the delay should not be *shorter* than specified. By shortening udelay() we
essentially overclock the hardware's tolerances - not good.
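
To put toy numbers on that (illustrative only; assumes HZ=250 and the
usual loops ~= lpj * HZ * usecs / 10^6 scaling for udelay()):

#include <stdio.h>

/*
 * If lpj was measured while the HT sibling kept the core busy, the
 * loop count comes out low; the same count then completes early once
 * the sibling goes idle, i.e. udelay() returns too soon.  The lpj
 * values below are made up for illustration (needs 64-bit long).
 */
int main(void)
{
	const unsigned long hz = 250, us = 100;
	unsigned long lpj_alone = 9065280;	/* sibling idle while calibrating */
	unsigned long lpj_busy = 4500000;	/* sibling loaded while calibrating */

	printf("udelay(%lu) loops, calibrated alone: %lu\n",
	       us, lpj_alone * hz * us / 1000000);
	printf("udelay(%lu) loops, calibrated busy:  %lu (expires early)\n",
	       us, lpj_busy * hz * us / 1000000);
	return 0;
}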

Thanks,

Ingo

2011-03-31 10:50:49

by Avi Kivity

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On 03/31/2011 12:46 PM, Ingo Molnar wrote:
> * Avi Kivity<[email protected]> wrote:
>
> > On 03/31/2011 11:57 AM, Ingo Molnar wrote:
> > >>
> > >> I am not trying to be argumentative. I never got an understanding of
> > >> what was going wrong with that earlier patch and am hoping for some
> > >> understanding now.
> > >
> > > Well, if calibrate_delay() calls run in parallel then different
> > > hyperthreads will impact each other.
> >
> > It's different but not more wrong. If delay() later runs on a thread whose
> > sibling is busy, it will in fact give more accurate results.
>
> No, it's actively wrong: because it makes the delay loop *run faster* when
> other siblings go idle.
>
> I.e. this potentially shortens udelay(X) delays, which is far more dangerous
> than the current conservative approach of potentially *lengthening* them.
>
> > > Really, there's no good reason why every CPU should be calibrated on a
> > > system running identical CPUs, right? Mixed-frequency systems are rather
> > > elusive on x86.
> >
> > Good point. And udelay() users are probably not sensitive to accuracy anyway
> > (which changes with load and thermal conditions).
>
> True with one important distinction: they are only sensitive to one fact, that
> the delay should not be *shorter* than specified. By shortening udelay() we
> essentially overclock the hardware's tolerances - not good.
>

Makes sense. But I think the thermally controlled cpu frequency
violates this in some way - if calibration is run while the cpu is hot
and udelay is later run when it is cool then it could execute faster.

How important is udelay() for hardware timing these days, btw?

--
error compiling committee.c: too many arguments to function

2011-03-31 11:13:43

by Ingo Molnar

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.


* Avi Kivity <[email protected]> wrote:

> How important is udelay() for hardware timing these days, btw?

Seems popular enough:

earth4:~/tip> git grep -w udelay drivers/ | wc -l
4843

And while it's probably less important than it used to be, we cannot really
know for sure. This is one of the rare occasions where some modest amount of
fear, uncertainty and doubt is justified.

Thanks,

Ingo

2011-03-31 11:50:18

by Robin Holt

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On Thu, Mar 31, 2011 at 11:57:05AM +0200, Ingo Molnar wrote:
>
> * Robin Holt <[email protected]> wrote:
>
> > On Thu, Mar 31, 2011 at 08:58:05AM +0200, Ingo Molnar wrote:
> > >
> > > * Ingo Molnar <[email protected]> wrote:
> > >
> > > >
> > > > * Yinghai Lu <[email protected]> wrote:
> > > >
> > > > > On Tue, Dec 14, 2010 at 5:58 PM, <[email protected]> wrote:
> > > > > >
> > > > > > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > > > > > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > > > > > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > > > > > parallel. With that code in place, a test boot of the same machine took
> > > > > > 61 seconds to bring the cpus up. I am not sure how we beat the lpj=
> > > > > > case, but it did outperform.
> > > > > >
> > > > > > One thing to note is that the total BogoMIPS value is also consistently higher.
> > > > > > I am wondering if this is an effect of the cores being in performance
> > > > > > mode. I did notice that the parallel calibrate_delay calls did cause the
> > > > > > fans on the machine to ramp up to full speed where the normal sequential
> > > > > > calls did not cause them to budge at all.
> > > > >
> > > > > Please check the attached patch; it should calibrate correctly.
> > > > >
> > > > > Thanks
> > > > >
> > > > > Yinghai
> > > >
> > > > > [PATCH -v2] x86: Make calibrate_delay run in parallel.
> > > > >
> > > > > On a 4096 cpu machine, we noticed that 318 seconds were taken for bringing
> > > > > up the cpus. By specifying lpj=<value>, we reduced that to 75 seconds.
> > > > > Andi Kleen suggested we rework the calibrate_delay calls to run in
> > > > > parallel.
> > > >
> > > > The risk with that suggestion is that it will spectacularly miscalibrate on
> > > > hyperthreading systems.
> >
> > I am not trying to be argumentative. I never got an understanding of
> > what was going wrong with that earlier patch and am hoping for some
> > understanding now.
>
> Well, if calibrate_delay() calls run in parallel then different hyperthreads
> will impact each other.
>
> > Why does it spectacularly miscalibrate? Can anything be done to correct
> > that miscalibration? Doesn't this patch indicate another problem with
> > the calibration for hotplug cpus? Isn't there already a problem if
> > you boot a cpu normally, then hot-remove a hyperthread of a cpu, run a
> > userland task which fully loads up all the cores on that socket, then
> > hot-add that hyperthread back in? If the lpj value is that volatile,
> > what value does it really have?
>
> The typical CPU hotplug usecase is suspend/resume, where we bring down the CPUs
> in a more or less controlled manner.
>
> Yes, you could achieve something similar by frobbing /sys/*/*/online but that's
> a big difference to *always* running the calibration loops in parallel.
>
> I'd argue for the opposite direction: only calibrate a physical CPU once per
> CPU per bootup - this would also make CPU hotplug faster btw.
>
> ( Virtual CPUs (KVM, etc.) need a recalibration per bringup, because the new
> CPU could be running on different hardware - but that's a detail: 4096 UV
> CPUs are not in this category. )
>
> Really, there's no good reason why every CPU should be calibrated on a system
> running identical CPUs, right? Mixed-frequency systems are rather elusive on
> x86.

I had argued earlier for calibrating a single core per socket.
I do not remember who indicated that would not work, but I do
recall something about some AMD hardware possibly not having the
same frequency for all cores. Do you know any details of any offering
where the individual cores on a socket can have different lpj values
(other than calculation noise)?

Robin

2011-03-31 14:25:42

by Yinghai Lu

Subject: Re: [RFC 2/2] Make x86 calibrate_delay run in parallel.

On 03/31/2011 02:29 AM, Robin Holt wrote:
...
> I don't see how this patch would affect that. Has this been tested on
> a multi-core intel cpu? I will try to test it today when I get to the
> office.

Yes. I tested it on our 8-socket, 10-core-per-socket Intel system and on
an 8-core system. It looks like it gets correct results.

>
> Additionally, it takes the bogomips value from being part of an output
> line and makes it a separate line. On a 4096 cpu system, that will mean
> many additional lines of output. In the past, we have seen that will
> cause a considerable slowdown as time is spent printing. Fortunately,
> that is likely not going to slow things down as a secondary cpu will
> likely be doing that work while the boot cpu is allowed to continue with
> the boot. Is there really a value for a normal boot to have this output?
> Can we remove the individual lines of output and just print the system
> BogoMips value?

That is easy; just update print_lpj:

static void __cpuinit print_lpj(int cpu, char *str, unsigned long lpj)
{
	static bool printed;

	if (printed)
		return;

	pr_info("CPU%d: Calibrating delay%s"
		"%lu.%02lu BogoMIPS (lpj=%lu)\n", cpu, str,
		lpj/(500000/HZ), (lpj/(5000/HZ)) % 100, lpj);

	/* only print cpu0 and cpu1 */
	if (cpu)
		printed = true;
}

The current printing is for debug purposes, so we don't need to wait for
booting to finish; we can just check it on the serial console.

Thanks

Yinghai