2015-05-04 09:45:07

by Jan H. Schönherr

[permalink] [raw]
Subject: [PATCH] x86: skip delays during SMP initialization similar to Xen

Remove the per-CPU delays during SMP initialization, which seems to be
possible on newer architectures with an x2APIC.

Xen does this since 2011. In fact, this commit is basically a
combination of the following Xen commits. The first removes the delays,
the second fixes an issue with the removal:

commit 68fce206f6dba9981e8322269db49692c95ce250
Author: Tim Deegan <[email protected]>
Date: Tue Jul 19 14:13:01 2011 +0100

x86: Remove timeouts from INIT-SIPI-SIPI sequence when using x2apic.

Some of the timeouts are pointless since they're waiting for the ICR
to ack the IPI delivery and that doesn't happen on x2apic.
The others should be benign (and are suggested in the SDM) but
removing them makes AP bringup much more reliable on some test boxes.

Signed-off-by: Tim Deegan <[email protected]>

commit f12ee533150761df5a7099c83f2a5fa6c07d1187
Author: Gang Wei <[email protected]>
Date: Thu Dec 29 10:07:54 2011 +0000

X86: Add a delay between INIT & SIPIs for tboot AP bring-up in X2APIC case

Without this delay, Xen could not bring APs up while working with
TXT/tboot, because tboot needs some time in APs to handle INIT before
becoming ready for receiving SIPIs (this delay was removed as part of
c/s 23724 by Tim Deegan).

Signed-off-by: Gang Wei <[email protected]>
Acked-by: Keir Fraser <[email protected]>
Acked-by: Tim Deegan <[email protected]>
Committed-by: Tim Deegan <[email protected]>

Signed-off-by: Jan H. Schönherr <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Tim Deegan <[email protected]>
Cc: Gang Wei <[email protected]>
---
arch/x86/kernel/smpboot.c | 61 +++++++++++++++++++++++++++++------------------
1 file changed, 38 insertions(+), 23 deletions(-)

This patch is against Linux 4.1-rc1.

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 50e547e..63b4641 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -555,7 +555,7 @@ wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip)
static int
wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
{
- unsigned long send_status, accept_status = 0;
+ unsigned long send_status = 0, accept_status = 0;
int maxlvt, num_starts, j;

maxlvt = lapic_get_maxlvt();
@@ -580,22 +580,34 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
phys_apicid);

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ if (!cpu_has_x2apic) {
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();

- mdelay(10);
+ mdelay(10);

- pr_debug("Deasserting INIT\n");
+ pr_debug("Deasserting INIT\n");

- /* Target chip */
- /* Send IPI */
- apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+ /* Target chip */
+ /* Send IPI */
+ apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();

- mb();
- atomic_set(&init_deasserted, 1);
+ mb();
+ atomic_set(&init_deasserted, 1);
+ } else if (tboot_enabled()) {
+ /*
+ * With tboot AP is actually spinning in a mini-guest before
+ * receiving INIT. Upon receiving INIT ipi, AP need time to
+ * VMExit, update VMCS to tracking SIPIs and VMResume.
+ *
+ * While AP is in root mode handling the INIT the CPU will drop
+ * any SIPIs
+ */
+ udelay(10);
+ }

/*
* Should we send STARTUP IPIs ?
@@ -637,20 +649,23 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12),
phys_apicid);

- /*
- * Give the other CPU some time to accept the IPI.
- */
- udelay(300);
+ if (!cpu_has_x2apic) {
+ /*
+ * Give the other CPU some time to accept the IPI.
+ */
+ udelay(300);

- pr_debug("Startup point 1\n");
+ pr_debug("Startup point 1\n");

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();
+
+ /*
+ * Give the other CPU some time to accept the IPI.
+ */
+ udelay(200);
+ }

- /*
- * Give the other CPU some time to accept the IPI.
- */
- udelay(200);
if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
apic_write(APIC_ESR, 0);
accept_status = (apic_read(APIC_ESR) & 0xEF);
--
2.3.1.dirty


2015-05-06 08:28:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Jan H. Sch?nherr <[email protected]> wrote:

> Remove the per-CPU delays during SMP initialization, which seems to be
> possible on newer architectures with an x2APIC.
>
> Xen does this since 2011. In fact, this commit is basically a
> combination of the following Xen commits. The first removes the delays,
> the second fixes an issue with the removal:
>
> commit 68fce206f6dba9981e8322269db49692c95ce250
> Author: Tim Deegan <[email protected]>
> Date: Tue Jul 19 14:13:01 2011 +0100
>
> x86: Remove timeouts from INIT-SIPI-SIPI sequence when using x2apic.
>
> Some of the timeouts are pointless since they're waiting for the ICR
> to ack the IPI delivery and that doesn't happen on x2apic.
> The others should be benign (and are suggested in the SDM) but
> removing them makes AP bringup much more reliable on some test boxes.
>
> Signed-off-by: Tim Deegan <[email protected]>
>
> commit f12ee533150761df5a7099c83f2a5fa6c07d1187
> Author: Gang Wei <[email protected]>
> Date: Thu Dec 29 10:07:54 2011 +0000
>
> X86: Add a delay between INIT & SIPIs for tboot AP bring-up in X2APIC case
>
> Without this delay, Xen could not bring APs up while working with
> TXT/tboot, because tboot needs some time in APs to handle INIT before
> becoming ready for receiving SIPIs (this delay was removed as part of
> c/s 23724 by Tim Deegan).
>
> Signed-off-by: Gang Wei <[email protected]>
> Acked-by: Keir Fraser <[email protected]>
> Acked-by: Tim Deegan <[email protected]>
> Committed-by: Tim Deegan <[email protected]>
>
> Signed-off-by: Jan H. Sch?nherr <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Cc: Tim Deegan <[email protected]>
> Cc: Gang Wei <[email protected]>
> ---
> arch/x86/kernel/smpboot.c | 61 +++++++++++++++++++++++++++++------------------
> 1 file changed, 38 insertions(+), 23 deletions(-)
>
> This patch is against Linux 4.1-rc1.

So I really like this, as it nicely side-steps the 'when should we do
the legacy delays' issue by flagging on x2apic support.

If anyone has objections, please holler.

Thanks,

Ingo

Subject: [tip:x86/apic] x86/smpboot: Skip delays during SMP initialization similar to Xen

Commit-ID: f5d6a52f511157c7476590532a23b5664b1ed877
Gitweb: http://git.kernel.org/tip/f5d6a52f511157c7476590532a23b5664b1ed877
Author: Jan H. Schönherr <[email protected]>
AuthorDate: Mon, 4 May 2015 11:42:34 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 6 May 2015 10:24:51 +0200

x86/smpboot: Skip delays during SMP initialization similar to Xen

Remove the per-CPU delays during SMP initialization, which seems
to be possible on newer architectures with an x2APIC.

Xen does this since 2011. In fact, this commit is basically a
combination of the following Xen commits. The first removes the
delays, the second fixes an issue with the removal:

commit 68fce206f6dba9981e8322269db49692c95ce250
Author: Tim Deegan <[email protected]>
Date: Tue Jul 19 14:13:01 2011 +0100

x86: Remove timeouts from INIT-SIPI-SIPI sequence when using x2apic.

Some of the timeouts are pointless since they're waiting for the ICR
to ack the IPI delivery and that doesn't happen on x2apic.
The others should be benign (and are suggested in the SDM) but
removing them makes AP bringup much more reliable on some test boxes.

Signed-off-by: Tim Deegan <[email protected]>

commit f12ee533150761df5a7099c83f2a5fa6c07d1187
Author: Gang Wei <[email protected]>
Date: Thu Dec 29 10:07:54 2011 +0000

X86: Add a delay between INIT & SIPIs for tboot AP bring-up in X2APIC case

Without this delay, Xen could not bring APs up while working with
TXT/tboot, because tboot needs some time in APs to handle INIT before
becoming ready for receiving SIPIs (this delay was removed as part of
c/s 23724 by Tim Deegan).

Signed-off-by: Gang Wei <[email protected]>
Acked-by: Keir Fraser <[email protected]>
Acked-by: Tim Deegan <[email protected]>
Committed-by: Tim Deegan <[email protected]>

Signed-off-by: Jan H. Schönherr <[email protected]>
Cc: Anthony Liguori <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Gang Wei <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Deegan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/smpboot.c | 61 +++++++++++++++++++++++++++++------------------
1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 50e547e..63b4641 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -555,7 +555,7 @@ wakeup_secondary_cpu_via_nmi(int apicid, unsigned long start_eip)
static int
wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
{
- unsigned long send_status, accept_status = 0;
+ unsigned long send_status = 0, accept_status = 0;
int maxlvt, num_starts, j;

maxlvt = lapic_get_maxlvt();
@@ -580,22 +580,34 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
phys_apicid);

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ if (!cpu_has_x2apic) {
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();

- mdelay(10);
+ mdelay(10);

- pr_debug("Deasserting INIT\n");
+ pr_debug("Deasserting INIT\n");

- /* Target chip */
- /* Send IPI */
- apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+ /* Target chip */
+ /* Send IPI */
+ apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();

- mb();
- atomic_set(&init_deasserted, 1);
+ mb();
+ atomic_set(&init_deasserted, 1);
+ } else if (tboot_enabled()) {
+ /*
+ * With tboot AP is actually spinning in a mini-guest before
+ * receiving INIT. Upon receiving INIT ipi, AP need time to
+ * VMExit, update VMCS to tracking SIPIs and VMResume.
+ *
+ * While AP is in root mode handling the INIT the CPU will drop
+ * any SIPIs
+ */
+ udelay(10);
+ }

/*
* Should we send STARTUP IPIs ?
@@ -637,20 +649,23 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12),
phys_apicid);

- /*
- * Give the other CPU some time to accept the IPI.
- */
- udelay(300);
+ if (!cpu_has_x2apic) {
+ /*
+ * Give the other CPU some time to accept the IPI.
+ */
+ udelay(300);

- pr_debug("Startup point 1\n");
+ pr_debug("Startup point 1\n");

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();
+
+ /*
+ * Give the other CPU some time to accept the IPI.
+ */
+ udelay(200);
+ }

- /*
- * Give the other CPU some time to accept the IPI.
- */
- udelay(200);
if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
apic_write(APIC_ESR, 0);
accept_status = (apic_read(APIC_ESR) & 0xEF);

2015-05-07 01:27:41

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

> So I really like this, as it nicely side-steps the 'when should we do
> the legacy delays' issue by flagging on x2apic support.
>
> If anyone has objections, please holler.

We should have no delays even for many processors that lack x2apic.

Len Brown, Intel Open Source Technology Center

2015-05-07 10:24:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Len Brown <[email protected]> wrote:

> > So I really like this, as it nicely side-steps the 'when should we
> > do the legacy delays' issue by flagging on x2apic support.
> >
> > If anyone has objections, please holler.
>
> We should have no delays even for many processors that lack x2apic.

Yes, agreed, but this looks like a good (and safe) first step:
especially systems that have large CPU counts tend to have x2apic
support - and that is where such a change matters most.

I'm not against doing a zero delay approach either, if it can be
triggered robustly.

Here's what the boot time looks like on a 120 CPUs system, with the
patch applied:

[ 0.558947] x86: Booting SMP configuration:
[ 0.563375] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
[ 0.644851] .... node #1, CPUs: #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29
[ 0.830474] .... node #2, CPUs: #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44
[ 1.016357] .... node #3, CPUs: #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59
[ 1.202342] .... node #0, CPUs: #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74
[ 1.283864] .... node #1, CPUs: #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89
[ 1.397131] .... node #2, CPUs: #90 #91 #92 #93 #94 #95 #96 #97 #98 #99 #100 #101 #102 #103 #104
[ 1.510417] .... node #3, CPUs: #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
[ 1.620967] x86: Booted up 4 nodes, 120 CPUs
[ 1.625928] smpboot: Total of 120 processors activated (672866.16 BogoMIPS)

1.1 seconds to boot 120 CPUs, 10.8 seconds to hit init, that's an
entirely reasonable runtime I think.

It was 20+ seconds before that, 10+ seconds for the SMP bootup
sequence.

Thanks,

Ingo

2015-05-12 08:03:39

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

On Thu, May 7, 2015 at 6:23 AM, Ingo Molnar <[email protected]> wrote:

These numbers do not look right.

> Here's what the boot time looks like on a 120 CPUs system, with the
> patch applied:
>
> [ 0.558947] x86: Booting SMP configuration:
> [ 0.563375] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
> [ 0.644851] .... node #1, CPUs: #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29
> [ 0.830474] .... node #2, CPUs: #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44
> [ 1.016357] .... node #3, CPUs: #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59
> [ 1.202342] .... node #0, CPUs: #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74
> [ 1.283864] .... node #1, CPUs: #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89
> [ 1.397131] .... node #2, CPUs: #90 #91 #92 #93 #94 #95 #96 #97 #98 #99 #100 #101 #102 #103 #104
> [ 1.510417] .... node #3, CPUs: #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
> [ 1.620967] x86: Booted up 4 nodes, 120 CPUs
> [ 1.625928] smpboot: Total of 120 processors activated (672866.16 BogoMIPS)
>
> 1.1 seconds to boot 120 CPUs, 10.8 seconds to hit init, that's an
> entirely reasonable runtime I think.
>
> It was 20+ seconds before that, 10+ seconds for the SMP bootup
> sequence.

(1.625928-0.558947) = 1.07 seconds to online 119 additional cpus.
/119 = .0089662268 each, lets call it 9ms.

Here is my ivb-ex running stock fedora 21's Linux-3.19 (no patch applied):

[ 0.404369] x86: Booting SMP configuration:
[ 0.409492] .... node #0, CPUs: #1
[ 0.439900] NMI watchdog: enabled on all CPUs, permanently consumes
one hw-PMU counter.
[ 0.450533] #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
[ 0.652409] .... node #1, CPUs: #15 #16 #17 #18 #19 #20
#21 #22 #23 #24 #25 #26 #27 #28 #29
[ 0.999645] .... node #2, CPUs: #30 #31 #32 #33 #34 #35
#36 #37 #38 #39 #40 #41 #42 #43 #44
[ 1.346991] .... node #3, CPUs: #45 #46 #47 #48 #49 #50
#51 #52 #53 #54 #55 #56 #57 #58 #59
[ 1.694171] .... node #0, CPUs: #60 #61 #62 #63 #64 #65
#66 #67 #68 #69 #70 #71 #72 #73 #74
[ 1.928248] .... node #1, CPUs: #75 #76 #77 #78 #79 #80
#81 #82 #83 #84 #85 #86 #87 #88 #89
[ 2.198370] .... node #2, CPUs: #90 #91 #92 #93 #94 #95
#96 #97 #98 #99 #100 #101 #102 #103 #104
[ 2.468574] .... node #3, CPUs: #105 #106 #107 #108 #109 #110
#111 #112 #113 #114 #115 #116 #117 #118 #119
[ 2.737884] x86: Booted up 4 nodes, 120 CPUs
[ 2.743758] smpboot: Total of 120 processors activated (671097.18 BogoMIPS)

(2.743758-0.404369) = 2.339389 for all 119 processors
/119 = .01965873109243697478 - lets call it 19ms each

so this baseline case is 19ms/processor -- which matches above, where
we delete 10ms/processor.
But even at 19ms each, this is totals only 2.3 seconds for all 119 processors.
So I don't understand your reference to 10+ seconds, above.

thanks,
Len Brown, Intel Open Source Technology Center

ps. I wish the BIOS boot time on this box were as fast as Linux:-)

2015-05-12 09:43:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Len Brown <[email protected]> wrote:

> On Thu, May 7, 2015 at 6:23 AM, Ingo Molnar <[email protected]> wrote:
>
> These numbers do not look right.
>
> > Here's what the boot time looks like on a 120 CPUs system, with the
> > patch applied:
> >
> > [ 0.558947] x86: Booting SMP configuration:
> > [ 0.563375] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
> > [ 0.644851] .... node #1, CPUs: #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29
> > [ 0.830474] .... node #2, CPUs: #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44
> > [ 1.016357] .... node #3, CPUs: #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59
> > [ 1.202342] .... node #0, CPUs: #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74
> > [ 1.283864] .... node #1, CPUs: #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89
> > [ 1.397131] .... node #2, CPUs: #90 #91 #92 #93 #94 #95 #96 #97 #98 #99 #100 #101 #102 #103 #104
> > [ 1.510417] .... node #3, CPUs: #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
> > [ 1.620967] x86: Booted up 4 nodes, 120 CPUs
> > [ 1.625928] smpboot: Total of 120 processors activated (672866.16 BogoMIPS)
> >
> > 1.1 seconds to boot 120 CPUs, 10.8 seconds to hit init, that's an
> > entirely reasonable runtime I think.
> >
> > It was 20+ seconds before that, 10+ seconds for the SMP bootup
> > sequence.
>
> (1.625928-0.558947) = 1.07 seconds to online 119 additional cpus.
> /119 = .0089662268 each, lets call it 9ms.
>
> Here is my ivb-ex running stock fedora 21's Linux-3.19 (no patch applied):
>
> [ 0.404369] x86: Booting SMP configuration:
> [ 0.409492] .... node #0, CPUs: #1
> [ 0.439900] NMI watchdog: enabled on all CPUs, permanently consumes
> one hw-PMU counter.
> [ 0.450533] #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
> [ 0.652409] .... node #1, CPUs: #15 #16 #17 #18 #19 #20
> #21 #22 #23 #24 #25 #26 #27 #28 #29
> [ 0.999645] .... node #2, CPUs: #30 #31 #32 #33 #34 #35
> #36 #37 #38 #39 #40 #41 #42 #43 #44
> [ 1.346991] .... node #3, CPUs: #45 #46 #47 #48 #49 #50
> #51 #52 #53 #54 #55 #56 #57 #58 #59
> [ 1.694171] .... node #0, CPUs: #60 #61 #62 #63 #64 #65
> #66 #67 #68 #69 #70 #71 #72 #73 #74
> [ 1.928248] .... node #1, CPUs: #75 #76 #77 #78 #79 #80
> #81 #82 #83 #84 #85 #86 #87 #88 #89
> [ 2.198370] .... node #2, CPUs: #90 #91 #92 #93 #94 #95
> #96 #97 #98 #99 #100 #101 #102 #103 #104
> [ 2.468574] .... node #3, CPUs: #105 #106 #107 #108 #109 #110
> #111 #112 #113 #114 #115 #116 #117 #118 #119
> [ 2.737884] x86: Booted up 4 nodes, 120 CPUs
> [ 2.743758] smpboot: Total of 120 processors activated (671097.18 BogoMIPS)
>
> (2.743758-0.404369) = 2.339389 for all 119 processors
> /119 = .01965873109243697478 - lets call it 19ms each
>
> so this baseline case is 19ms/processor -- which matches above, where
> we delete 10ms/processor.

> But even at 19ms each, this is totals only 2.3 seconds for all 119
> processors. So I don't understand your reference to 10+ seconds,
> above.

Hm, so what I was referring to is:

[ 0.205318] x86: Booting SMP configuration:
[ 0.206005] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95 #96 #97 #98 #99 #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
[ 9.021191] x86: Booted up 1 node, 120 CPUs
[ 9.021803] smpboot: Total of 120 processors activated (672102.75 BogoMIPS)
[ 9.027282] devtmpfs: initialized

So it's 8.9 seconds - but still much more than you'd expect. The delay
is evenly distributed amongst CPUs.

> ps. I wish the BIOS boot time on this box were as fast as Linux:-)

LOL! :-)

Thanks,

Ingo

2015-05-12 09:47:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Len Brown <[email protected]> wrote:

> > It was 20+ seconds before that, 10+ seconds for the SMP bootup
> > sequence.
>
> (1.625928-0.558947) = 1.07 seconds to online 119 additional cpus.
> /119 = .0089662268 each, lets call it 9ms.
>
> Here is my ivb-ex running stock fedora 21's Linux-3.19 (no patch applied):
>
> [ 0.404369] x86: Booting SMP configuration:
> [ 0.409492] .... node #0, CPUs: #1
> [ 0.439900] NMI watchdog: enabled on all CPUs, permanently consumes
> one hw-PMU counter.
> [ 0.450533] #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14
> [ 0.652409] .... node #1, CPUs: #15 #16 #17 #18 #19 #20
> #21 #22 #23 #24 #25 #26 #27 #28 #29
> [ 0.999645] .... node #2, CPUs: #30 #31 #32 #33 #34 #35
> #36 #37 #38 #39 #40 #41 #42 #43 #44
> [ 1.346991] .... node #3, CPUs: #45 #46 #47 #48 #49 #50
> #51 #52 #53 #54 #55 #56 #57 #58 #59
> [ 1.694171] .... node #0, CPUs: #60 #61 #62 #63 #64 #65
> #66 #67 #68 #69 #70 #71 #72 #73 #74
> [ 1.928248] .... node #1, CPUs: #75 #76 #77 #78 #79 #80
> #81 #82 #83 #84 #85 #86 #87 #88 #89
> [ 2.198370] .... node #2, CPUs: #90 #91 #92 #93 #94 #95
> #96 #97 #98 #99 #100 #101 #102 #103 #104
> [ 2.468574] .... node #3, CPUs: #105 #106 #107 #108 #109 #110
> #111 #112 #113 #114 #115 #116 #117 #118 #119
> [ 2.737884] x86: Booted up 4 nodes, 120 CPUs
> [ 2.743758] smpboot: Total of 120 processors activated (671097.18 BogoMIPS)
>
> (2.743758-0.404369) = 2.339389 for all 119 processors
> /119 = .01965873109243697478 - lets call it 19ms each
>
> so this baseline case is 19ms/processor -- which matches above, where
> we delete 10ms/processor.
> But even at 19ms each, this is totals only 2.3 seconds for all 119 processors.
> So I don't understand your reference to 10+ seconds, above.

So I was booting 120 CPUs with kvmtool (tools/kvm/ under -tip).

Even with your patches applied it's 7 seconds (config attached):

[ 0.152189] x86: Booting SMP configuration:
[ 0.152705] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95 #96 #97 #98 #99 #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
[ 7.627192] x86: Booted up 1 node, 120 CPUs
[ 7.627795] smpboot: Total of 120 processors activated (672119.13 BogoMIPS)
[ 7.633325] devtmpfs: initialized

so there's some other delay going on. It could very well be a kvmtool
related initialization delay?

Thanks,

Ingo


Attachments:
(No filename) (2.93 kB)
config (96.41 kB)
Download all attachments

2015-05-12 10:12:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

On Tue, May 12, 2015 at 11:47:09AM +0200, Ingo Molnar wrote:
> So I was booting 120 CPUs with kvmtool (tools/kvm/ under -tip).
>
> Even with your patches applied it's 7 seconds (config attached):
>
> [ 0.152189] x86: Booting SMP configuration:
> [ 0.152705] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64 #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95 #96 #97 #98 #99 #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
> [ 7.627192] x86: Booted up 1 node, 120 CPUs
> [ 7.627795] smpboot: Total of 120 processors activated (672119.13 BogoMIPS)
> [ 7.633325] devtmpfs: initialized
>
> so there's some other delay going on. It could very well be a kvmtool
> related initialization delay?

10 seconds with qemu+kvm and latest tip/master:

[ 0.176297] x86: Booting SMP configuration:
[ 0.179640] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35 #36 #37 #38 #39 #40 #41 #42 #43 #44 #45 #46 #47 #48 #49 #50 #51 #52 #53 #54 #55 #56 #57 #58 #59 #60 #61 #62 #63 #64
[ 0.036000] ------------[ cut here ]------------
[ 0.036000] WARNING: CPU: 64 PID: 0 at kernel/locking/lockdep.c:2755 lockdep_trace_alloc+0x10c/0x120()
[ 0.036000] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[ 0.036000] Modules linked in:
[ 0.036000] CPU: 64 PID: 0 Comm: swapper/64 Not tainted 4.1.0-rc3+ #6
[ 0.036000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 0.036000] ffffffff818c0f15 ffff88006d357d08 ffffffff8168c89b 0000000000000000
[ 0.036000] ffff88006d357d58 ffff88006d357d48 ffffffff81057d55 ffff88006d357da8
[ 0.036000] 0000000000000080 0000000000000046 ffffffff81a4fb80 ffff88006d0af020
[ 0.036000] Call Trace:
[ 0.036000] [<ffffffff8168c89b>] dump_stack+0x4f/0x7b
[ 0.036000] [<ffffffff81057d55>] warn_slowpath_common+0x95/0xe0
[ 0.036000] [<ffffffff81057de6>] warn_slowpath_fmt+0x46/0x50
[ 0.036000] [<ffffffff810a6d8c>] lockdep_trace_alloc+0x10c/0x120
[ 0.036000] [<ffffffff8113c82d>] __alloc_pages_nodemask+0xad/0xa20
[ 0.036000] [<ffffffff81691550>] ? mutex_lock_nested+0x2d0/0x3f0
[ 0.036000] [<ffffffff8132e2b3>] ? __this_cpu_preempt_check+0x13/0x20
[ 0.036000] [<ffffffff810a21ab>] ? trace_hardirqs_off_caller+0xcb/0x170
[ 0.036000] [<ffffffff81691524>] ? mutex_lock_nested+0x2a4/0x3f0
[ 0.036000] [<ffffffff8100a426>] ? init_espfix_ap.part.5+0xb6/0x270
[ 0.036000] [<ffffffff81096aea>] ? update_max_interval+0x1a/0x40
[ 0.036000] [<ffffffff8113d1bd>] __get_free_pages+0x1d/0x60
[ 0.036000] [<ffffffff8100a4d4>] init_espfix_ap.part.5+0x164/0x270
[ 0.036000] [<ffffffff8100a601>] init_espfix_ap+0x21/0x30
[ 0.036000] [<ffffffff8103c708>] start_secondary+0xe8/0x180
[ 0.036000] ---[ end trace 2312a8d05943a0d2 ]---
[ 5.744137] #65 #66 #67 #68 #69 #70 #71 #72 #73 #74 #75 #76 #77 #78 #79 #80 #81 #82 #83 #84 #85 #86 #87 #88 #89 #90 #91 #92 #93 #94 #95 #96 #97 #98 #99 #100 #101 #102 #103 #104 #105 #106 #107 #108 #109 #110 #111 #112 #113 #114 #115 #116 #117 #118 #119
[ 10.416459] x86: Booted up 1 node, 120 CPUs
[ 10.420012] smpboot: Total of 120 processors activated (963228.36 BogoMIPS)

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

Subject: [tip:x86/apic] Revert f5d6a52f5111 ("x86/smpboot: Skip delays during SMP initialization similar to Xen")

Commit-ID: 853b160aaafbe27d6304c8832bb7340d57c6b04e
Gitweb: http://git.kernel.org/tip/853b160aaafbe27d6304c8832bb7340d57c6b04e
Author: Ingo Molnar <[email protected]>
AuthorDate: Wed, 13 May 2015 08:40:49 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 13 May 2015 08:40:49 +0200

Revert f5d6a52f5111 ("x86/smpboot: Skip delays during SMP initialization similar to Xen")

Huang Ying reported x86 boot hangs due to this commit.

Turns out that the change, despite its changelog, does more
than just change timeouts: it also changes the way we
assert/deassert INIT via the APIC_DM_INIT IPI, in the x2apic
case it skips the deassert step.

This is historically fragile code and the patch did not
improve it, so revert these changes.

This commit:

1a744cb356c5 ("x86/smp/boot: Remove 10ms delay from cpu_up() on modern processors")

independently removes the worst of the delays (the 10 msec delay).

The remaining delays can be addressed one by one, combined
with careful testing.

Reported-by: Huang Ying <[email protected]>
Cc: Anthony Liguori <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Gang Wei <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jan H. Schönherr <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Deegan <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/smpboot.c | 58 ++++++++++++++++++-----------------------------
1 file changed, 22 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 85bd6aa..b9aaa39 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -614,34 +614,22 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
apic_icr_write(APIC_INT_LEVELTRIG | APIC_INT_ASSERT | APIC_DM_INIT,
phys_apicid);

- if (!cpu_has_x2apic) {
- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();

- mdelay(init_udelay);
+ mdelay(init_udelay);

- pr_debug("Deasserting INIT\n");
+ pr_debug("Deasserting INIT\n");

- /* Target chip */
- /* Send IPI */
- apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);
+ /* Target chip */
+ /* Send IPI */
+ apic_icr_write(APIC_INT_LEVELTRIG | APIC_DM_INIT, phys_apicid);

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();

- mb();
- atomic_set(&init_deasserted, 1);
- } else if (tboot_enabled()) {
- /*
- * With tboot AP is actually spinning in a mini-guest before
- * receiving INIT. Upon receiving INIT ipi, AP need time to
- * VMExit, update VMCS to tracking SIPIs and VMResume.
- *
- * While AP is in root mode handling the INIT the CPU will drop
- * any SIPIs
- */
- udelay(10);
- }
+ mb();
+ atomic_set(&init_deasserted, 1);

/*
* Should we send STARTUP IPIs ?
@@ -683,22 +671,20 @@ wakeup_secondary_cpu_via_init(int phys_apicid, unsigned long start_eip)
apic_icr_write(APIC_DM_STARTUP | (start_eip >> 12),
phys_apicid);

- if (!cpu_has_x2apic) {
- /*
- * Give the other CPU some time to accept the IPI.
- */
- udelay(300);
+ /*
+ * Give the other CPU some time to accept the IPI.
+ */
+ udelay(300);

- pr_debug("Startup point 1\n");
+ pr_debug("Startup point 1\n");

- pr_debug("Waiting for send to finish...\n");
- send_status = safe_apic_wait_icr_idle();
+ pr_debug("Waiting for send to finish...\n");
+ send_status = safe_apic_wait_icr_idle();

- /*
- * Give the other CPU some time to accept the IPI.
- */
- udelay(200);
- }
+ /*
+ * Give the other CPU some time to accept the IPI.
+ */
+ udelay(200);

if (maxlvt > 3) /* Due to the Pentium erratum 3AP. */
apic_write(APIC_ESR, 0);

2015-05-14 06:37:01

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

> [ 0.404369] x86: Booting SMP configuration:
...
> [ 2.737884] x86: Booted up 4 nodes, 120 CPUs
> [ 2.743758] smpboot: Total of 120 processors activated (671097.18 BogoMIPS)
>
> (2.743758-0.404369) = 2.339389 for all 119 processors
> /119 = .01965873109243697478 - lets call it 19ms each

For the record, the same (bare metal) box running latest tip boots
10ms/processor quicker
than upstream Linux, as expected. So this 120 processor box now
boots 1.19 seconds faster, in total.

[ 0.415969] x86: Booting SMP configuration:
...
[ 1.553658] x86: Booted up 4 nodes, 120 CPUs
[ 1.559173] smpboot: Total of 120 processors activated (671182.14 BogoMIPS)

1.553658-0.415969 = 1.137689 - seconds to bring 119 processors on-line.
./119 = .00956041176470588235 -- 9.5ms per processor, down from 19.

BTW. this time can be reduced by 7% (113 ms) by deleting announce_cpu():

[ 1.445815] x86: Booted up 4 nodes, 120 CPUs

cheers,
Len Brown, Intel Open Source Technology Center

2015-05-14 06:44:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Len Brown <[email protected]> wrote:

> > [ 0.404369] x86: Booting SMP configuration:
> ...
> > [ 2.737884] x86: Booted up 4 nodes, 120 CPUs
> > [ 2.743758] smpboot: Total of 120 processors activated (671097.18 BogoMIPS)
> >
> > (2.743758-0.404369) = 2.339389 for all 119 processors
> > /119 = .01965873109243697478 - lets call it 19ms each
>
> For the record, the same (bare metal) box running latest tip boots
> 10ms/processor quicker
> than upstream Linux, as expected. So this 120 processor box now
> boots 1.19 seconds faster, in total.
>
> [ 0.415969] x86: Booting SMP configuration:
> ...
> [ 1.553658] x86: Booted up 4 nodes, 120 CPUs
> [ 1.559173] smpboot: Total of 120 processors activated (671182.14 BogoMIPS)
>
> 1.553658-0.415969 = 1.137689 - seconds to bring 119 processors on-line.
> ./119 = .00956041176470588235 -- 9.5ms per processor, down from 19.

Ok. I think we should be able to further speed that up.

> BTW. this time can be reduced by 7% (113 ms) by deleting
> announce_cpu():
>
> [ 1.445815] x86: Booted up 4 nodes, 120 CPUs

so that kind of info looks pretty useful, especially when there's
hangs/failures. I'm wondering what takes 113 msecs to print 120 CPUs -
that's about 1 msec per a few chars of printk produced, seems
excessive. Do you have any idea what's going on there? Does your
system print to a serial console perhaps?

Thanks,

Ingo

2015-05-14 07:18:45

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

On Thu, May 14, 2015 at 2:36 AM, Len Brown <[email protected]> wrote:

>> [ 2.737884] x86: Booted up 4 nodes, 120 CPUs

> For the record, the same (bare metal) box running latest tip boots
> 10ms/processor quicker

> [ 1.553658] x86: Booted up 4 nodes, 120 CPUs

> BTW. this time can be reduced by 7% (113 ms) by deleting announce_cpu():
>
> [ 1.445815] x86: Booted up 4 nodes, 120 CPUs

I see that the x2apic optimization has been reverted from TIP.
So just for grins, I booted the same box with all the udelays in
smpboot.c removed,
and it speed up boot by only 12ms (0.8%) total:

[ 1.432946] x86: Booted up 4 nodes, 120 CPUs

cheers,
-Len Brown, Intel Open Source Technology Center

2015-05-14 14:26:27

by Jan H. Schönherr

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

On 05/14/2015 09:18 AM, Len Brown wrote:
> On Thu, May 14, 2015 at 2:36 AM, Len Brown <[email protected]> wrote:
>
>>> [ 2.737884] x86: Booted up 4 nodes, 120 CPUs
>
>> For the record, the same (bare metal) box running latest tip boots
>> 10ms/processor quicker
>
>> [ 1.553658] x86: Booted up 4 nodes, 120 CPUs
>
>> BTW. this time can be reduced by 7% (113 ms) by deleting announce_cpu():
>>
>> [ 1.445815] x86: Booted up 4 nodes, 120 CPUs
>
> I see that the x2apic optimization has been reverted from TIP.

Gone as fast as it came. I learned that having an x2apic is different
from using code written for it. (And I'll try not to repeat something
like that.)

> So just for grins, I booted the same box with all the udelays in
> smpboot.c removed,
> and it speed up boot by only 12ms (0.8%) total:
>
> [ 1.432946] x86: Booted up 4 nodes, 120 CPUs

Is that on top of deleting announce_cpu() or instead? Because I would
have expected more than 12ms improvement just by summing up the
udelay()s.

Ingo, do you want an updated version of the original patch, which
takes care not get stuck, when the INIT deassertion is skipped,
or do you prefer to address delays "one by one" as you wrote elsewhere?

Regards
Jan

2015-05-14 17:57:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* "Jan H. Sch?nherr" <[email protected]> wrote:

> Ingo, do you want an updated version of the original patch, which
> takes care not get stuck, when the INIT deassertion is skipped, or
> do you prefer to address delays "one by one" as you wrote elsewhere?

So I'm not against improving this code at all, but instead of this
hard to follow mixing of old and new code, I'd find the following
approach cleaner and more acceptable: create a 'modern' and a 'legacy'
SMP-bootup variant function, and do a clean separation based on the
CPU model cutoff condition used by Len's patches:

/* if modern processor, use no delay */
if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 6)) ||
((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0xF)))
init_udelay = 0;

Then in the modern variant we can become even more aggressive and
remove these kinds of delays as well:

udelay(300);
udelay(200);

plus I'd suggest making these poll loops in smpboot.c loops narrower:

udelay(100);

udelay(100);

because every iteration is 0.1 msecs - if we hit these poll loops then
it adds up with a few dozen CPUs.

Thanks,

Ingo

2015-05-16 08:46:24

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

On Thu, May 14, 2015 at 2:44 AM, Ingo Molnar <[email protected]> wrote:

>
>> BTW. this time can be reduced by 7% (113 ms) by deleting
>> announce_cpu():
>>
>> [ 1.445815] x86: Booted up 4 nodes, 120 CPUs
>
> so that kind of info looks pretty useful, especially when there's
> hangs/failures.

I think the messages we print on failure are useful.
I think the success case should be a 1-line summary.

> I'm wondering what takes 113 msecs to print 120 CPUs -
> that's about 1 msec per a few chars of printk produced, seems
> excessive. Do you have any idea what's going on there? Does your
> system print to a serial console perhaps?

Yes, serial console -- that server is actually much
closer to you than it is to me, it is in Finland:-)

I should benchmark it, because 115200 should be faster...

cheers,
Len Brown, Intel Open Source Technology Center

2015-05-16 09:08:07

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen

On Thu, May 14, 2015 at 1:57 PM, Ingo Molnar <[email protected]> wrote:
>
> * "Jan H. Schönherr" <[email protected]> wrote:
>
>> Ingo, do you want an updated version of the original patch, which
>> takes care not get stuck, when the INIT deassertion is skipped, or
>> do you prefer to address delays "one by one" as you wrote elsewhere?
>
> So I'm not against improving this code at all, but instead of this
> hard to follow mixing of old and new code, I'd find the following
> approach cleaner and more acceptable: create a 'modern' and a 'legacy'
> SMP-bootup variant function, and do a clean separation based on the
> CPU model cutoff condition used by Len's patches:
>
> /* if modern processor, use no delay */
> if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 6)) ||
> ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0xF)))
> init_udelay = 0;
>
> Then in the modern variant we can become even more aggressive and
> remove these kinds of delays as well:

Not sure it is worth two versions, since this is not where the big
time is spent.
See below.

>
> udelay(300);

FWIW, MPS 1.4 suggests this should be 200, not 300.

> udelay(200);
>
> plus I'd suggest making these poll loops in smpboot.c loops narrower:
>
> udelay(100);

FWIW, on my dekstop, this one executed 17 times (1700usec)
This is the time for the remote CPU to wake and get to cpu_init().
Why is it a benefit to have any udelay() before invoking schedule()?

> udelay(100);

This one didn't execute at all. Indeed, I don't understand why it exists,
per question above.

/*
* Wait till AP completes initial initialization
*/
while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
/*
* Allow other tasks to run while we wait for the
* AP to come online. This also gives a chance
* for the MTRR work(triggered by the AP coming online)
* to be completed in the stop machine context.
*/
udelay(100);
schedule();
}

So, the latest TIP has the INIT udelay(10,000) removed,
but cpu_up() still takes nearly 19,000 usec on a HSW dekstop.

A quick scan of the ftrace shows some high runners:

18949.45 us cpu_up()
2450.580 us notifier_call_chain
102.751 us thermal_throttle_cpu_callback
289.313 us dpm_sysfs_add
1019.594 us msr_class_cpu_callback
...
8455.462 us native_cpu_up()
500.000 us = udelay(300) + udelay(200) Startup IPI
500.000 us = udelay(300) + udelay(200) Startup IPI
1700.000 us = 17 x udelay(100) waiting for AP in initialized_map
2004.172 us check_tsc_warp()

7977.799 us cpu_notify()
1588.108 us cpuset_cpu_active
3043.955 us cacheinfo_cpu_callback
1146.234 us mce_cpu_callback
541.105 us cpufreq_cpu_callback
213.685 us coretemp_cpu_callback


cacheinfo_cpu_callback() time appears to be spent creating a bunch
of sysfs nodes, which is apparetly an expensive operation.

check_tsc_warp() is hard-coded to take 2ms.
I don't know if 2ms is a magic number or if shorter has same value.
It seems a bit sad to do this serially for every CPU at boot,
when we could do all the CPUs in parallel after they are on-line.
Perhaps this should be invoked only for boot-time and hot-add time.
It shouldn't be needed at all for soft online and resume.

Startup IPI delays.
MPS 1.4 actually says 200+200, not 300+200, as Linux reads.
I don't know where the 300 came from, maybe it was a typo?

msr_class_cpu_callback -- making device nodes is not fast.

I don't know if anything can be done for the 1700us wait
for the remote processor to mark itself initialized.
That is the 1st thing it does when it enters cpu_init().

On the xeon, I had see x86_init_rdrand() take 781usec --
dunno why that isn't seen on this box. I'll look at that box again next week.

cheers,
Len Brown, Intel Open Source Technology Center

2015-05-17 05:02:33

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Len Brown <[email protected]> wrote:

> On Thu, May 14, 2015 at 2:44 AM, Ingo Molnar <[email protected]> wrote:
>
> >
> >> BTW. this time can be reduced by 7% (113 ms) by deleting
> >> announce_cpu():
> >>
> >> [ 1.445815] x86: Booted up 4 nodes, 120 CPUs
> >
> > so that kind of info looks pretty useful, especially when there's
> > hangs/failures.
>
> I think the messages we print on failure are useful.
> I think the success case should be a 1-line summary.

But we usually don't know a failure until it happens, and then people
often don't know which quirky debug option to turn on before sending a
capture of the failure.

It also pretty compressed and looks kind of cool, especially with
larger CPU counts. Would love to see a 6K CPUs system boot up ;-)

> > I'm wondering what takes 113 msecs to print 120 CPUs - that's
> > about 1 msec per a few chars of printk produced, seems excessive.
> > Do you have any idea what's going on there? Does your system print
> > to a serial console perhaps?
>
> Yes, serial console -- that server is actually much
> closer to you than it is to me, it is in Finland:-)

LOL ;-)

> I should benchmark it, because 115200 should be faster...

So 115200 baud == 14400 bytes/sec == 14.4 bytes/msec == 0.07 msecs/byte

So with 120 CPUs we print about 5-6 chars per CPU, which is 6*120==720
bytes, which should take about 50 msecs.

So serial explains about half of the observed overhead.

Thanks,

Ingo

2015-05-17 05:27:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: skip delays during SMP initialization similar to Xen


* Len Brown <[email protected]> wrote:

> On Thu, May 14, 2015 at 1:57 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * "Jan H. Sch?nherr" <[email protected]> wrote:
> >
> >> Ingo, do you want an updated version of the original patch, which
> >> takes care not get stuck, when the INIT deassertion is skipped, or
> >> do you prefer to address delays "one by one" as you wrote elsewhere?
> >
> > So I'm not against improving this code at all, but instead of this
> > hard to follow mixing of old and new code, I'd find the following
> > approach cleaner and more acceptable: create a 'modern' and a 'legacy'
> > SMP-bootup variant function, and do a clean separation based on the
> > CPU model cutoff condition used by Len's patches:
> >
> > /* if modern processor, use no delay */
> > if (((boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) && (boot_cpu_data.x86 == 6)) ||
> > ((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) && (boot_cpu_data.x86 >= 0xF)))
> > init_udelay = 0;
> >
> > Then in the modern variant we can become even more aggressive and
> > remove these kinds of delays as well:
>
> Not sure it is worth two versions, since this is not where the big
> time is spent.

There aren't many low hanging fruits left, so reducing boot time is
going to be 'lots of hard work', and part of it is to make the SMP
bringup path easier to review and tweak.

That having said, we could pass in a delay scaling parameter, which
could be zero for modern hardware. So something like:

udelay(100*scale);

would automatically turn into a 0 delay thing on modern hw. On old
hardware it could be 1.

OTOH there are other legacies here as well, so maybe a split version
would still be the most robust approach, so that we can have a
completely separate legacy path we don't change all that often.

OTOH that would also mean 'we don't test it all that often', which
could result in more breakages than sharing it with modern bootup
code.

So I'm a bit torn about that.

> See below.
>
> >
> > udelay(300);
>
> FWIW, MPS 1.4 suggests this should be 200, not 300.
>
> > udelay(200);
> >
> > plus I'd suggest making these poll loops in smpboot.c loops narrower:
> >
> > udelay(100);
>
> FWIW, on my dekstop, this one executed 17 times (1700usec)
> This is the time for the remote CPU to wake and get to cpu_init().
> Why is it a benefit to have any udelay() before invoking schedule()?

See further below.

>
> > udelay(100);
>
> This one didn't execute at all. Indeed, I don't understand why it exists,
> per question above.
>
> /*
> * Wait till AP completes initial initialization
> */
> while (!cpumask_test_cpu(cpu, cpu_callin_mask)) {
> /*
> * Allow other tasks to run while we wait for the
> * AP to come online. This also gives a chance
> * for the MTRR work(triggered by the AP coming online)
> * to be completed in the stop machine context.
> */
> udelay(100);
> schedule();
> }

So this is essentially a poor man's sched_yield().

I'd really love to see this moved to a
wait_for_completion_timeout()/complete() loop instead. For that it
would have to move out from under the irqs-off section on the boot CPU
(so that timer irqs can come in), but that should be OK: the fragile
part is on the AP CPU, not on the BP, I think.

That would allow us, on truly modern hardware, to implement parallel
bringup of all 120 CPUs, i.e. to first do a loop of triggering bootup
on all the secondary CPUs, then collect them in a completion loop.

I'd add an easy debug boot option as well to serialize it all again,
in case something breaks.

This would help parallelize much of the initialization overhead in the
secondary CPUs:

> So, the latest TIP has the INIT udelay(10,000) removed,
> but cpu_up() still takes nearly 19,000 usec on a HSW dekstop.
>
> A quick scan of the ftrace shows some high runners:
>
> 18949.45 us cpu_up()
> 2450.580 us notifier_call_chain
> 102.751 us thermal_throttle_cpu_callback
> 289.313 us dpm_sysfs_add
> 1019.594 us msr_class_cpu_callback
> ...
> 8455.462 us native_cpu_up()
> 500.000 us = udelay(300) + udelay(200) Startup IPI
> 500.000 us = udelay(300) + udelay(200) Startup IPI
> 1700.000 us = 17 x udelay(100) waiting for AP in initialized_map
> 2004.172 us check_tsc_warp()
>
> 7977.799 us cpu_notify()
> 1588.108 us cpuset_cpu_active
> 3043.955 us cacheinfo_cpu_callback
> 1146.234 us mce_cpu_callback
> 541.105 us cpufreq_cpu_callback
> 213.685 us coretemp_cpu_callback
>
>
> cacheinfo_cpu_callback() time appears to be spent creating a bunch
> of sysfs nodes, which is apparetly an expensive operation.
>
> check_tsc_warp() is hard-coded to take 2ms. I don't know if 2ms is a
> magic number or if shorter has same value. It seems a bit sad to do
> this serially for every CPU at boot, when we could do all the CPUs
> in parallel after they are on-line. Perhaps this should be invoked
> only for boot-time and hot-add time. It shouldn't be needed at all
> for soft online and resume.

So how come the TSC isn't X86_FEATURE_CONSTANT_TSC? That should skip
the TSC sync-test.

> Startup IPI delays.
> MPS 1.4 actually says 200+200, not 300+200, as Linux reads.
> I don't know where the 300 came from, maybe it was a typo?

No, I think it was simple paranoia - our udelay() used to be imprecise
(and still can be).

I'd not touch the delays for legacies, I'd create a 'modern' bootup
path with aggressive optimizations, that new hardware should try hard
to hit. If it causes problems on some hw we can quirk those out. If
the quirks look too widespread, we can make things less aggressive.

> msr_class_cpu_callback -- making device nodes is not fast.
>
> I don't know if anything can be done for the 1700us wait
> for the remote processor to mark itself initialized.
> That is the 1st thing it does when it enters cpu_init().

So that 1.7 msecs delay is the firmware in essence?

Thanks,

Ingo