2017-04-19 07:49:52

by Peter Zijlstra

Subject: Re: Reduce Linux boot time on Large scale system

On Tue, Apr 04, 2017 at 04:39:06PM +0000, Noam Camus wrote:
> Hi Peter & Vineet
>
> I wish to reduce the boot time of my platform, ARC/plat-eznps (4K CPUs).
> My analysis is that most of the boot time is spent in cpu_up() for all CPUs.
> Measurements show about 66 ms per CPU, over 4 minutes in total (I have
> 800 MHz cores).
>
> I see that smp_init() just iterates over all present CPUs one by one.
> I wish to know if there was an attempt to optimize this with some parallel work?
>
> Are you aware of some method / trick that will help me to reduce boot time?
> Any suggestion how this can be done?

So attempts have been made in the past but Thomas shot them down for
being gross hacks (they were).

But Thomas has now (mostly) completed rewriting the CPU hotplug
machinery and he has at some point outlined means of achieving what
you're after.

I've added him to Cc so he can correct me where I'm wrong, as I've not
looked into this in much detail after he mucked up all I knew about CPU
hotplug.

Since each CPU is now responsible for its own bootstrap, we can now kick
all the CPUs awake without waiting for them to complete the online
stage.

There might however be code that assumes CPUs come up one at a time, so
you'll need to audit for that. It's not going to be a trivial thing.
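
As a quick sanity check on the figures quoted at the top of this mail, the
serial cost can be written out as a one-line calculation. This is only a
sketch: the function name is hypothetical, and reading "4K CPUs" as 4096
is an assumption.

```c
#include <assert.h>

/*
 * Estimated time, in milliseconds, to bring up CPUs strictly one after
 * another. Both arguments are illustrative inputs, not measured constants.
 */
long serial_bringup_ms(long ncpus, long per_cpu_ms)
{
	assert(ncpus >= 0 && per_cpu_ms >= 0);
	return ncpus * per_cpu_ms;
}
```

With 4096 CPUs at 66 ms each, serial_bringup_ms(4096, 66) gives 270336 ms,
roughly 4.5 minutes, which is consistent with the "over 4 minutes" reported.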


2017-04-19 08:58:32

by Thomas Gleixner

Subject: Re: Reduce Linux boot time on Large scale system

On Wed, 19 Apr 2017, Peter Zijlstra wrote:
> On Tue, Apr 04, 2017 at 04:39:06PM +0000, Noam Camus wrote:
> > Hi Peter & Vineet
> >
> > I wish to reduce the boot time of my platform, ARC/plat-eznps (4K CPUs).
> > My analysis is that most of the boot time is spent in cpu_up() for all CPUs.
> > Measurements show about 66 ms per CPU, over 4 minutes in total (I have
> > 800 MHz cores).
> >
> > I see that smp_init() just iterates over all present CPUs one by one.
> > I wish to know if there was an attempt to optimize this with some parallel work?
> >
> > Are you aware of some method / trick that will help me to reduce boot time?
> > Any suggestion how this can be done?
>
> So attempts have been made in the past but Thomas shot them down for
> being gross hacks (they were).
>
> But Thomas has now (mostly) completed rewriting the CPU hotplug
> machinery and he has at some point outlined means of achieving what
> you're after.
>
> I've added him to Cc so he can correct me where I'm wrong, as I've not
> looked into this in much detail after he mucked up all I knew about CPU
> hotplug.
>
> Since each CPU is now responsible for its own bootstrap, we can now kick
> all the CPUs awake without waiting for them to complete the online
> stage.
>
> There might however be code that assumes CPUs come up one at a time, so
> you'll need to audit for that. It's not going to be a trivial thing.

There are a couple of things to consider.

First of all we should make the whole 'kick CPU into life' and surrounding
magic generic. Every arch has its own handshake mechanism.

That would look like this:

Step    BP                                      AP
0-9     [preparatory steps]

10      [kick cpu into life (arch callback)]
11                                              [Do initial arch bringup then
                                                 call into a generic function ]
12      [handshake (generic)]                   [handshake (generic)]
13      [more arch specific magic]              [more arch specific magic]

14-20                                           [ CPU starting ]
                                                [ CPU goes online ]

40                                              [ CPU active, hotplug done ]

So the first step in parallelizing this would be:

    for_each_present_cpu(cpu)
        cpu_up(target_state = 10);

i.e. make the allocations and whatever preparatory work needs to be done
and kick the CPU into life. The target CPU would initialize the low level
stuff and then call into a generic function, which does the generic
initialization and then waits for the handshake.

So the next thing would be:

    for_each_present_cpu(cpu)
        cpu_up(target_state = 40);

This last step has to be single threaded for now because almost all
facilities that use CPU hotplug rely on the current serialization. There
are also code paths which use get_online_cpus() or cpu_hotplug_disable() to
prevent interaction with CPU hotplug.
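
The two passes can be sketched as a small userspace simulation. This only
illustrates the ordering: all sim_* names are hypothetical, the step
numbers 10 and 40 come from the table above rather than from the kernel's
real hotplug state enum, and no actual concurrency is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step numbers, copied from the table in this mail. */
enum sim_state {
	SIM_OFFLINE = 0,
	SIM_KICKED  = 10,	/* prepared and kicked into life by the BP */
	SIM_ACTIVE  = 40,	/* CPU active, hotplug done */
};

struct sim_cpu {
	enum sim_state state;
};

/* Move one simulated CPU forward to target; never moves it backwards. */
void sim_cpu_up(struct sim_cpu *cpu, enum sim_state target)
{
	if (cpu->state < target)
		cpu->state = target;
}

/*
 * Pass 1: prepare and kick every CPU without waiting for it to come
 * online. On real hardware the APs would run their low level init
 * concurrently from this point on.
 */
void sim_kick_all(struct sim_cpu *cpus, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		sim_cpu_up(&cpus[i], SIM_KICKED);
}

/* Pass 2: still serialized, one CPU at a time, up to the final step. */
void sim_online_all(struct sim_cpu *cpus, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		assert(cpus[i].state >= SIM_KICKED);	/* pass 1 must have run */
		sim_cpu_up(&cpus[i], SIM_ACTIVE);
	}
}
```

The point of the split is that the per-CPU cost of the first pass can
overlap across CPUs, while only the second pass remains strictly serial.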

The hotplug machinery is already designed so that after the handshake
(#12/13) a plugged CPU can bring itself up completely alone, but due to the
serialization expectations all over the place this won't work today.

To make it work, you have to go through every single instance of CPU
hotplug callback users and every single site which prevents hotplug via
get_online_cpus() or cpu_hotplug_disable() and audit them for concurrency
issues and fix them up.

There might also be interaction required with the state machine, i.e.
stopping the state progress on a self-plugging CPU between two steps to
make serialization work.
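
Stopping a self-plugging CPU between two steps could look roughly like the
following userspace sketch. Everything here is hypothetical (the struct,
its fields and the function name); it only shows the idea of a hold point
that the controlling CPU has to release, not how the kernel state machine
would implement it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct self_cpu {
	int step;	/* current step, numbered as in the table above */
	int hold_at;	/* step at which progress stops until released */
	bool released;	/* set by the controlling CPU when it is our turn */
};

/*
 * Advance as far as allowed towards target and return the step reached.
 * The CPU parks at hold_at until released is set, so the steps after the
 * hold point can still be serialized across CPUs.
 */
int self_advance(struct self_cpu *cpu, int target)
{
	assert(cpu != NULL);

	while (cpu->step < target) {
		if (cpu->step == cpu->hold_at && !cpu->released)
			break;	/* parked between two steps */
		cpu->step++;
	}
	return cpu->step;
}
```

A CPU starting at step 13 with hold_at = 20 runs on its own up to step 20,
stays there, and only continues to 40 once released is set.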

Thanks,

tglx


2017-04-19 10:08:51

by Noam Camus

Subject: RE: Reduce Linux boot time on Large scale system

>From: Thomas Gleixner [mailto:[email protected]]
>Sent: Wednesday, April 19, 2017 11:58 AM
>On Wed, 19 Apr 2017, Peter Zijlstra wrote:
>> On Tue, Apr 04, 2017 at 04:39:06PM +0000, Noam Camus wrote:
>> > Hi Peter & Vineet
>> >
>> > I wish to reduce the boot time of my platform, ARC/plat-eznps (4K CPUs).
>> > My analysis is that most of the boot time is spent in cpu_up() for all
>> > CPUs. Measurements show about 66 ms per CPU, over 4 minutes in total (I have 800 MHz cores).
>> >
>> > I see that smp_init() just iterates over all present CPUs one by one.
>> > I wish to know if there was an attempt to optimize this with some parallel work?
>> >
>> > Are you aware of some method / trick that will help me to reduce boot time?
>> > Any suggestion how this can be done?
>>
>> So attempts have been made in the past but Thomas shot them down for
>> being gross hacks (they were).
>>
>> But Thomas has now (mostly) completed rewriting the CPU hotplug
>> machinery and he has at some point outlined means of achieving what
>> you're after.
>>
>> I've added him to Cc so he can correct me where I'm wrong, as I've not
>> looked into this in much detail after he mucked up all I knew about
>> CPU hotplug.
>>
>> Since each CPU is now responsible for its own bootstrap, we can now
>> kick all the CPUs awake without waiting for them to complete the
>> online stage.
>>
>> There might however be code that assumes CPUs come up one at a time,
>> so you'll need to audit for that. It's not going to be a trivial thing.

>There are a couple of things to consider.

>First of all we should make the whole 'kick CPU into life' and surrounding magic generic. Every arch has its own handshake mechanism.

>That would look like this:

>Step    BP                                      AP
>0-9     [preparatory steps]

>10      [kick cpu into life (arch callback)]
>11                                              [Do initial arch bringup then
>                                                 call into a generic function ]
>12      [handshake (generic)]                   [handshake (generic)]
>13      [more arch specific magic]              [more arch specific magic]

>14-20                                           [ CPU starting ]
>                                                [ CPU goes online ]

>40                                              [ CPU active, hotplug done ]

>So the first step in parallelizing this would be:

> for_each_present_cpu(cpu)
> cpu_up(target_state = 10);

>i.e. make the allocations and whatever preparatory work needs to be done and kick the CPU into life. The target CPU would initialize the low level stuff and then call into a generic function, which does the generic initialization and then waits for the handshake.

>So the next thing would be:

> for_each_present_cpu(cpu)
> cpu_up(target_state = 40);

>This last step has to be single threaded for now because almost all facilities that use CPU hotplug rely on the current serialization. There are also code paths which use get_online_cpus() or cpu_hotplug_disable() to prevent interaction with CPU hotplug.

>The hotplug machinery is already designed so that after the handshake (#12/13) a plugged CPU can bring itself up completely alone, but due to the serialization expectations all over the place this won't work today.

>To make it work, you have to go through every single instance of CPU hotplug callback users and every single site which prevents hotplug via
>get_online_cpus() or cpu_hotplug_disable() and audit them for concurrency issues and fix them up.

>There might also be interaction required with the state machine, i.e. stop the state progress on a self plugging CPU between two steps to make serialization work.

What would be a good base to start on for all of the above?
Would a formal release such as the v4.8 tag be good enough, or do I need to base on some other specific HEAD (or tag)?

Thanks,
Noam


2017-04-19 10:35:21

by Thomas Gleixner

Subject: RE: Reduce Linux boot time on Large scale system

On Wed, 19 Apr 2017, Noam Camus wrote:

Please fix your mail client to do proper line wraps around 78 chars and
proper citation. Also please trim your replies. There is no point in
scrolling through 2 pages of cited mail just to get to your question.

> > From: Thomas Gleixner [mailto:[email protected]]
> > There might also be interaction required with the state machine,
> > i.e. stop the state progress on a self plugging CPU between two steps to
> > make serialization work.
>
> What would be a good base to start on for all of the above? Would a formal
> release such as the v4.8 tag be good enough, or do I need to base on some
> other specific HEAD (or tag)?

Errm, no. Development happens on Linus' current tree or on the development
tree/branch of the particular subsystem. In this case that is

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git smp/hotplug

which is empty right now, but about to be filled up again. Start against
Linus' HEAD for now.

Thanks,

tglx