From: Noam Camus <noamca@mellanox.com>
To: Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <peterz@infradead.org>
CC: Vineet Gupta <Vineet.Gupta1@synopsys.com>,
        Chris Metcalf <cmetcalf@mellanox.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        "anna-maria@linutronix.de" <anna-maria@linutronix.de>,
        Eitan Rabin <rabin@mellanox.com>,
        "Fu, Zhonghui" <zhonghui.fu@intel.com>
Subject: RE: Reduce Linux boot time on Large scale system
Thread-Topic: Reduce Linux boot time on Large scale system
Thread-Index: AdKtYe7annHBTK0DRhy02hYAU5p8dQLf5NsAAAJloQAAAgKLEA==
Date: Wed, 19 Apr 2017 10:08:45 +0000
Message-ID: <DB5PR05MB1638D630B58C2F0A3522696CAA180@DB5PR05MB1638.eurprd05.prod.outlook.com>
References: <AM4PR05MB1636E0752C76AA42A777F81AAA0B0@AM4PR05MB1636.eurprd05.prod.outlook.com>
 <20170419074944.lacscblx2ulhfcd3@hirez.programming.kicks-ass.net>
 <alpine.DEB.2.20.1704191032100.1829@nanos>
In-Reply-To: <alpine.DEB.2.20.1704191032100.1829@nanos>
Accept-Language: en-US
Content-Language: en-US
spamdiagnosticoutput: 1:99
spamdiagnosticmetadata: NSPM
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
X-MS-Exchange-CrossTenant-originalarrivaltime: 19 Apr 2017 10:08:45.0770
 (UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: a652971c-7d2e-4d9b-a6a4-d149256f461b
X-MS-Exchange-Transport-CrossTenantHeadersStamped: DB5PR05MB1733
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Transfer-Encoding: 8bit
X-MIME-Autoconverted: from quoted-printable to 8bit by mail.home.local id v3JA8v4w031614
Content-Length: 3640
Lines: 88

>From: Thomas Gleixner [mailto:tglx@linutronix.de] 
>Sent: Wednesday, April 19, 2017 11:58 AM
>On Wed, 19 Apr 2017, Peter Zijlstra wrote:
>> On Tue, Apr 04, 2017 at 04:39:06PM +0000, Noam Camus wrote:
>> > Hi Peter & Vineet
>> > 
>> > I wish to reduce boot time of my platform ARC/plat-eznps (4K CPUs).
>> > My analysis is that most of the boot time is spent in cpu_up() for all 
>> > CPUs. Measurements are about 66 ms per CPU, totalling over 4 minutes (I have 800 MHz cores).
>> > 
>> > I see that smp_init() just iterates over all present CPUs one by one.
>> > Has there been an attempt to optimize this with some parallel work?
>> > 
>> > Are you aware of any method or trick that would help me reduce boot time?
>> > Any suggestion on how this can be done?
>> 
>> So attempts have been made in the past but Thomas shot them down for 
>> being gross hacks (they were).
>> 
>> But Thomas has now (mostly) completed rewriting the CPU hotplug 
>> machinery and he has at some point outlined means of achieving what 
>> you're after.
>> 
>> I've added him to Cc so he can correct me where I'm wrong, as I've not 
>> looked into this in much detail after he mucked up all I knew about 
>> CPU hotplug.
>> 
>> Since each CPU is now responsible for its own bootstrap, we can now 
>> kick all the CPUs awake without waiting for them to complete the 
>> online stage.
>> 
>> There might however be code that assumes CPUs come up one at a time, 
>> so you'll need to audit for that. It's not going to be a trivial thing.

>There are a couple of things to consider.

>First of all we should make the whole 'kick CPU into life' and surrounding magic generic. Every arch has its own handshake mechanism.

>That would look like this:

>Step BP                                      AP
>0-9  [preparatory steps]

>10   [kick cpu into life (arch callback)]
>11                                           [Do initial arch bringup then
>                                              call into a generic function]
>12   [handshake (generic)]                   [handshake (generic)]
>13   [more arch specific magic]              [more arch specific magic]

>14-20                                        [CPU starting]
>                                             [CPU goes online]

>40                                           [CPU active, hotplug done]

>So the first step in parallelizing this would be:

>   for_each_present_cpu(cpu)
>   	cpu_up(target_state = 10);

>i.e. make the allocations and whatever preparatory work needs to be done and kick the CPU into life. The target CPU would initialize the low level stuff and then call into a generic function, which does the generic initialization and then waits for the handshake.

>So the next thing would be:

>   for_each_present_cpu(cpu)
> 	cpu_up(target_state = 40);

>This last step has to be single threaded for now because almost all facilities that use CPU hotplug rely on the current serialization. There are also code paths which use get_online_cpus() or cpu_hotplug_disable() to prevent interaction with CPU hotplug.

>The hotplug machinery is already designed so that after the handshake (#12/13) a plugged CPU can bring itself up completely alone, but due to the serialization expectations all over the place this won't work today.

>To make it work, you have to go through every single instance of CPU hotplug callback users and every single site which prevents hotplug via get_online_cpus() or cpu_hotplug_disable() and audit them for concurrency issues and fix them up.

>There might also be interaction required with the state machine, i.e. stop the state progress on a self plugging CPU between two steps to make serialization work.
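
To check that I read the two-pass proposal correctly, here is a toy userspace model of it. It is only a sketch: the model_* names, the 8-CPU size and the reuse of step numbers 10 and 40 as state values are my own assumptions for illustration, not the kernel's cpuhp API.

```c
#include <assert.h>

/* Step numbers borrowed from the step table quoted above; these are NOT
 * the kernel's real enum cpuhp_state values. */
enum model_state { MODEL_OFFLINE = 0, MODEL_KICKED = 10, MODEL_ACTIVE = 40 };

#define MODEL_NR_CPUS 8	/* hypothetical size, small on purpose */

static enum model_state model_cpu_state[MODEL_NR_CPUS];

/* Advance one modelled CPU to 'target'; a CPU never moves backwards. */
static void model_cpu_up(int cpu, enum model_state target)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (model_cpu_state[cpu] < target)
		model_cpu_state[cpu] = target;
}

/* Count the modelled CPUs that currently sit in 'state'. */
static int model_count(enum model_state state)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (model_cpu_state[cpu] == state)
			n++;
	return n;
}

/* Pass 1: prepare and kick every secondary CPU without waiting for any
 * of them to go online. CPU 0 stands in for the boot CPU. */
static void model_kick_all(void)
{
	int cpu;

	model_cpu_state[0] = MODEL_ACTIVE;
	for (cpu = 1; cpu < MODEL_NR_CPUS; cpu++)
		model_cpu_up(cpu, MODEL_KICKED);
}

/* Pass 2: finish the bring-up one CPU at a time (still serialized). */
static void model_online_all(void)
{
	int cpu;

	for (cpu = 1; cpu < MODEL_NR_CPUS; cpu++)
		model_cpu_up(cpu, MODEL_ACTIVE);
}
```

Calling model_kick_all() and then model_online_all() leaves every modelled CPU at 40. The only point is the shape: pass 1 does not wait on any CPU, while pass 2 stays serialized, as described for now.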

What would be a good base to start all of the above on?
Would a formal release tag such as v4.8 be good enough, or do I need to base this on some other specific HEAD (or tag)?

Thanks,
Noam