Date: Thu, 11 Apr 2013 09:23:01 -0500
From: Russ Anderson
To: "Srivatsa S. Bhat"
Cc: Paul Mackerras, Linus Torvalds, Ingo Molnar, Robin Holt,
    "H. Peter Anvin", Andrew Morton, Linux Kernel Mailing List,
    Shawn Guo, Thomas Gleixner, Ingo Molnar, the arch/x86 maintainers,
    "Paul E. McKenney", Tejun Heo, Oleg Nesterov, Lai Jiangshan,
    Michel Lespinasse, "rusty@rustcorp.com.au", Peter Zijlstra
Subject: Re: Bulk CPU Hotplug (Was Re: [PATCH] Do not force shutdown/reboot to boot cpu.)
Message-ID: <20130411142301.GB27990@sgi.com>
In-Reply-To: <5166B05E.8010904@linux.vnet.ibm.com>

On Thu, Apr 11, 2013 at 06:15:18PM +0530, Srivatsa S. Bhat wrote:
> On 04/11/2013 11:01 AM, Paul Mackerras wrote:
> > On Wed, Apr 10, 2013 at 08:10:05AM -0700, Linus Torvalds wrote:
> >> The optimal solution would be to just speed up the
> >> disable_nonboot_cpus() code so much that it isn't an issue. That would
> >> be good for suspending too, although I guess suspend isn't a big issue
> >> if you have a thousand CPU's.
> >>
> >> Has anybody checked whether we could do the cpu_down() on non-boot
> >> CPU's in parallel?
> >> Right now we serialize the thing completely, with
> >
> > I thought Srivatsa S. Bhat had a patchset that did exactly that.
> > Srivatsa?
> >
>
> Thanks for the CC, Paul! Adding some more people to CC.
>
> Actually, my patchset was about removing stop_machine() from the CPU
> offline path.
> http://lwn.net/Articles/538819/

I certainly agree with the intent.

> And here is the performance improvement I had measured in the version
> prior to that:
> http://article.gmane.org/gmane.linux.kernel/1435249
>
> I'm planning to revive this patchset after the 3.10 merge window closes,
> because it depends on doing a tree-wide sweep, and I think it's a little
> late to do it in time for the upcoming 3.10 merge window itself.
>
> Anyway, that's about removing stop_machine from CPU hotplug.
>
> Coming to bulk CPU hotplug, yes, I had ideas similar to what Russ suggested.
> But I believe we can do more than that.
>
> As Russ pointed out, the notifiers are not thread-safe, so calling them
> in parallel with different CPUs as arguments isn't going to work.
>
> So, first, we can convert all the CPU hotplug notifiers to take a cpumask
> instead of a single CPU. So assuming that there are 'n' notifiers in total,
> the number of function calls would become n, instead of n*1024.
> But that itself most likely won't give us much benefit over the for-loop
> that Russ has done in his patch, because it'll simply do longer processing
> in each of those 'n' notifiers, by iterating over the cpumask inside each
> notifier.

As an alternative, how about each cpu having its own notifier list?
Then one task per cpu can spin through that cpu's notifier list,
allowing them to run in parallel.

I don't know if that would be a faster solution than adding a cpumask
to the notifiers, but my guess is that it may be.
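To make the per-cpu list idea concrete, here is a rough user-space
model. It is only a sketch under assumptions, not kernel code:
NR_CPUS is set to a tiny value, and MAX_NOTIFIERS,
percpu_notifier_register(), percpu_notifier_call() and count_hit() are
made-up names for illustration. It also walks each list sequentially;
the point is just that walking one cpu's list touches no state that
belongs to another cpu, which is what would let one task per cpu run
its own list.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS       4   /* hypothetical: tiny value for illustration */
#define MAX_NOTIFIERS 8   /* hypothetical: fixed per-cpu list capacity */

/* A callback receives the event and the cpu it is being run for. */
typedef int (*notifier_fn)(int event, int cpu);

/* One private list per cpu, so no list state is shared between cpus. */
struct percpu_notifier_list {
    notifier_fn fn[MAX_NOTIFIERS];
    size_t count;
};

static struct percpu_notifier_list percpu_lists[NR_CPUS];
static int hits[NR_CPUS];   /* per-cpu counters used by count_hit() */

/* Append fn to one cpu's list. Returns 0, or -1 on bad input or a full list. */
static int percpu_notifier_register(int cpu, notifier_fn fn)
{
    struct percpu_notifier_list *list;

    if (cpu < 0 || cpu >= NR_CPUS || fn == NULL)
        return -1;
    list = &percpu_lists[cpu];
    if (list->count == MAX_NOTIFIERS)
        return -1;
    list->fn[list->count++] = fn;
    return 0;
}

/*
 * Walk only this cpu's list, in registration order, stopping at the
 * first callback that fails. Returns the number of callbacks run, or
 * the negative value returned by the failing callback.
 */
static int percpu_notifier_call(int cpu, int event)
{
    struct percpu_notifier_list *list;
    size_t i;

    if (cpu < 0 || cpu >= NR_CPUS)
        return -1;
    list = &percpu_lists[cpu];
    assert(list->count <= MAX_NOTIFIERS);
    for (i = 0; i < list->count; i++) {
        int ret = list->fn[i](event, cpu);

        if (ret < 0)
            return ret;
    }
    return (int)list->count;
}

/* Example callback: it only writes the slot of the cpu it runs for. */
static int count_hit(int event, int cpu)
{
    (void)event;
    hits[cpu]++;
    return 0;
}
```

Registering count_hit() twice on cpu 0 and once on cpu 1, then calling
percpu_notifier_call() for each cpu, leaves hits[0] at 2 and hits[1]
at 1. The callbacks still have to keep their own state per cpu, as
count_hit() does, for this to be safe to run concurrently.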
> Now comes the interesting thing:
>
> Consider a notifier chain that looks like this:
> Priority 0: A->B->C->D
>
> We can't invoke say notifier callback A simultaneously on 2 CPUs with 2
> different hotcpus as argument. *However*, since A, B, C, D all (more or less)
> belong to different subsystems, we can call A, B, C and D in parallel on
> different CPUs. They won't even serialize amongst themselves because they
> take locks (if any) of different subsystems. And since they are of same
> priority, the ordering (A after B or B after A) doesn't matter either.
>
> So with this, if we combine the idea I wrote above about giving a cpumask
> to each of these notifiers to work with, we end up with this:
>
>     CPU 0          CPU 1          CPU 2          ....
>     A(cpumask)     B(cpumask)     C(cpumask)     ....
>
> So, for example, the CPU_DOWN_PREPARE notification can be processed in parallel
> on multiple CPUs at a time, for a given cpumask! That should definitely
> give us a good speed-up.
>
> One more thing we have to note is that there are 4 notifiers for taking a
> CPU offline:
>
> CPU_DOWN_PREPARE
> CPU_DYING
> CPU_DEAD
> CPU_POST_DEAD
>
> The first can be run in parallel as mentioned above. The second is run in
> parallel in the stop_machine() phase as shown in Russ' patch. But the third
> and fourth set of notifications all end up running only on CPU0, which will
> again slow down things.

In my testing the third and fourth sets were a small part of the overall
time: less than 10%, with the cpu notifiers taking 90+% of the time. So
you may not need the added complexity, or at least fix the cpu notifier
part first.

> So I suggest taking down the 1024 CPUs in multiple phases, like a binary search.
> First, take 512 CPUs down, then 256 CPUs, then 128 CPUs etc. So at every bulk
> CPU hotplug, we have enough online CPUs to handle the notifier load, and that
> helps speed things up.
> Moreover, a handful of calls to stop_machine() is OK
> because stop_machine() takes progressively less and less time, since
> fewer CPUs are online on each iteration (and hence it reduces the
> synchronization overhead of the stop-machine phase).
>
> The only downside to this whole idea of running the notifiers of a given
> priority in parallel, is error handling - if a notifier fails, it would be
> troublesome to roll back, I guess. But if we forget that for a moment, we can
> give this idea a try!

Yes.

> Regards,
> Srivatsa S. Bhat

-- 
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc          rja@sgi.com