Subject: Re: [PATCH 00/35] AutoNUMA alpha14
From: Peter Zijlstra
To: Linus Torvalds
Cc: Rik van Riel, Andrea Arcangeli, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, Hillf Danton, Dan Smith, Andrew Morton,
    Thomas Gleixner, Ingo Molnar, Paul Turner, Suresh Siddha,
    Mike Galbraith, "Paul E. McKenney", Lai Jiangshan, Bharata B Rao,
    Lee Schermerhorn, Johannes Weiner, Srivatsa Vaddagiri,
    Christoph Lameter
Date: Wed, 30 May 2012 16:46:40 +0200
Message-ID: <1338389200.26856.273.camel@twins>
References: <1337965359-29725-1-git-send-email-aarcange@redhat.com>
    <4FC112AB.1040605@redhat.com>

On Sat, 2012-05-26 at 13:42 -0700, Linus Torvalds wrote:
> I'm a *firm* believer that if it cannot be done automatically "well
> enough", the absolute last thing we should ever do is worry about the
> crazy people who think they can tweak it to perfection with complex
> interfaces.
>
> You can't do it, except for trivial loads (often benchmarks), and for
> very specific machines.
>
> So I think very strongly that we should entirely dismiss all the
> people who want to do manual placement and claim that they know what
> their loads do. They're either full of sh*t (most likely), or they
> have a very specific benchmark and platform that they are tuning for
> that is totally irrelevant to everybody else.
>
> What we *should* try to aim for is a system that doesn't do horribly
> badly right out of the box. IOW, no tuning what-so-ever (at most a
> kind of "yes, I want you to try to do the NUMA thing" flag to just
> enable it at all), and try to not suck.
>
> Seriously. "Try to avoid sucking" is *way* superior to "We can let the
> user tweak things to their hearts content". Because users won't get it
> right.
>
> Give the anal people a knob they can tweak, and tell them it does
> something fancy. And never actually wire the damn thing up. They'll be
> really happy with their OCD tweaking, and do lots of nice graphs that
> just show how the error bars are so big that you can find any damn
> pattern you want in random noise.

So the thing is, my home-node-per-process approach should work for
everything except the case where a single process outstrips a single
node in either CPU utilization or memory consumption.

Now I claim such processes are rare, since nodes are big, typically 6-8
cores. Writing anything that can sustain parallel execution larger than
that is very specialist work (and typically already employs strong data
separation). Yes, there are such things out there, some use JVMs, some
are virtual machines, some are regular applications, but by and large
processes are small compared to nodes.

So my approach is to focus on the normal case, and to provide two
system calls to replace sched_setaffinity() and mbind() for the people
who use those.

Now, maybe I shouldn't have bothered with the system calls.. but I
thought providing something better than hard-affinity would be nice.
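For reference, the hard-affinity interface those two calls would
replace looks roughly like this from userspace. A minimal sketch, not
the proposed API; it assumes node 0 spans CPUs 0-5 (a real tool would
read /sys/devices/system/node/node0/cpulist) and that libnuma's
<numaif.h> is available (build with -lnuma):

#define _GNU_SOURCE
#include <sched.h>        /* CPU_ZERO, CPU_SET, sched_setaffinity */
#include <numaif.h>       /* mbind, MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    /* Pin this task to the (assumed) six CPUs of node 0. */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    for (int cpu = 0; cpu < 6; cpu++)
        CPU_SET(cpu, &cpus);
    if (sched_setaffinity(0, sizeof(cpus), &cpus))
        perror("sched_setaffinity");

    /* Pin a 16 MiB anonymous buffer to node 0's memory. */
    size_t len = 16UL << 20;
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    unsigned long nodemask = 1UL << 0;    /* bit 0 == node 0 */
    if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
        perror("mbind");
    return 0;
}

Once both calls succeed the kernel may neither run the task elsewhere
nor place that memory elsewhere, which is exactly the rigidity a
softer home-node interface would avoid.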
Andrea went the other way and focused on these big processes. His
approach relies on a pte scanner and faults. His code builds a
page<->thread map and, using this data, either moves memory or
processes around (I'm a little vague on the details simply because I
haven't seen it explained anywhere yet -- and the code is non-obvious).

I have a number of problems with both the approach and the
implementation.

On the approach, my biggest complaints are:

 - the complexity; it focuses on the rarest sort of processes and thus
   results in a rather complex setup.

 - load-balance state explosion; the page tables become part of the
   load-balance state -- this is a lot of extra state, making
   reproduction more 'interesting'.

 - the overhead; since it works per page, it needs per-page state
   (with 4 KiB pages, even four bytes of state per page comes to about
   a megabyte per gigabyte of memory).

 - I don't see how it can reliably work for virtual machines, because
   the host page<->thread (vcpu) relation doesn't reflect a
   data<->compute relation in this case. The guest scheduler can move
   the guest thread (the compute part) around between the vcpus at a
   much higher rate than the host will update its page<->vcpu map.

On the implementation:

 - he works around the scheduler instead of with it.

 - it's x86-only (although he claims adding archs is trivial, I've yet
   to see the first !x86 support).

 - a complete lack of useful comments describing the balancing goal
   and approach. The worst part is that I've asked for this stuff
   several times, but nothing seems forthcoming.

Anyway, I prefer doing the simple thing first and then seeing if
there's need for more complexity, esp. given the overheads involved.
But if you prefer we can dive off the deep end :-)