Message-ID: <1338307528.26856.106.camel@twins>
Subject: Re: [PATCH 21/35] autonuma: teach CFS about autonuma affinity
From: Peter Zijlstra
To: Andrea Arcangeli
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Hillf Danton,
    Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
    Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith,
    "Paul E. McKenney", Lai Jiangshan, Bharata B Rao, Lee Schermerhorn,
    Rik van Riel, Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter
Date: Tue, 29 May 2012 18:05:28 +0200
In-Reply-To: <1337965359-29725-22-git-send-email-aarcange@redhat.com>
References: <1337965359-29725-1-git-send-email-aarcange@redhat.com>
    <1337965359-29725-22-git-send-email-aarcange@redhat.com>

On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> The CFS scheduler is still in charge of all scheduling
> decisions. AutoNUMA balancing at times will override those. But
> generally we'll just rely on the CFS scheduler to keep doing its
> thing, but while preferring the autonuma affine nodes when deciding
> to move a process to a different runqueue or when waking it up.
>
> For example the idle balancing, will look into the runqueues of the
> busy CPUs, but it'll search first for a task that wants to run into
> the idle CPU in AutoNUMA terms (task_autonuma_cpu() being true).
>
> Most of this is encoded in the can_migrate_task becoming AutoNUMA
> aware and running two passes for each balancing pass, the first NUMA
> aware, and the second one relaxed.
>
> The idle/newidle balancing is always allowed to fallback into
> non-affine AutoNUMA tasks. The load_balancing (which is more a
> fairness than a performance issue) is instead only able to cross over
> the AutoNUMA affinity if the flag controlled by
> /sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it
> is set by default).

This is unacceptable, and contradicts your earlier claim that you rely
on the regular load-balancer.

The strict mode needs to go. Load-balancing is best effort and fairness
is important -- so much so to some people that I get complaints the
current thing isn't strong enough.

Your strict mode basically supplants any and all balancing done at node
level and above.

Please use something like:

  https://lkml.org/lkml/2012/5/19/53

with the sched_setnode() function from:

  https://lkml.org/lkml/2012/5/18/109

Fairness matters because people expect similar throughput or runtimes,
so the order should be: first ensure equal load on the cpus, and only
then bother with node placement.

Furthermore, load-balancing does things like trying to place tasks that
wake each other closer together; your strict mode completely breaks
that. Instead, if the balancer finds these tasks are related and should
be together, that should be a hint the memory needs to come to them, not
the other way around.
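
To make the objection concrete, here is a minimal userspace sketch of the
two-pass pull described in the quoted text, with and without the strict
flag. It is a toy model, not kernel code: struct toy_task,
toy_cpu_to_node(), toy_pick_task() and toy_balance() are hypothetical
names, and the two-CPUs-per-node topology and the "difference of two
tasks" imbalance threshold are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
struct toy_task {
	int cpu;            /* CPU whose runqueue currently holds the task */
	int preferred_node; /* node the task is assumed to be affine to */
};

/* Assumed topology: two CPUs per node. */
static int toy_cpu_to_node(int cpu)
{
	return cpu / 2;
}

/* Count the tasks queued on one CPU. */
static int toy_nr_running(const struct toy_task *tasks, size_t n, int cpu)
{
	int nr = 0;

	for (size_t i = 0; i < n; i++)
		if (tasks[i].cpu == cpu)
			nr++;
	return nr;
}

/*
 * Pick a task to pull from src_cpu to dst_cpu.
 * Pass 0 accepts only tasks affine to dst_cpu's node.
 * Pass 1 accepts any task on src_cpu, and is skipped when strict is set.
 * Returns the task index, or -1 when nothing may be moved.
 */
static int toy_pick_task(const struct toy_task *tasks, size_t n,
			 int src_cpu, int dst_cpu, bool strict)
{
	int dst_node = toy_cpu_to_node(dst_cpu);

	assert(src_cpu != dst_cpu);

	for (int pass = 0; pass < 2; pass++) {
		if (pass == 1 && strict)
			break; /* strict mode never crosses node affinity */
		for (size_t i = 0; i < n; i++) {
			if (tasks[i].cpu != src_cpu)
				continue;
			if (pass == 0 && tasks[i].preferred_node != dst_node)
				continue;
			return (int)i;
		}
	}
	return -1;
}

/*
 * Pull tasks from src_cpu to dst_cpu while src_cpu holds at least two
 * more tasks than dst_cpu (assumed threshold). Returns the number moved.
 */
static int toy_balance(struct toy_task *tasks, size_t n,
		       int src_cpu, int dst_cpu, bool strict)
{
	int moved = 0;

	while (toy_nr_running(tasks, n, src_cpu) -
	       toy_nr_running(tasks, n, dst_cpu) >= 2) {
		int i = toy_pick_task(tasks, n, src_cpu, dst_cpu, strict);

		if (i < 0)
			break; /* imbalance remains */
		tasks[i].cpu = dst_cpu;
		moved++;
	}
	return moved;
}
```

In this toy, four tasks that all prefer node 0 and all sit on CPU 0 stay
there under strict mode while a CPU on node 1 idles; with the relaxed
second pass the load is evened out first, and node placement only decides
which task is tried first.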