Date: Fri, 29 Jun 2012 22:11:35 +0800
From: Nai Xia <nai.xia@gmail.com>
To: Peter Zijlstra
Cc: Andrea Arcangeli, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Hillf Danton, Dan Smith, Linus Torvalds, Andrew Morton, Thomas Gleixner,
 Ingo Molnar, Paul Turner, Suresh Siddha, Mike Galbraith, "Paul E. McKenney",
 Lai Jiangshan, Bharata B Rao, Lee Schermerhorn, Rik van Riel,
 Johannes Weiner, Srivatsa Vaddagiri, Christoph Lameter, Alex Shi,
 Mauricio Faria de Oliveira, Konrad Rzeszutek Wilk, Don Morris,
 Benjamin Herrenschmidt
Subject: Re: [PATCH 13/40] autonuma: CPU follow memory algorithm
In-Reply-To: <1340894776.28750.44.camel@twins>

On 06/28/2012 22:46, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
>> +/*
>> + * This function sched_autonuma_balance() is responsible for deciding
>> + * which is the best CPU each process should be running on according
>> + * to the NUMA statistics collected in mm->mm_autonuma and
>> + * tsk->task_autonuma.
>> + *
>> + * The core math that evaluates the current CPU against the CPUs of
>> + * all _other_ nodes is this:
>> + *
>> + *	if (w_nid > w_other && w_nid > w_cpu_nid)
>> + *		weight = w_nid - w_other + w_nid - w_cpu_nid;
>> + *
>> + * w_nid: NUMA affinity of the current thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_other: NUMA affinity of the other thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
>> + * the current CPU.
>> + *
>> + * weight: combined NUMA affinity benefit in moving the current
>> + * thread/process to the other CPU taking into account both the higher
>> + * NUMA affinity of the current process if run on the other CPU, and
>> + * the increase in NUMA affinity in the other CPU by replacing the
>> + * other process.
>
> A lot of words, all meaningless without a proper definition of w_*
> stuff. How are they calculated and why.
>
>> + * We run the above math on every CPU not part of the current NUMA
>> + * node, and we compare the current process against the other
>> + * processes running in the other CPUs in the remote NUMA nodes. The
>> + * objective is to select the cpu (in selected_cpu) with a bigger
>> + * "weight". The bigger the "weight" the biggest gain we'll get by
>> + * moving the current process to the selected_cpu (not only the
>> + * biggest immediate CPU gain but also the fewer async memory
>> + * migrations that will be required to reach full convergence
>> + * later). If we select a cpu we migrate the current process to it.
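
If I read the rule quoted above correctly, it boils down to something
like the following userspace sketch. The struct, the affinity numbers
and the sample values are mine for illustration only, they are not the
patch's data structures:

/* Minimal userspace sketch of the selection rule quoted above.
 * The struct and the weight values are illustrative only; they are
 * not the kernel's task_autonuma/mm_autonuma representation. */
#include <stdio.h>

struct cpu_sample {
	int cpu;
	int nid;	/* NUMA node of this CPU */
	long w_nid;	/* affinity of the current task if moved to this CPU */
	long w_other;	/* affinity of the task currently running there */
};

int main(void)
{
	/* Affinity of the current task on the CPU it runs on now. */
	long w_cpu_nid = 40;
	int cur_nid = 0;

	struct cpu_sample remote[] = {
		{ .cpu = 4, .nid = 1, .w_nid = 70, .w_other = 30 },
		{ .cpu = 5, .nid = 1, .w_nid = 55, .w_other = 60 },
		{ .cpu = 8, .nid = 2, .w_nid = 65, .w_other = 50 },
	};
	long best_weight = 0;
	int selected_cpu = -1;

	for (unsigned i = 0; i < sizeof(remote) / sizeof(remote[0]); i++) {
		struct cpu_sample *s = &remote[i];
		long weight;

		if (s->nid == cur_nid)
			continue;	/* only CPUs of other nodes are examined */
		/* Both conditions must hold: we gain on the other CPU
		 * (w_nid > w_other) and we gain over staying put
		 * (w_nid > w_cpu_nid). */
		if (s->w_nid > s->w_other && s->w_nid > w_cpu_nid) {
			weight = (s->w_nid - s->w_other) + (s->w_nid - w_cpu_nid);
			if (weight > best_weight) {
				best_weight = weight;
				selected_cpu = s->cpu;
			}
		}
	}

	if (selected_cpu >= 0)
		printf("migrate to cpu %d (weight %ld)\n", selected_cpu, best_weight);
	else
		printf("stay on the current cpu\n");
	return 0;
}

On the sample values it picks cpu 4, because that move maximizes the
combined gain (w_nid - w_other) + (w_nid - w_cpu_nid).
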
>
> So you do something like:
>
>   max_(i, node(i) != curr_node) { weight_i }
>
> That is, you have this weight, then what do you do?
>
>> + * Checking that the current process has higher NUMA affinity than the
>> + * other process on the other CPU (w_nid > w_other) and not only that
>> + * the current process has higher NUMA affinity on the other CPU than
>> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping pongs
>> + * and ensures (temporary) convergence of the algorithm (at least from
>> + * a CPU standpoint).
>
> How does that follow?
>
>> + * It's then up to the idle balancing code that will run as soon as
>> + * the current CPU goes idle to pick the other process and move it
>> + * here (or in some other idle CPU if any).
>> + *
>> + * By only evaluating running processes against running processes we
>> + * avoid interfering with the CFS stock active idle balancing, which
>> + * is critical to optimal performance with HT enabled. (getting HT
>> + * wrong is worse than running on remote memory so the active idle
>> + * balancing has priority)
>
> what?
>
>> + * Idle balancing and all other CFS load balancing become NUMA
>> + * affinity aware through the introduction of
>> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
>> + * autonuma_node first when it needs to find idle CPUs during idle
>> + * balancing or tasks to pick during load balancing.
>
> You talk a lot about idle balance, but there's zero mention of fairness.
> This is worrysome.
>
>> + * The task's autonuma_node is the node selected by
>> + * sched_autonuma_balance() when it migrates a task to the
>> + * selected_cpu in the selected_nid
>
> I think I already said that strict was out of the question and hard
> movement like that simply didn't make sense.
>
>> + * Once a process/thread has been moved to another node, closer to the
>> + * much of memory it has recently accessed,
>
> closer to the recently accessed memory you mean?
>
>> any memory for that task
>> + * not in the new node moves slowly (asynchronously in the background)
>> + * to the new node. This is done by the knuma_migratedN (where the
>> + * suffix N is the node id) daemon described in mm/autonuma.c.
>> + *
>> + * One non trivial bit of this logic that deserves an explanation is
>> + * how the three crucial variables of the core math
>> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
>> + * the other CPU is running a thread of the current process, or a
>> + * thread of a different process.
>
> No no no,.. its not a friggin detail, its absolutely crucial. Also, if
> you'd given proper definition you wouldn't need to hand wave your way
> around the dynamics either because that would simply follow from the
> definition.
>
>
>> + * Before scanning all other CPUs' runqueues to compute the above
>> + * math,
>
> OK, lets stop calling the one isolated conditional you mentioned 'math'.
> On its own its useless.
>
>> we also verify that the current CPU is not already in the
>> + * preferred NUMA node from the point of view of both the process
>> + * statistics and the thread statistics. In such case we can return to
>> + * the caller without having to check any other CPUs' runqueues
>> + * because full convergence has been already reached.
>
> Things being in the 'preferred' place don't have much to do with
> convergence. Does your model have local minima/maxima where it can get
> stuck, or does it always find a global min/max?
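
FWIW, my reading of the "already in the preferred node" early exit
described above is roughly the sketch below. Userspace only, the fault
counts are invented; the real statistics live in mm->mm_autonuma and
tsk->task_autonuma:

/* Userspace sketch of the "already converged" early exit described
 * above: if the node we are on already maximizes both the per-thread
 * and the per-process NUMA statistics, there is nothing to scan.
 * The arrays and numbers are invented for illustration. */
#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 3

static int preferred_node(const long faults[NR_NODES])
{
	int best = 0;

	for (int nid = 1; nid < NR_NODES; nid++)
		if (faults[nid] > faults[best])
			best = nid;
	return best;
}

int main(void)
{
	/* Hypothetical per-thread and per-process counts per node. */
	long task_faults[NR_NODES] = { 120, 30, 10 };
	long mm_faults[NR_NODES]   = { 900, 200, 150 };
	int cur_nid = 0;

	bool converged = preferred_node(task_faults) == cur_nid &&
			 preferred_node(mm_faults) == cur_nid;

	if (converged)
		printf("node %d is already preferred, skip the scan\n", cur_nid);
	else
		printf("scan remote runqueues for a better CPU\n");
	return 0;
}
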
>
>
>> + * This algorithm might be expanded to take all runnable processes
>> + * into account but examining just the currently running processes is
>> + * a good enough approximation because some runnable processes may run
>> + * only for a short time so statistically there will always be a bias
>> + * on the processes that uses most the of the CPU. This is ideal
>> + * because it doesn't matter if NUMA balancing isn't optimal for
>> + * processes that run only for a short time.
>
> Almost, but not quite.. it would be so if the sampling could be proven
> to be unbiased. But its quite possible for a task to consume most cpu
> time and never show up as the current task in your load-balance run.

Same here, I have another similar question regarding sampling:

If a process makes very intensive accesses to a small set of pages on
this node, but only occasional accesses to a large set of pages on
another node, will this algorithm make a very bad judgment? I guess the
answer is: it's possible, and the judgment depends on the racing
pattern between the process and your knuma_scand (see the toy sketch
at the end of this mail).

Usually, when we rely on sampling, we work on the assumption that if
the sampling turns out to be inaccurate, we only lose the chance for a
better optimization, but we do NOT make a bad/false judgment.

Andrea, sorry, I don't have enough time to look into all your patch
details (and also since I'm not on the CCs ;-) ), but my intuition
tells me that your current sampling and weight algorithm is far from
optimal.

>
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do.
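
Here is the toy sketch I referred to above. All numbers are invented;
it only illustrates why counting touched pages per scan interval can
rank the nodes differently from counting accesses:

/* Toy illustration of the sampling concern above: a task hammers a
 * small working set on node 0 but occasionally touches a much larger
 * set on node 1.  A scanner that observes distinct pages touched per
 * scan interval (rather than access counts) sees node 1 as "hotter".
 * All numbers are invented. */
#include <stdio.h>

int main(void)
{
	/* Accesses per scan interval (what actually matters for locality). */
	long accesses_node0 = 1000000;	/* small, hot working set */
	long accesses_node1 = 5000;	/* large, cold working set */

	/* Distinct pages touched per scan interval (what a page-granular
	 * scan can observe). */
	long pages_node0 = 16;
	long pages_node1 = 4096;

	printf("by accesses: node0=%ld node1=%ld -> prefer node %d\n",
	       accesses_node0, accesses_node1,
	       accesses_node0 >= accesses_node1 ? 0 : 1);
	printf("by pages:    node0=%ld node1=%ld -> prefer node %d\n",
	       pages_node0, pages_node1,
	       pages_node0 >= pages_node1 ? 0 : 1);
	return 0;
}

Whether the real statistics degrade like this of course depends on how
knuma_scand accumulates the per-node counts, which is exactly my
question.
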