Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755709Ab3GYKln (ORCPT ); Thu, 25 Jul 2013 06:41:43 -0400 Received: from merlin.infradead.org ([205.233.59.134]:34854 "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754994Ab3GYKlm (ORCPT ); Thu, 25 Jul 2013 06:41:42 -0400 Date: Thu, 25 Jul 2013 12:41:30 +0200 From: Peter Zijlstra To: Mel Gorman Cc: Srikar Dronamraju , Ingo Molnar , Andrea Arcangeli , Johannes Weiner , Linux-MM , LKML Subject: [PATCH] sched, numa: Improve scanner Message-ID: <20130725104130.GP27075@twins.programming.kicks-ass.net> References: <1373901620-2021-1-git-send-email-mgorman@suse.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1373901620-2021-1-git-send-email-mgorman@suse.de> User-Agent: Mutt/1.5.21 (2012-12-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4213 Lines: 88 Subject: sched, numa: Improve scanner From: Peter Zijlstra Date: Tue Jul 23 17:02:38 CEST 2013 With a trace_printk("working\n"); right after the cmpxchg in task_numa_work() we can see that of a 4 thread process, its always the same task winning the race and doing the protection change. This is a problem since the task doing the protection change has a penalty for taking faults -- it is busy when marking the PTEs. If its always the same task the ->numa_faults[] get severely skewed. Avoid this by delaying the task doing the protection change such that it is unlikely to win the privilege again. Before: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3232 [022] .... 212.787402: task_numa_work: working thread 0/0-3232 [022] .... 212.888473: task_numa_work: working thread 0/0-3232 [022] .... 212.989538: task_numa_work: working thread 0/0-3232 [022] .... 213.090602: task_numa_work: working thread 0/0-3232 [022] .... 213.191667: task_numa_work: working thread 0/0-3232 [022] .... 213.292734: task_numa_work: working thread 0/0-3232 [022] .... 213.393804: task_numa_work: working thread 0/0-3232 [022] .... 213.494869: task_numa_work: working thread 0/0-3232 [022] .... 213.596937: task_numa_work: working thread 0/0-3232 [022] .... 213.699000: task_numa_work: working thread 0/0-3232 [022] .... 213.801067: task_numa_work: working thread 0/0-3232 [022] .... 213.903155: task_numa_work: working thread 0/0-3232 [022] .... 214.005201: task_numa_work: working thread 0/0-3232 [022] .... 214.107266: task_numa_work: working thread 0/0-3232 [022] .... 214.209342: task_numa_work: working After: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3253 [005] .... 136.865051: task_numa_work: working thread 0/2-3255 [026] .... 136.965134: task_numa_work: working thread 0/3-3256 [024] .... 137.065217: task_numa_work: working thread 0/3-3256 [024] .... 137.165302: task_numa_work: working thread 0/3-3256 [024] .... 137.265382: task_numa_work: working thread 0/0-3253 [004] .... 137.366465: task_numa_work: working thread 0/2-3255 [026] .... 137.466549: task_numa_work: working thread 0/0-3253 [004] .... 137.566629: task_numa_work: working thread 0/0-3253 [004] .... 137.666711: task_numa_work: working thread 0/1-3254 [028] .... 137.766799: task_numa_work: working thread 0/0-3253 [004] .... 137.866876: task_numa_work: working thread 0/2-3255 [026] .... 137.966960: task_numa_work: working thread 0/1-3254 [028] .... 138.067041: task_numa_work: working thread 0/2-3255 [026] .... 138.167123: task_numa_work: working thread 0/3-3256 [024] .... 138.267207: task_numa_work: working Signed-off-by: Peter Zijlstra --- kernel/sched/fair.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -1316,6 +1316,12 @@ void task_numa_work(struct callback_head return; /* + * Delay this task enough that another task of this mm will likely win + * the next time around. + */ + p->node_stamp += 2 * TICK_NSEC; + + /* * Do not set pte_numa if the current running node is rate-limited. * This loses statistics on the fault but if we are unwilling to * migrate to this node, it is less likely we can do useful work @@ -1405,7 +1411,7 @@ void task_tick_numa(struct rq *rq, struc if (now - curr->node_stamp > period) { if (!curr->node_stamp) curr->numa_scan_period = task_scan_min(curr); - curr->node_stamp = now; + curr->node_stamp += period; if (!time_before(jiffies, curr->mm->numa_next_scan)) { init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/