From: Ingo Molnar
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins
Subject: [PATCH 26/33] sched: Introduce staged average NUMA faults
Date: Thu, 22 Nov 2012 23:49:47 +0100
Message-Id: <1353624594-1118-27-git-send-email-mingo@kernel.org>
In-Reply-To: <1353624594-1118-1-git-send-email-mingo@kernel.org>

The current way of building the p->numa_faults[2][node] fault
statistics has a sampling artifact:

The continuous and immediate nature of propagating new fault stats
to the numa_faults array creates a 'pulsating' dynamic that starts
at the average value at the beginning of the scan, increases
monotonically until the scan finishes at about twice the average,
and then drops back to half of its value due to the running average.

Since we rely on these values to balance tasks, this pulsating
behaviour resulted in false migrations and general noise in the
stats.

To solve this, introduce buffering of the current scan via
p->numa_faults_curr[]. The array is co-allocated with
p->numa_faults[] for efficiency reasons, but it is otherwise an
ordinary, separate array. At the end of the scan we propagate the
latest stats into the average stats value. Most of the balancing
code stays unmodified.

The cost of this change is that we delay the effect of the latest
round of faults by one scan - but using the partial fault info was
creating artifacts.

This instantly stabilized the page fault stats and improved
numa02-alike workloads by making them converge faster.
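To see the intended dynamic in isolation, here is a minimal user-space
sketch of the staging scheme described above - it is not the kernel
code; the node count, the array names and the record_fault()/fold_scan()
helpers are illustrative assumptions. Faults recorded during a scan only
touch the buffer, so the averaged values the balancer reads stay stable
until the scan is folded:

  /*
   * Sketch: per-scan fault counts are buffered in faults_curr[] and
   * only folded into the running average faults[] when a scan ends.
   */
  #include <stdio.h>

  #define NR_NODES	4		/* assumed node count for the sketch */

  static unsigned long faults[2 * NR_NODES];	/* long-term running average */
  static unsigned long faults_curr[2 * NR_NODES];/* current-scan buffer */

  /* Fault path during a scan: only the buffer is touched. */
  static void record_fault(int node, int priv)
  {
  	faults_curr[2 * node + priv]++;
  }

  /* Once per completed scan: fold the buffered counts into the average. */
  static void fold_scan(void)
  {
  	int node, priv;

  	for (node = 0; node < NR_NODES; node++) {
  		for (priv = 0; priv < 2; priv++) {
  			int idx = 2 * node + priv;

  			/* Same running average as the patch: avg = (avg + new) / 2 */
  			faults[idx] += faults_curr[idx];
  			faults[idx] /= 2;
  			faults_curr[idx] = 0;
  		}
  	}
  }

  int main(void)
  {
  	int scan, i;

  	/* Two scans with a steady load of 100 faults on node 1: */
  	for (scan = 0; scan < 2; scan++) {
  		for (i = 0; i < 100; i++)
  			record_fault(1, 0);
  		fold_scan();
  		printf("after scan %d: node 1 avg = %lu\n", scan, faults[2]);
  	}
  	return 0;
  }

With a steady 100 faults per scan the folded average goes 50, 75, 87, ...
and converges towards the per-scan rate, instead of ramping up and
collapsing within each scan as the unbuffered scheme did.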
Cc: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Andrea Arcangeli
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Hugh Dickins
Signed-off-by: Ingo Molnar
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 20 +++++++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f65323..92b41b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1511,6 +1511,7 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp */
 	unsigned long numa_weight;
 	unsigned long *numa_faults;
+	unsigned long *numa_faults_curr;
 	struct callback_head numa_work;
 #endif /* CONFIG_NUMA_BALANCING */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c46b45..1ab11be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -852,12 +852,26 @@ static void task_numa_placement(struct task_struct *p)
 
 	p->numa_scan_seq = seq;
 
+	/*
+	 * Update the fault average with the result of the latest
+	 * scan:
+	 */
 	for (node = 0; node < nr_node_ids; node++) {
 		faults = 0;
 		for (priv = 0; priv < 2; priv++) {
-			faults      += p->numa_faults[2*node + priv];
-			total[priv] += p->numa_faults[2*node + priv];
-			p->numa_faults[2*node + priv] /= 2;
+			unsigned int new_faults;
+			unsigned int idx;
+
+			idx = 2*node + priv;
+			new_faults = p->numa_faults_curr[idx];
+			p->numa_faults_curr[idx] = 0;
+
+			/* Keep a simple running average: */
+			p->numa_faults[idx] += new_faults;
+			p->numa_faults[idx] /= 2;
+
+			faults      += p->numa_faults[idx];
+			total[priv] += p->numa_faults[idx];
 		}
 		if (faults > max_faults) {
 			max_faults = faults;
-- 
1.7.11.7