Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752876AbdGPOOM (ORCPT ); Sun, 16 Jul 2017 10:14:12 -0400 Received: from shadbolt.e.decadent.org.uk ([88.96.1.126]:44258 "EHLO shadbolt.e.decadent.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752855AbdGPOOH (ORCPT ); Sun, 16 Jul 2017 10:14:07 -0400 Content-Type: text/plain; charset="UTF-8" Content-Disposition: inline Content-Transfer-Encoding: 8bit MIME-Version: 1.0 From: Ben Hutchings To: linux-kernel@vger.kernel.org, stable@vger.kernel.org CC: akpm@linux-foundation.org, "Linus Torvalds" , "Vincent Guittot" , "Peter Zijlstra (Intel)" , "Morten Rasmussen" , "Thomas Gleixner" , "Matt Fleming" , "Mike Galbraith" , "Ingo Molnar" , "Mike Galbraith" , "Frederic Weisbecker" Date: Sun, 16 Jul 2017 14:56:46 +0100 Message-ID: X-Mailer: LinuxStableQueue (scripts by bwh) Subject: [PATCH 3.16 052/178] sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting In-Reply-To: X-SA-Exim-Connect-IP: 2a02:8011:400e:2:6f00:88c8:c921:d332 X-SA-Exim-Mail-From: ben@decadent.org.uk X-SA-Exim-Scanned: No (on shadbolt.decadent.org.uk); SAEximRunCond expanded to false Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3059 Lines: 83 3.16.46-rc1 review patch. If anyone has any objections, please let me know. ------------------ From: Matt Fleming commit 6e5f32f7a43f45ee55c401c0b9585eb01f9629a8 upstream. If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to the pending sample window time on exit, setting the next update not one window into the future, but two. This situation on exiting NO_HZ is described by: this_rq->calc_load_update < jiffies < calc_load_update In this scenario, what we should be doing is: this_rq->calc_load_update = calc_load_update [ next window ] But what we actually do is: this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ] This has the effect of delaying load average updates for potentially up to ~9seconds. This can result in huge spikes in the load average values due to per-cpu uninterruptible task counts being out of sync when accumulated across all CPUs. It's safe to update the per-cpu active count if we wake between sample windows because any load that we left in 'calc_load_idle' will have been zero'd when the idle load was folded in calc_global_load(). This issue is easy to reproduce before, commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking") just by forking short-lived process pipelines built from ps(1) and grep(1) in a loop. I'm unable to reproduce the spikes after that commit, but the bug still seems to be present from code review. Signed-off-by: Matt Fleming Signed-off-by: Peter Zijlstra (Intel) Cc: Frederic Weisbecker Cc: Linus Torvalds Cc: Mike Galbraith Cc: Mike Galbraith Cc: Morten Rasmussen Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vincent Guittot Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again") Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar [bwh: Backported to 3.16: adjust filename] Signed-off-by: Ben Hutchings --- kernel/sched/proc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/kernel/sched/proc.c +++ b/kernel/sched/proc.c @@ -205,8 +205,9 @@ void calc_load_exit_idle(void) struct rq *this_rq = this_rq(); /* - * If we're still before the sample window, we're done. + * If we're still before the pending sample window, we're done. */ + this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update)) return; @@ -215,7 +216,6 @@ void calc_load_exit_idle(void) * accounted through the nohz accounting, so skip the entire deal and * sync up for the next window. */ - this_rq->calc_load_update = calc_load_update; if (time_before(jiffies, this_rq->calc_load_update + 10)) this_rq->calc_load_update += LOAD_FREQ; }