Date: Mon, 24 Apr 2017 13:14:44 -0700
From: Tejun Heo
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Vincent Guittot,
    Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
Subject: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Message-ID: <20170424201444.GC14169@wtj.duckdns.org>
In-Reply-To: <20170424201344.GA14169@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org>

We noticed that with the cgroup CPU controller in use, scheduling
latency gets wonky regardless of nesting level or weight
configuration.  This is easily reproducible with Chris Mason's
schbench[1].

All tests are run on a single-socket, 16-core, 32-thread machine.
While the machine is mostly idle, it isn't completely so; there's
always some variable management load going on.  The command used is

  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

which measures scheduling latency with 32 threads, each of which
repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
representative result when running from the root cgroup.

  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
  Latency percentiles (usec)
          50.0000th: 26
          75.0000th: 62
          90.0000th: 74
          95.0000th: 86
          *99.0000th: 887
          99.5000th: 3692
          99.9000th: 10832
          min=0, max=13374

The following is inside a first-level CPU cgroup with the maximum
weight.

  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
  Latency percentiles (usec)
          50.0000th: 31
          75.0000th: 65
          90.0000th: 71
          95.0000th: 91
          *99.0000th: 7288
          99.5000th: 10352
          99.9000th: 12496
          min=0, max=13023

Note the drastic increase in p99 scheduling latency.

After investigation, it turned out that update_sd_lb_stats(), which
load_balance() uses to pick the most loaded group, was often picking
the wrong group.  A CPU which has one schbench thread running and
another queued wouldn't report the correspondingly higher
weighted_cpuload() and would get passed over as the target of load
balancing.

weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is
the sum of the load_avg of all queued sched_entities.  Without
cgroups, or at the root cgroup, each task's load_avg contributes
directly to the sum.  When a task wakes up or goes to sleep, the
change is immediately reflected in runnable_load_avg, which in turn
affects load balancing.

However, when the CPU cgroup is in use, a nested cfs_rq blocks this
immediate reflection.  When a task wakes up inside a cgroup, the
nested cfs_rq's runnable_load_avg is updated but the change isn't
propagated to the parent.  It only affects the matching
sched_entity's load_avg over time, which then gets propagated to the
runnable_load_avg at that level.  This leaves weighted_cpuload()
frequently out of sync, leading to suboptimal behavior of
load_balance() and the increase in scheduling latencies shown above.

This patch fixes the issue by updating propagate_entity_load_avg() to
always propagate to the parent's runnable_load_avg.
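
To make the failure mode concrete, here is a toy userspace sketch; it
is purely illustrative (toy_cfs_rq, toy_enqueue() and the flat 1024
load are made-up stand-ins and ignore the real PELT math), but it
shows how the root's view goes stale when a nested runnable_load_avg
change isn't pushed upward immediately:

  /*
   * Toy model of the propagation problem, NOT kernel code.  Real
   * cfs_rq bookkeeping involves PELT decay, group-weight scaling, etc.
   */
  #include <stdio.h>

  struct toy_cfs_rq {
          long runnable_load_avg;        /* sum of queued entities' load_avg */
          struct toy_cfs_rq *parent;     /* NULL for the root cfs_rq */
  };

  /* A waking task adds its load to the cfs_rq it is queued on. */
  static void toy_enqueue(struct toy_cfs_rq *cfs_rq, long load, int propagate)
  {
          cfs_rq->runnable_load_avg += load;

          /*
           * With immediate propagation the root sees the change right
           * away; without it, this toy root never does, standing in for
           * the kernel's slow catch-up through the group sched_entity.
           */
          if (propagate)
                  for (cfs_rq = cfs_rq->parent; cfs_rq; cfs_rq = cfs_rq->parent)
                          cfs_rq->runnable_load_avg += load;
  }

  int main(void)
  {
          struct toy_cfs_rq root  = { .runnable_load_avg = 0, .parent = NULL };
          struct toy_cfs_rq group = { .runnable_load_avg = 0, .parent = &root };

          toy_enqueue(&group, 1024, 0);  /* wakeup, no propagation */
          printf("no propagation:        root sees %ld\n", root.runnable_load_avg);

          toy_enqueue(&group, 1024, 1);  /* wakeup, immediate propagation */
          printf("immediate propagation: root sees %ld\n", root.runnable_load_avg);
          return 0;
  }

Built with any C compiler, the first line prints 0 (the root, and thus
weighted_cpuload(), never notices the queued task) while the second
prints 1024, which is exactly the staleness the change below removes.
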
Combined with the previous patch, this keeps a cfs_rq's
runnable_load_avg always equal to the sum of the scaled loads of all
tasks queued below it, removing the artifacts from nested cfs_rqs.
The following is from inside three levels of nesting with the patch
applied.

  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
  Latency percentiles (usec)
          50.0000th: 40
          75.0000th: 71
          90.0000th: 89
          95.0000th: 108
          *99.0000th: 679
          99.5000th: 3500
          99.9000th: 10960
          min=0, max=13790

[1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git

Signed-off-by: Tejun Heo
Cc: Vincent Guittot
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Mike Galbraith
Cc: Paul Turner
Cc: Chris Mason
---
 kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
 1 file changed, 21 insertions(+), 13 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
 
 /* Take into account change of load of a child task group */
 static inline void
-update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
+update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
+                   bool propagate_avg)
 {
         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
         long load = 0, delta;
@@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
         se->avg.load_avg = load;
         se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
 
-        /* Update parent cfs_rq load */
-        add_positive(&cfs_rq->avg.load_avg, delta);
-        cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+        if (propagate_avg) {
+                /* Update parent cfs_rq load */
+                add_positive(&cfs_rq->avg.load_avg, delta);
+                cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
+        }
 
         /*
          * If the sched_entity is already enqueued, we also have to update the
@@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
 /* Update task and its cfs_rq load average */
 static inline int propagate_entity_load_avg(struct sched_entity *se)
 {
-        struct cfs_rq *cfs_rq;
+        struct cfs_rq *cfs_rq = cfs_rq_of(se);
+        bool propagate_avg;
 
         if (entity_is_task(se))
                 return 0;
 
-        if (!test_and_clear_tg_cfs_propagate(se))
-                return 0;
-
-        cfs_rq = cfs_rq_of(se);
+        propagate_avg = test_and_clear_tg_cfs_propagate(se);
 
-        set_tg_cfs_propagate(cfs_rq);
+        /*
+         * We want to keep @cfs_rq->runnable_load_avg always in sync so
+         * that the load balancer can accurately determine the busiest CPU
+         * regardless of cfs_rq nesting.
+         */
+        update_tg_cfs_load(cfs_rq, se, propagate_avg);
 
-        update_tg_cfs_util(cfs_rq, se);
-        update_tg_cfs_load(cfs_rq, se);
+        if (propagate_avg) {
+                set_tg_cfs_propagate(cfs_rq);
+                update_tg_cfs_util(cfs_rq, se);
+        }
 
-        return 1;
+        return propagate_avg;
 }
 
 #else /* CONFIG_FAIR_GROUP_SCHED */