From: Vincent Guittot
Date: Tue, 25 Apr 2017 14:59:18 +0200
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
    Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
References: <20170424201344.GA14169@wtj.duckdns.org>
    <20170424201444.GC14169@wtj.duckdns.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On 25 April 2017 at 11:05, Vincent Guittot wrote:
> On 25 April 2017 at 10:46, Vincent Guittot wrote:
>> On 24 April 2017 at 22:14, Tejun Heo wrote:
>>> We noticed that with the cgroup CPU controller in use, the scheduling
>>> latency gets wonky regardless of nesting level or weight
>>> configuration.  This is easily reproducible with Chris Mason's
>>> schbench[1].
>>>
>>> All tests are run on a single-socket, 16-core, 32-thread machine.
>>> While the machine is mostly idle, it isn't completely so; there's
>>> always some variable management load going on.  The command used is
>>>
>>>   schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>
>>> which measures scheduling latency with 32 threads, each of which
>>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>>> representative result when running from the root cgroup.
>>>
>>>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>   Latency percentiles (usec)
>>>           50.0000th: 26
>>>           75.0000th: 62
>>>           90.0000th: 74
>>>           95.0000th: 86
>>>           *99.0000th: 887
>>>           99.5000th: 3692
>>>           99.9000th: 10832
>>>           min=0, max=13374
>>>
>>> The following is inside a first-level CPU cgroup with the maximum
>>> weight.
>>>
>>>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>   Latency percentiles (usec)
>>>           50.0000th: 31
>>>           75.0000th: 65
>>>           90.0000th: 71
>>>           95.0000th: 91
>>>           *99.0000th: 7288
>>>           99.5000th: 10352
>>>           99.9000th: 12496
>>>           min=0, max=13023
>>>
>>> Note the drastic increase in p99 scheduling latency.  After
>>> investigation, it turned out that update_sd_lb_stats(), which is used
>>> by load_balance() to pick the most loaded group, was often picking
>>> the wrong group.  A CPU which has one schbench thread running and
>>> another queued wouldn't report the correspondingly higher
>>> weighted_cpuload() and would get overlooked as the target of load
>>> balancing.
>>>
>>> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is
>>> the sum of the load_avg of all queued sched_entities.  Without
>>> cgroups, or at the root cgroup, each task's load_avg contributes
>>> directly to the sum.  When a task wakes up or goes to sleep, the
>>> change is immediately reflected in runnable_load_avg, which in turn
>>> affects load balancing.
>>>
>>> However, when the CPU cgroup is in use, a nesting cfs_rq blocks this
>>> immediate reflection.  When a task wakes up inside a cgroup, the
>>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>>> propagated to the parent.  It only affects the matching
>>> sched_entity's load_avg over time, which then gets propagated to the
>>> runnable_load_avg at that level.  This makes weighted_cpuload() often
>>> temporarily out of sync, leading to suboptimal behavior of
>>> load_balance() and the increase in scheduling latencies shown above.
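
For illustration (and ignoring the scaling by group weight/shares), the
invariant being discussed can be spelled out as a tiny sketch.  The
helper below is hypothetical and not in the tree; only rq->cfs_tasks,
se.group_node, se.avg.load_avg and rq->cfs.runnable_load_avg are the
real fields from kernel/sched/fair.c:

/*
 * Illustrative only: sum the load_avg of every runnable CFS task on
 * this CPU, regardless of how deeply it is nested in task groups.
 * rq->cfs_tasks links all of them, so this is the value the root
 * cfs_rq's runnable_load_avg should track for load_balance().
 */
static unsigned long expected_weighted_cpuload(struct rq *rq)
{
	struct task_struct *p;
	unsigned long sum = 0;

	list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
		sum += p->se.avg.load_avg;

	/*
	 * Without cgroups, rq->cfs.runnable_load_avg already matches this
	 * at every enqueue/dequeue; with nested cfs_rqs it only converges
	 * toward it over time, which is the lag described above.
	 */
	return sum;
}
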
>>>
>>> This patch fixes the issue by updating propagate_entity_load_avg()
>>> to always propagate to the parent's runnable_load_avg.  Combined
>>> with the previous patch, this keeps a cfs_rq's runnable_load_avg
>>> always the sum of the scaled loads of all tasks queued below it,
>>> removing the artifacts from nesting cfs_rqs.  The following is from
>>> inside three levels of nesting with the patch applied.
>>
>> So you are changing the purpose of propagate_entity_load_avg(), which
>> aims to propagate load_avg/util_avg changes only when a task
>> migrates, and you also want to propagate the enqueue/dequeue into the
>> parent cfs_rq->runnable_load_avg.
>
> In fact you want the sched_entity's load_avg to reflect
> cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg.

I have run a quick test with your patches and schbench on my platform.
I haven't been able to reproduce your regression, but my platform is
quite different from yours (only 8 cores without SMT).

But most importantly, the parent cfs_rq->runnable_load_avg never
reaches 0 (or almost 0) when it is idle.  Instead, it still has a
residual runnable_load_avg (this is not due to rounding in the
computation), whereas runnable_load_avg should be 0.
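
To make that observation concrete, the expected behaviour could be
written as a hypothetical debug check like the one below (not something
to merge; the helper name and the threshold are made up, the fields are
the ones in kernel/sched/fair.c):

/*
 * When no CFS entity is queued on this CPU anymore, the root cfs_rq's
 * runnable_load_avg should have decayed to (almost) zero; the small
 * slack only covers rounding residue.
 */
static inline void check_idle_runnable_load(struct rq *rq)
{
	WARN_ONCE(rq->cfs.nr_running == 0 &&
		  rq->cfs.runnable_load_avg > 16,
		  "idle cfs_rq still has runnable_load_avg=%lu\n",
		  rq->cfs.runnable_load_avg);
}

The threshold is arbitrary; the point is that the residual load I see
when the CPU is idle is clearly above rounding noise.
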
Just out of curiosity: is your regression still there if you disable
SMT/hyperthreading on your platform?

Regards,
Vincent

>
>>
>>>
>>>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>   Latency percentiles (usec)
>>>           50.0000th: 40
>>>           75.0000th: 71
>>>           90.0000th: 89
>>>           95.0000th: 108
>>>           *99.0000th: 679
>>>           99.5000th: 3500
>>>           99.9000th: 10960
>>>           min=0, max=13790
>>>
>>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>>
>>> Signed-off-by: Tejun Heo
>>> Cc: Vincent Guittot
>>> Cc: Ingo Molnar
>>> Cc: Peter Zijlstra
>>> Cc: Mike Galbraith
>>> Cc: Paul Turner
>>> Cc: Chris Mason
>>> ---
>>>  kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>>
>>>  /* Take into account change of load of a child task group */
>>>  static inline void
>>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> +		   bool propagate_avg)
>>>  {
>>>  	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>>  	long load = 0, delta;
>>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq
>>>  	se->avg.load_avg = load;
>>>  	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>>
>>> -	/* Update parent cfs_rq load */
>>> -	add_positive(&cfs_rq->avg.load_avg, delta);
>>> -	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +	if (propagate_avg) {
>>> +		/* Update parent cfs_rq load */
>>> +		add_positive(&cfs_rq->avg.load_avg, delta);
>>> +		cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +	}
>>>
>>>  	/*
>>>  	 * If the sched_entity is already enqueued, we also have to update the
>>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>>  /* Update task and its cfs_rq load average */
>>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>>  {
>>> -	struct cfs_rq *cfs_rq;
>>> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> +	bool propagate_avg;
>>>
>>>  	if (entity_is_task(se))
>>>  		return 0;
>>>
>>> -	if (!test_and_clear_tg_cfs_propagate(se))
>>> -		return 0;
>>> -
>>> -	cfs_rq = cfs_rq_of(se);
>>> +	propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>>
>>> -	set_tg_cfs_propagate(cfs_rq);
>>> +	/*
>>> +	 * We want to keep @cfs_rq->runnable_load_avg always in sync so
>>> +	 * that the load balancer can accurately determine the busiest CPU
>>> +	 * regardless of cfs_rq nesting.
>>> +	 */
>>> +	update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>>
>>> -	update_tg_cfs_util(cfs_rq, se);
>>> -	update_tg_cfs_load(cfs_rq, se);
>>> +	if (propagate_avg) {
>>> +		set_tg_cfs_propagate(cfs_rq);
>>> +		update_tg_cfs_util(cfs_rq, se);
>>> +	}
>>>
>>> -	return 1;
>>> +	return propagate_avg;
>>>  }
>>>
>>>  #else /* CONFIG_FAIR_GROUP_SCHED */