Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
From: Vincent Guittot
Date: Tue, 25 Apr 2017 11:05:05 +0200
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
    Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
References: <20170424201344.GA14169@wtj.duckdns.org>
            <20170424201444.GC14169@wtj.duckdns.org>

On 25 April 2017 at 10:46, Vincent Guittot wrote:
> On 24 April 2017 at 22:14, Tejun Heo wrote:
>> We noticed that with the cgroup CPU controller in use, the scheduling
>> latency gets wonky regardless of nesting level or weight
>> configuration.  This is easily reproducible with Chris Mason's
>> schbench[1].
>>
>> All tests are run on a single-socket, 16-core, 32-thread machine.
>> While the machine is mostly idle, it isn't completely; there's always
>> some variable management load going on.  The command used is
>>
>>   schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>
>> which measures scheduling latency with 32 threads, each of which
>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>> representative result when running from the root cgroup.
>>
>>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>   Latency percentiles (usec)
>>           50.0000th: 26
>>           75.0000th: 62
>>           90.0000th: 74
>>           95.0000th: 86
>>           *99.0000th: 887
>>           99.5000th: 3692
>>           99.9000th: 10832
>>           min=0, max=13374
>>
>> The following is from inside a first-level CPU cgroup with the maximum
>> weight.
>>
>>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>   Latency percentiles (usec)
>>           50.0000th: 31
>>           75.0000th: 65
>>           90.0000th: 71
>>           95.0000th: 91
>>           *99.0000th: 7288
>>           99.5000th: 10352
>>           99.9000th: 12496
>>           min=0, max=13023
>>
>> Note the drastic increase in p99 scheduling latency.  After
>> investigation, it turned out that update_sd_lb_stats(), which is
>> used by load_balance() to pick the most loaded group, was often
>> picking the wrong group.  A CPU which had one schbench thread running
>> and another queued wouldn't report the correspondingly higher
>> weighted_cpuload() and would get overlooked as the target of load
>> balancing.
>>
>> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is the
>> sum of the load_avg of all queued sched_entities.  Without cgroups, or
>> at the root cgroup, each task's load_avg contributes directly to the
>> sum.  When a task wakes up or goes to sleep, the change is immediately
>> reflected in runnable_load_avg, which in turn affects load balancing.
>>
>> However, when the CPU cgroup is in use, a nesting cfs_rq blocks this
>> immediate reflection.  When a task wakes up inside a cgroup, the
>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>> propagated to the parent.  It only affects the matching sched_entity's
>> load_avg over time, which then gets propagated to the runnable_load_avg
>> at that level.  This makes weighted_cpuload() often temporarily out of
>> sync, leading to the suboptimal behavior of load_balance() and the
>> increase in scheduling latencies shown above.
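To picture the lag described above, here is a minimal, self-contained toy model.
It is not kernel code: the names toy_se, toy_cfs_rq and toy_enqueue_* are
invented, and the PELT averaging is reduced to a plain sum.  It only shows why a
wake-up at the root is visible to weighted_cpuload() at once, while a wake-up
inside a nested group is not until the group entity's load_avg catches up.

  #include <stdio.h>

  struct toy_se {                 /* stand-in for a queued sched_entity */
          long load_avg;
  };

  struct toy_cfs_rq {             /* stand-in for a cfs_rq */
          long runnable_load_avg;
  };

  /* Enqueue directly on the root: the contribution is visible immediately. */
  static void toy_enqueue_root(struct toy_cfs_rq *root, const struct toy_se *se)
  {
          root->runnable_load_avg += se->load_avg;
  }

  /*
   * Enqueue inside a group: only the group cfs_rq changes right away.  The
   * group's own sched_entity load_avg (and therefore the root sum seen by
   * weighted_cpuload()) only converges toward the new value over time.
   */
  static void toy_enqueue_group(struct toy_cfs_rq *group, const struct toy_se *se)
  {
          group->runnable_load_avg += se->load_avg;
  }

  int main(void)
  {
          struct toy_cfs_rq root = { 0 }, group = { 0 };
          struct toy_se task = { .load_avg = 1024 };

          toy_enqueue_root(&root, &task);
          printf("root cgroup:  root runnable_load_avg = %ld (seen at once)\n",
                 root.runnable_load_avg);

          root.runnable_load_avg = 0;
          toy_enqueue_group(&group, &task);
          printf("nested group: group = %ld, root = %ld (lags behind)\n",
                 group.runnable_load_avg, root.runnable_load_avg);
          return 0;
  }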
>>
>> This patch fixes the issue by updating propagate_entity_load_avg() to
>> always propagate to the parent's runnable_load_avg.  Combined with the
>> previous patch, this keeps a cfs_rq's runnable_load_avg always equal to
>> the sum of the scaled loads of all tasks queued below it, removing the
>> artifacts from nesting cfs_rqs.  The following is from inside three
>> levels of nesting with the patch applied.
>
> So you are changing the purpose of propagate_entity_load_avg(), which
> aims to propagate load_avg/util_avg changes only when a task migrates,
> and you also want to propagate the enqueue/dequeue into the parent
> cfs_rq->runnable_load_avg.

In fact you want the sched_entity load_avg to reflect
cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg.

>
>>
>>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>   Latency percentiles (usec)
>>           50.0000th: 40
>>           75.0000th: 71
>>           90.0000th: 89
>>           95.0000th: 108
>>           *99.0000th: 679
>>           99.5000th: 3500
>>           99.9000th: 10960
>>           min=0, max=13790
>>
>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>
>> Signed-off-by: Tejun Heo
>> Cc: Vincent Guittot
>> Cc: Ingo Molnar
>> Cc: Peter Zijlstra
>> Cc: Mike Galbraith
>> Cc: Paul Turner
>> Cc: Chris Mason
>> ---
>>  kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq
>>
>>  /* Take into account change of load of a child task group */
>>  static inline void
>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> +                   bool propagate_avg)
>>  {
>>          struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>          long load = 0, delta;
>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq
>>          se->avg.load_avg = load;
>>          se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>
>> -        /* Update parent cfs_rq load */
>> -        add_positive(&cfs_rq->avg.load_avg, delta);
>> -        cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> +        if (propagate_avg) {
>> +                /* Update parent cfs_rq load */
>> +                add_positive(&cfs_rq->avg.load_avg, delta);
>> +                cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> +        }
>>
>>          /*
>>           * If the sched_entity is already enqueued, we also have to update the
>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>  /* Update task and its cfs_rq load average */
>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>  {
>> -        struct cfs_rq *cfs_rq;
>> +        struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +        bool propagate_avg;
>>
>>          if (entity_is_task(se))
>>                  return 0;
>>
>> -        if (!test_and_clear_tg_cfs_propagate(se))
>> -                return 0;
>> -
>> -        cfs_rq = cfs_rq_of(se);
>> +        propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>
>> -        set_tg_cfs_propagate(cfs_rq);
>> +        /*
>> +         * We want to keep @cfs_rq->runnable_load_avg always in sync so
>> +         * that the load balancer can accurately determine the busiest CPU
>> +         * regardless of cfs_rq nesting.
>> +         */
>> +        update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>
>> -        update_tg_cfs_util(cfs_rq, se);
>> -        update_tg_cfs_load(cfs_rq, se);
>> +        if (propagate_avg) {
>> +                set_tg_cfs_propagate(cfs_rq);
>> +                update_tg_cfs_util(cfs_rq, se);
>> +        }
>>
>> -        return 1;
>> +        return propagate_avg;
>>  }
>>
>>  #else /* CONFIG_FAIR_GROUP_SCHED */
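For easier reading, here is propagate_entity_load_avg() as it would look with the
hunks above applied, reassembled from the quoted diff (so any transcription slip
is mine, not the patch's); the short trailing comments are my annotations, not
part of the patch.

  /* Update task and its cfs_rq load average */
  static inline int propagate_entity_load_avg(struct sched_entity *se)
  {
          struct cfs_rq *cfs_rq = cfs_rq_of(se);
          bool propagate_avg;

          if (entity_is_task(se))
                  return 0;

          /* pending load_avg/util_avg delta from a child migration? */
          propagate_avg = test_and_clear_tg_cfs_propagate(se);

          /*
           * We want to keep @cfs_rq->runnable_load_avg always in sync so
           * that the load balancer can accurately determine the busiest CPU
           * regardless of cfs_rq nesting.
           */
          update_tg_cfs_load(cfs_rq, se, propagate_avg); /* now unconditional */

          if (propagate_avg) {
                  set_tg_cfs_propagate(cfs_rq);
                  update_tg_cfs_util(cfs_rq, se);        /* still migration-only */
          }

          return propagate_avg;
  }

The net effect is that update_tg_cfs_load() now runs on every propagation pass to
keep runnable_load_avg in sync, while avg.load_avg/util_avg propagation stays
gated on the migration flag, which is exactly the behaviour change being
discussed above.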