Date: Wed, 24 Jan 2018 09:25:36 +0100
From: Vincent Guittot
To: Peter Zijlstra, Morten Rasmussen
Cc: Ingo Molnar, linux-kernel, Brendan Jackman, Dietmar Eggemann, Morten Rasmussen
Subject: Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
Message-ID: <20180124082536.GA32318@linaro.org>
In-Reply-To: <20180118103807.GD28799@e105550-lin.cambridge.arm.com>
Hi,

On Thursday 18 Jan 2018 at 10:38:07 (+0000), Morten Rasmussen wrote:
> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
> > On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
> > > Hi Peter,
> > >
> > > On 22 December 2017 at 21:42, Peter Zijlstra wrote:
> > > > On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
> > > >> Right; but I figured we'd try and do it 'right' and see how horrible it
> > > >> is before we try and do funny things.
> > > >
> > > > So now it should have a 32ms tick for up to .5s when the system goes
> > > > completely idle.
> > > >
> > > > No idea how bad that is..
> > >
> > > I have tested your branch, but the timer doesn't seem to fire correctly
> > > because I can still see blocked load in the use case I have run.
> > > I haven't found the reason yet.
> >
> > Hi Peter,
> >
> > With the patch below on top of your branch, the blocked loads are updated
> > and decayed regularly. The main differences are:
> > - It doesn't use a timer to trigger the ilb but relies on the tick and on
> >   a cpu becoming idle. The main drawback of this solution is that the load
> >   stays blocked when the system is fully idle, with the advantage of not
> >   waking up a fully idle system. We have to wait for the next tick or a
> >   newly idle event to update the blocked load when the system leaves the
> >   idle state, which can take up to a tick. If this is too long, we can
> >   check whether to kick an ilb when a task wakes up, so the blocked load
> >   is updated as soon as the system leaves the idle state.
> >   The main advantage is that we don't wake up a fully idle system every
> >   32ms just to update blocked load that will not be used.
> > - I'm working on one more improvement: use nohz_idle_balance in the newly
> >   idle case when the system is not overloaded and
> >   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can
> >   try to run nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as
> >   it exceeds this_rq->avg_idle. This removes some calls to kick_ilb and
> >   some wake-ups of idle cpus.
>
> This sounds like what I meant in my other reply :-)
>
> It seems pointless to have a timer to update PELT if the system is
> completely idle, and when it isn't we can piggy back other events to
> make the updates happen.

The patch below implements what has been described above. It calls part of
nohz_idle_balance when a cpu becomes idle and kicks an ilb if that takes too
much time. This removes some of the ilbs that are kicked on an idle cpu just
to update the blocked load, but the ratio really depends on when the tick
happens relative to a cpu becoming idle and the 32ms boundary.

I have an additional patch that updates the blocked load when a cpu becomes
idle one period before kicking an ilb, and with it there are far fewer ilbs
because we give more chances to the newly idle case (time_after is replaced
by time_after_eq in idle_balance()); a sketch of that change follows the
patch below.

The patch also uses a function cfs_rq_has_blocked(), which only checks
load_avg/util_avg, instead of cfs_rq_is_decayed(), which checks the *_sum
too. This significantly reduces the number of blocked-load updates: the
*_avg are fully decayed in around 300-400ms, but it takes far longer for the
*_sum, which have a higher resolution, and we can easily reach almost
seconds. But only the *_avg are used to make decisions, so keeping some
blocked *_sum is acceptable.
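
As a rough back-of-the-envelope illustration of those decay times (my own
sketch, not part of the patch: it assumes the 32ms PELT half-life, a *_avg
resolution of 1024 and the unscaled *_sum resolution LOAD_AVG_MAX = 47742; a
weight-scaled load_sum can be much larger still, which is what pushes full
decay toward seconds):

#include <stdio.h>

/*
 * PELT halves a blocked contribution every 32ms, so a value of
 * magnitude M needs about log2(M) half-lives to decay to zero.
 */
static int ms_to_decay(unsigned long val)
{
        int ms = 0;

        while (val) {
                val >>= 1;      /* one 32ms half-life */
                ms += 32;
        }
        return ms;
}

int main(void)
{
        printf("*_avg (res. 1024):  ~%d ms\n", ms_to_decay(1024));  /* ~350ms */
        printf("*_sum (res. 47742): ~%d ms\n", ms_to_decay(47742)); /* ~510ms */
        return 0;
}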
---
 kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 92 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 898785d..ed90303 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	return true;
 }
 
+static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
+{
+	if (cfs_rq->avg.load_avg)
+		return true;
+
+	if (cfs_rq->avg.util_avg)
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 
 static void update_blocked_averages(int cpu)
@@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
 		 */
 		if (cfs_rq_is_decayed(cfs_rq))
 			list_del_leaf_cfs_rq(cfs_rq);
-		else
+
+		/* No need for periodic decay once load/util_avg are null */
+		if (cfs_rq_has_blocked(cfs_rq))
 			done = false;
 	}
 
@@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
 	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
 #ifdef CONFIG_NO_HZ_COMMON
 	rq->last_blocked_load_update_tick = jiffies;
-	if (cfs_rq_is_decayed(cfs_rq))
+	if (!cfs_rq_has_blocked(cfs_rq))
 		rq->has_blocked_load = 0;
 #endif
 	rq_unlock_irqrestore(rq, &rf);
@@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
 		*next_balance = next;
 }
 
+static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle);
 static void kick_ilb(unsigned int flags);
 
 /*
@@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 			update_next_balance(sd, &next_balance);
 		rcu_read_unlock();
 
-		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
+		/*
+		 * Update blocked idle load if it has not been done for a
+		 * while. Try to do it locally before entering idle, but kick
+		 * an ilb if it takes too much time and might delay the next
+		 * local wake up.
+		 */
+		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
+		    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 			kick_ilb(NOHZ_STATS_KICK);
 
 		goto out;
@@ -9237,6 +9258,7 @@ void nohz_balance_enter_idle(int cpu)
 	if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
 		return;
 
+	rq->has_blocked_load = 1;
 	if (rq->nohz_tick_stopped)
 		return;
 
@@ -9247,7 +9269,6 @@ void nohz_balance_enter_idle(int cpu)
 		return;
 
 	rq->nohz_tick_stopped = 1;
-	rq->has_blocked_load = 1;
 
 	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
 	atomic_inc(&nohz.nr_cpus);
@@ -9259,7 +9280,6 @@ void nohz_balance_enter_idle(int cpu)
 	 * enable the periodic update of the load of idle cpus
 	 */
 	atomic_set(&nohz.stats_state, 1);
-
 }
 #else
 static inline void nohz_balancer_kick(struct rq *rq) { }
@@ -9385,10 +9405,13 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 
 #ifdef CONFIG_NO_HZ_COMMON
 /*
- * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
- * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ * Internal function that runs load balance for all idle cpus. The load
+ * balance can be a simple update of blocked load or a complete load balance
+ * with task movement, depending on the flags.
+ * For newly idle mode, we abort the loop if it takes too much time and return
+ * false to notify that the loop has not been completed and an ilb should be kicked.
  */
-static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle)
 {
 	/* Earliest time when we have to do rebalance again */
 	unsigned long now = jiffies;
@@ -9396,24 +9419,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	bool has_blocked_load = false;
 	int update_next_balance = 0;
 	int this_cpu = this_rq->cpu;
-	unsigned int flags;
 	int balance_cpu;
+	int ret = false;
 	struct rq *rq;
-
-	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
-		return false;
-
-	if (idle != CPU_IDLE) {
-		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
-		return false;
-	}
-
-	/*
-	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
-	 */
-	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
-	if (!(flags & NOHZ_KICK_MASK))
-		return false;
+	u64 curr_cost = 0;
 
 	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
 
@@ -9428,6 +9437,10 @@
 	atomic_set(&nohz.stats_state, 0);
 
 	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
+		u64 t0, domain_cost;
+
+		t0 = sched_clock_cpu(this_cpu);
+
 		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
 			continue;
 
@@ -9438,7 +9451,17 @@
 		 */
 		if (need_resched()) {
 			has_blocked_load = true;
-			break;
+			goto abort;
+		}
+
+		/*
+		 * If the update is done while the CPU becomes idle, we abort
+		 * the update when its cost is higher than the average idle
+		 * time in order to not delay a possible wake up.
+		 */
+		if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
+			has_blocked_load = true;
+			goto abort;
 		}
 
 		rq = cpu_rq(balance_cpu);
@@ -9453,10 +9476,10 @@
 		if (time_after_eq(jiffies, rq->next_balance)) {
 			struct rq_flags rf;
 
-			rq_lock_irq(rq, &rf);
+			rq_lock_irqsave(rq, &rf);
 			update_rq_clock(rq);
 			cpu_load_update_idle(rq);
-			rq_unlock_irq(rq, &rf);
+			rq_unlock_irqrestore(rq, &rf);
 
 			if (flags & NOHZ_BALANCE_KICK)
 				rebalance_domains(rq, CPU_IDLE);
@@ -9466,10 +9489,17 @@
 			next_balance = rq->next_balance;
 			update_next_balance = 1;
 		}
+
+		domain_cost = sched_clock_cpu(this_cpu) - t0;
+		curr_cost += domain_cost;
+
 	}
 
-	update_blocked_averages(this_cpu);
-	has_blocked_load |= this_rq->has_blocked_load;
+	/* Newly idle CPU doesn't need an update */
+	if (idle != CPU_NEWLY_IDLE) {
+		update_blocked_averages(this_cpu);
+		has_blocked_load |= this_rq->has_blocked_load;
+	}
 
 	if (flags & NOHZ_BALANCE_KICK)
 		rebalance_domains(this_rq, CPU_IDLE);
@@ -9477,6 +9507,10 @@
 	WRITE_ONCE(nohz.next_stats,
 		   now + msecs_to_jiffies(LOAD_AVG_PERIOD));
 
+	/* The full idle balance loop has been done */
+	ret = true;
+
+abort:
 	/* There is still blocked load, enable periodic update */
 	if (has_blocked_load)
 		atomic_set(&nohz.stats_state, 1);
@@ -9489,6 +9523,35 @@
 	if (likely(update_next_balance))
 		nohz.next_balance = next_balance;
 
+	return ret;
+}
+
+/*
+ * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
+ * rebalancing for all the cpus for whom scheduler ticks are stopped.
+ */
+static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
+{
+	int this_cpu = this_rq->cpu;
+	unsigned int flags;
+
+	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+		return false;
+
+	if (idle != CPU_IDLE) {
+		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+		return false;
+	}
+
+	/*
+	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
+	 */
+	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
+	if (!(flags & NOHZ_KICK_MASK))
+		return false;
+
+	_nohz_idle_balance(this_rq, flags, idle);
+	return true;
 }
 #else
-- 
2.7.4
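
For illustration, here is a sketch (mine, not the actual additional patch
mentioned above) of the time_after() -> time_after_eq() change in
idle_balance(); the surrounding lines are taken from the idle_balance() hunk
above:

-		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
+		if (time_after_eq(jiffies, next) && atomic_read(&nohz.stats_state) &&
 		    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
 			kick_ilb(NOHZ_STATS_KICK);

The inclusive comparison lets the newly idle path win the race against the
tick-driven kick, so the local cpu gets a chance to update the blocked load
itself instead of waking an idle cpu for an ilb.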