Subject: Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
To: Vincent Guittot, Peter Zijlstra, Morten Rasmussen
Cc: Ingo Molnar, linux-kernel, Brendan Jackman, Dietmar Eggemann, Morten Rasmussen
From: Valentin Schneider
Date: Mon, 29 Jan 2018 19:31:07 +0000

Hi Vincent, Peter,

I've been running some tests on your patches (Peter's base + the 2 from
Vincent). The results themselves are hosted at [1].

The base of those tests is the same: a task ("accumulator") is run for 5
seconds (arbitrary value) to accumulate some load, then goes to sleep for .5
seconds.

I've set up 3 test scenarios:

Update by nohz_balance_kick()
-----------------------------
Right before the "accumulator" task goes to sleep, a CPU-hogging task (100%
utilization) is spawned on another CPU. It won't go idle, so the only way to
update the blocked load generated by "accumulator" is to kick an ILB
(NOHZ_STATS_KICK).

The test shows that this is behaving nicely - we keep kicking an ILB every
~36ms (see the next test for comments on that) until there is no more blocked
load. I did however notice an interesting scenario: after the load has been
fully decayed, a tiny background task can spawn and end in less than a
scheduling period. However, it still goes through nohz_balance_enter_idle(),
and thus sets nohz.stats_state, which will later cause an ILB kick. This makes
me wonder whether it's worth kicking ILBs for such tiny load values - perhaps
it would be worth having a margin for setting rq->has_blocked_load?

Furthermore, this tiny task will cause the ILB to iterate over all of the idle
CPUs, although only one has stale load. For load updates via the NEWLY_IDLE
load_balance() we use:

static bool update_nohz_stats(struct rq *rq)
{
	if (!rq->has_blocked_load)
		return false;
	[...]
}

But for load updates via _nohz_idle_balance(), we iterate through all of the
nohz CPUs and unconditionally call update_blocked_averages(). This could be
avoided by remembering which CPUs have stale load before going idle. Initially
I thought that was what nohz.stats_state was for, but it isn't: with Vincent's
patches it's only ever set to either 0 or 1. We could however turn it into a
CPU mask and use it to skip nohz CPUs that don't have stale load in
_nohz_idle_balance() (when NOHZ_STATS_KICK) - a rough sketch of what I have in
mind follows the test descriptions below.

Update by idle_balance()
------------------------
Right before the "accumulator" task goes to sleep, a tiny periodic
(period=32ms) task is spawned on another CPU. It's expected that it will
update the blocked load in idle_balance(), either by running
_nohz_idle_balance() locally or by kicking an ILB (the overload flag shouldn't
be set in this test case, so we shouldn't go through the NEWLY_IDLE
load_balance()).

This also seems to be working fine, but I'm noticing a delay between load
updates that is closer to 64ms than 32ms. After digging into it I found out
that the time checks done in idle_balance() and nohz_balancer_kick() are
time_after(jiffies, next_stats), but IMHO they should be
time_after_eq(jiffies, next_stats) to get 32ms-based updates. This also
explains the ~36ms periodicity of the updates in the test above.

No update (idle system)
-----------------------
Nothing special here, just making sure nothing happens when the system is
fully idle. On a sidenote, that's relatively hard to achieve - I had to switch
over to Juno because my HiKey960 gets interrupts every 16ms. The Juno still
gets woken up every now and then, but it's a bit quieter.
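To make the cpumask idea above a bit more concrete, here's a rough, untested
sketch (nohz.stale_cpus_mask is a made-up name for a mask that would sit next
to the existing nohz fields, and this glosses over how the mask updates would
be serialized):

/* Hypothetical: mask of idle CPUs that still have blocked load to decay */

void nohz_balance_enter_idle(int cpu)
{
	/* ... existing checks ... */

	/* Remember that this CPU went idle with blocked load */
	cpumask_set_cpu(cpu, nohz.stale_cpus_mask);

	/* ... */
}

static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
			       enum cpu_idle_type idle)
{
	int balance_cpu;

	/* Only visit CPUs that actually have stale blocked load */
	for_each_cpu(balance_cpu, nohz.stale_cpus_mask) {
		struct rq *rq = cpu_rq(balance_cpu);

		if (balance_cpu == this_rq->cpu || !idle_cpu(balance_cpu))
			continue;

		update_blocked_averages(balance_cpu);

		/* Fully decayed - no need to visit this CPU again */
		if (!rq->has_blocked_load)
			cpumask_clear_cpu(balance_cpu, nohz.stale_cpus_mask);

		/* ... rebalance_domains()/next_balance handling as before ... */
	}

	/* ... */
}

That way _nohz_idle_balance() would only touch CPUs that have something left
to decay, and the tiny-task case above would only cost a visit to the CPU it
ran on.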
[1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0

On 01/24/2018 08:25 AM, Vincent Guittot wrote:
> Hi,
>
> On Thursday 18 Jan 2018 at 10:38:07 (+0000), Morten Rasmussen wrote:
>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>>> On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
>>>> Hi Peter,
>>>>
>>>> On 22 December 2017 at 21:42, Peter Zijlstra wrote:
>>>>> On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
>>>>>> Right; but I figured we'd try and do it 'right' and see how horrible it
>>>>>> is before we try and do funny things.
>>>>> So now it should have a 32ms tick for up to .5s when the system goes
>>>>> completely idle.
>>>>>
>>>>> No idea how bad that is..
>>>> I have tested your branch but the timer doesn't seem to fire correctly
>>>> because i can still see blocked load in the use case i have run.
>>>> I haven't found the reason yet
>>> Hi Peter,
>>>
>>> With the patch below on top of your branch, the blocked loads are updated and
>>> decayed regularly. The main differences are:
>>> - It doesn't use a timer to trig ilb but the tick and when a cpu becomes idle.
>>> The main drawback of this solution is that the load is blocked when the
>>> system is fully idle with the advantage of not waking up a fully idle
>>> system. We have to wait for the next tick or newly idle event for updating
>>> blocked load when the system leaves idle stat which can be up to a tick long.
>>> If this is too long, we can check for kicking ilb when task wakes up so the
>>> blocked load will be updated as soon as the system leaves idle state.
>>> The main advantage is that we don't wake up a fully idle system every 32ms to
>>> update blocked load that will be not used.
>>> - I'm working on one more improvement to use nohz_idle_balance in the newly
>>> idle case when the system is not overloaded and
>>> (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try to
>>> use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceed
>>> this_rq->avg_idle. This will remove some calls to kick_ilb and some wake up
>>> of an idle cpus.
>> This sound like what I meant in my other reply :-)
>>
>> It seems pointless to have a timer to update PELT if the system is
>> completely idle, and when it isn't we can piggy back other events to
>> make the updates happen.
> The patch below implements what has been described above. It calls part of
> nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too much
> time. This removes part of ilb that are kicked on an idle cpu for updating
> the blocked load but the ratio really depends on when the tick happens compared
> to a cpu becoming idle and the 32ms boundary. I have an additionnal patch that
> enables to update the blocked loads when a cpu becomes idle 1 period before
> kicking an ilb and there is far less ilb because we give more chance to the
> newly idle case (time_after is replaced by time_after_eq in idle_balance()).
>
> The patch also uses a function cfs_rq_has_blocked, which only checks the
> util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too. This
> reduce significantly the number of update of blocked load.
> the *_avg will be fully decayed in around 300~400ms but it's far longer for
> the *_sum which have a higher resolution and we can easily reach almost
> seconds. But only the *_avg are used to make decision so keeping some blocked
> *_sum is acceptable.
>
> ---
>  kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 92 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 898785d..ed90303 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  	return true;
>  }
>  
> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> +{
> +	if (cfs_rq->avg.load_avg)
> +		return true;
> +
> +	if (cfs_rq->avg.util_avg)
> +		return true;
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  
>  static void update_blocked_averages(int cpu)
> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>  		 */
>  		if (cfs_rq_is_decayed(cfs_rq))
>  			list_del_leaf_cfs_rq(cfs_rq);
> -		else
> +
> +		/* Don't need periodic decay once load/util_avg are null */
> +		if (cfs_rq_has_blocked(cfs_rq))
>  			done = false;
>  	}
>  
> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>  #ifdef CONFIG_NO_HZ_COMMON
>  	rq->last_blocked_load_update_tick = jiffies;
> -	if (cfs_rq_is_decayed(cfs_rq))
> +	if (cfs_rq_has_blocked(cfs_rq))
>  		rq->has_blocked_load = 0;
>  #endif
>  	rq_unlock_irqrestore(rq, &rf);
> @@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>  		*next_balance = next;
>  }
>  
> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle);
>  static void kick_ilb(unsigned int flags);
>  
>  /*
> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
> +		/*
> +		 * Update blocked idle load if it has not been done for a
> +		 * while. Try to do it locally before entering idle but kick a
> +		 * ilb if it takes too much time and might delay next local
> +		 * wake up
> +		 */
> +		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
> +		    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>  			kick_ilb(NOHZ_STATS_KICK);
>  
>  		goto out;
> @@ -9237,6 +9258,7 @@ void nohz_balance_enter_idle(int cpu)
>  	if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>  		return;
>  
> +	rq->has_blocked_load = 1;
>  	if (rq->nohz_tick_stopped)
>  		return;
>  
> @@ -9247,7 +9269,6 @@ void nohz_balance_enter_idle(int cpu)
>  		return;
>  
>  	rq->nohz_tick_stopped = 1;
> -	rq->has_blocked_load = 1;
>  
>  	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>  	atomic_inc(&nohz.nr_cpus);
> @@ -9259,7 +9280,6 @@ void nohz_balance_enter_idle(int cpu)
>  	 * enable the periodic update of the load of idle cpus
>  	 */
>  	atomic_set(&nohz.stats_state, 1);
> -
>  }
>  #else
>  static inline void nohz_balancer_kick(struct rq *rq) { }
> @@ -9385,10 +9405,13 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  
>  #ifdef CONFIG_NO_HZ_COMMON
>  /*
> - * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
> - * rebalancing for all the cpus for whom scheduler ticks are stopped.
> + * Internal function that runs load balance for all idle cpus. The load balance
> + * can be a simple update of blocked load or a complete load balance with
> + * tasks movement depending of flags.
> + * For newly idle mode, we abort the loop if it takes too much time and return
> + * false to notify that the loop has not be completed and a ilb shoud be kick.
>   */
> -static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle)
>  {
>  	/* Earliest time when we have to do rebalance again */
>  	unsigned long now = jiffies;
> @@ -9396,24 +9419,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	bool has_blocked_load = false;
>  	int update_next_balance = 0;
>  	int this_cpu = this_rq->cpu;
> -	unsigned int flags;
>  	int balance_cpu;
> +	int ret = false;
>  	struct rq *rq;
> -
> -	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> -		return false;
> -
> -	if (idle != CPU_IDLE) {
> -		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> -		return false;
> -	}
> -
> -	/*
> -	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
> -	 */
> -	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> -	if (!(flags & NOHZ_KICK_MASK))
> -		return false;
> +	u64 curr_cost = 0;
>  
>  	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>  
> @@ -9428,6 +9437,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	atomic_set(&nohz.stats_state, 0);
>  
>  	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> +		u64 t0, domain_cost;
> +
> +		t0 = sched_clock_cpu(this_cpu);
> +
>  		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>  			continue;
>  
> @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  		 */
>  		if (need_resched()) {
>  			has_blocked_load = true;
> -			break;
> +			goto abort;
> +		}
> +
> +		/*
> +		 * If the update is done while CPU becomes idle, we abort
> +		 * the update when its cost is higher than the average idle
> +		 * time in orde to not delay a possible wake up.
> +		 */
> +		if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
> +			has_blocked_load = true;
> +			goto abort;
>  		}
>  
>  		rq = cpu_rq(balance_cpu);
> @@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  		if (time_after_eq(jiffies, rq->next_balance)) {
>  			struct rq_flags rf;
>  
> -			rq_lock_irq(rq, &rf);
> +			rq_lock_irqsave(rq, &rf);
>  			update_rq_clock(rq);
>  			cpu_load_update_idle(rq);
> -			rq_unlock_irq(rq, &rf);
> +			rq_unlock_irqrestore(rq, &rf);
>  
>  			if (flags & NOHZ_BALANCE_KICK)
>  				rebalance_domains(rq, CPU_IDLE);
> @@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  			next_balance = rq->next_balance;
>  			update_next_balance = 1;
>  		}
> +
> +		domain_cost = sched_clock_cpu(this_cpu) - t0;
> +		curr_cost += domain_cost;
> +
>  	}
>  
> -	update_blocked_averages(this_cpu);
> -	has_blocked_load |= this_rq->has_blocked_load;
> +	/* Newly idle CPU doesn't need an update */
> +	if (idle != CPU_NEWLY_IDLE) {
> +		update_blocked_averages(this_cpu);
> +		has_blocked_load |= this_rq->has_blocked_load;
> +	}
>  
>  	if (flags & NOHZ_BALANCE_KICK)
>  		rebalance_domains(this_rq, CPU_IDLE);
> @@ -9477,6 +9507,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	WRITE_ONCE(nohz.next_stats,
>  		now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>  
> +	/* The full idle balance loop has been done */
> +	ret = true;
> +
> +abort:
>  	/* There is still blocked load, enable periodic update */
>  	if (has_blocked_load)
>  		atomic_set(&nohz.stats_state, 1);
> @@ -9489,6 +9523,35 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	if (likely(update_next_balance))
>  		nohz.next_balance = next_balance;
>  
> +	return ret;
> +}
> +
> +/*
> + * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
> + * rebalancing for all the cpus for whom scheduler ticks are stopped.
> + */
> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> +{
> +	int this_cpu = this_rq->cpu;
> +	unsigned int flags;
> +
> +	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> +		return false;
> +
> +	if (idle != CPU_IDLE) {
> +		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> +		return false;
> +	}
> +
> +	/*
> +	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
> +	 */
> +	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> +	if (!(flags & NOHZ_KICK_MASK))
> +		return false;
> +
> +	_nohz_idle_balance(this_rq, flags, idle);
> +
>  	return true;
>  }
>  #else