Subject: Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
To: Vincent Guittot, Peter Zijlstra, Morten Rasmussen
Cc: Ingo Molnar, linux-kernel, Brendan Jackman, Dietmar Eggemann, Morten Rasmussen
From: Valentin Schneider
Date: Mon, 29 Jan 2018 19:31:07 +0000

Hi Vincent, Peter,

I've been running some tests on your patches (Peter's base + the 2 from
Vincent). The results themselves are hosted at [1].

The base of those tests is the same: a task ("accumulator") is run for 5
seconds (arbitrary value) to accumulate some load, then goes to sleep for .5
seconds.

I've set up 3 test scenarios:

Update by nohz_balance_kick()
-----------------------------
Right before the "accumulator" task goes to sleep, a CPU-hogging task (100%
utilization) is spawned on another CPU. It won't go idle, so the only way to
update the blocked load generated by "accumulator" is to kick an ILB
(NOHZ_STATS_KICK).

The test shows that this is behaving nicely - we keep kicking an ILB every
~36ms (see the next test for comments on that) until there is no more blocked
load. I did however notice an interesting scenario: after the load has been
fully decayed, a tiny background task can spawn and end in less than a
scheduling period. However, it still goes through nohz_balance_enter_idle(),
and thus sets nohz.stats_state, which will later cause an ILB kick. This makes
me wonder whether it's worth kicking ILBs for such tiny load values - perhaps
it would be worth having a margin for setting rq->has_blocked_load?

Furthermore, this tiny task will cause the ILB to iterate over all of the idle
CPUs, although only one has stale load. For load updates via the NEWLY_IDLE
load_balance() we use:

static bool update_nohz_stats(struct rq *rq)
{
	if (!rq->has_blocked_load)
		return false;
	[...]
}

But for load updates via _nohz_idle_balance(), we iterate through all of the
nohz CPUs and unconditionally call update_blocked_averages(). This could be
avoided by remembering which CPUs have stale load before going idle. Initially
I thought that was what nohz.stats_state was for, but it isn't: with Vincent's
patches it's only ever set to either 0 or 1. We could however turn it into a
CPU mask and use it to skip nohz CPUs that don't have stale load in
_nohz_idle_balance() (when NOHZ_STATS_KICK) - a rough sketch of what I have in
mind follows the test descriptions below.

Update by idle_balance()
------------------------
Right before the "accumulator" task goes to sleep, a tiny periodic
(period=32ms) task is spawned on another CPU. It's expected that it will
update the blocked load in idle_balance(), either by running
_nohz_idle_balance() locally or by kicking an ILB (the overload flag shouldn't
be set in this test case, so we shouldn't go through the NEWLY_IDLE
load_balance()).

This also seems to be working fine, but I'm noticing a delay between load
updates that is closer to 64ms than 32ms. After digging into it I found out
that the time checks done in idle_balance() and nohz_balancer_kick() are
time_after(jiffies, next_stats), but IMHO they should be
time_after_eq(jiffies, next_stats) to get 32ms-based updates. This also
explains the ~36ms periodicity of the updates in the test above.

No update (idle system)
-----------------------
Nothing special here, just making sure nothing happens when the system is
fully idle. On a sidenote, that's relatively hard to achieve - I had to switch
over to Juno because my HiKey960 gets interrupts every 16ms. The Juno still
gets woken up every now and then, but it's a bit quieter.
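To make the cpumask idea above a bit more concrete, here's a rough, untested
sketch (nohz.stale_cpus_mask is a made-up name for a mask that would sit next
to the existing nohz fields, and this glosses over how the mask updates would
be serialized):

/* Hypothetical: mask of idle CPUs that still have blocked load to decay */

void nohz_balance_enter_idle(int cpu)
{
	/* ... existing checks ... */

	/* Remember that this CPU went idle with blocked load */
	cpumask_set_cpu(cpu, nohz.stale_cpus_mask);

	/* ... */
}

static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags,
			       enum cpu_idle_type idle)
{
	int balance_cpu;

	/* Only visit CPUs that actually have stale blocked load */
	for_each_cpu(balance_cpu, nohz.stale_cpus_mask) {
		struct rq *rq = cpu_rq(balance_cpu);

		if (balance_cpu == this_rq->cpu || !idle_cpu(balance_cpu))
			continue;

		update_blocked_averages(balance_cpu);

		/* Fully decayed - no need to visit this CPU again */
		if (!rq->has_blocked_load)
			cpumask_clear_cpu(balance_cpu, nohz.stale_cpus_mask);

		/* ... rebalance_domains()/next_balance handling as before ... */
	}

	/* ... */
}

That way _nohz_idle_balance() would only touch CPUs that have something left
to decay, and the tiny-task case above would only cost a visit to the CPU it
ran on.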
[1]: https://gist.github.com/valschneider/a8da7bb8e11fb1ec63a419710f56c0a0

On 01/24/2018 08:25 AM, Vincent Guittot wrote:
> Hi,
>
> On Thursday 18 Jan 2018 at 10:38:07 (+0000), Morten Rasmussen wrote:
>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>>> On Wednesday 03 Jan 2018 at 10:16:00 (+0100), Vincent Guittot wrote:
>>>> Hi Peter,
>>>>
>>>> On 22 December 2017 at 21:42, Peter Zijlstra wrote:
>>>>> On Fri, Dec 22, 2017 at 07:56:29PM +0100, Peter Zijlstra wrote:
>>>>>> Right; but I figured we'd try and do it 'right' and see how horrible it
>>>>>> is before we try and do funny things.
>>>>> So now it should have a 32ms tick for up to .5s when the system goes
>>>>> completely idle.
>>>>>
>>>>> No idea how bad that is..
>>>> I have tested your branch but the timer doesn't seem to fire correctly
>>>> because i can still see blocked load in the use case i have run.
>>>> I haven't found the reason yet
>>> Hi Peter,
>>>
>>> With the patch below on top of your branch, the blocked loads are updated and
>>> decayed regularly. The main differences are:
>>> - It doesn't use a timer to trig ilb but the tick and when a cpu becomes idle.
>>> The main drawback of this solution is that the load is blocked when the
>>> system is fully idle with the advantage of not waking up a fully idle
>>> system. We have to wait for the next tick or newly idle event for updating
>>> blocked load when the system leaves idle stat which can be up to a tick long.
>>> If this is too long, we can check for kicking ilb when task wakes up so the
>>> blocked load will be updated as soon as the system leaves idle state.
>>> The main advantage is that we don't wake up a fully idle system every 32ms to
>>> update blocked load that will be not used.
>>> - I'm working on one more improvement to use nohz_idle_balance in the newly
>>> idle case when the system is not overloaded and
>>> (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we can try to
>>> use nohz_idle_balance with NOHZ_STATS_KICK and abort as soon as it exceed
>>> this_rq->avg_idle. This will remove some calls to kick_ilb and some wake up
>>> of an idle cpus.
>> This sound like what I meant in my other reply :-)
>>
>> It seems pointless to have a timer to update PELT if the system is
>> completely idle, and when it isn't we can piggy back other events to
>> make the updates happen.
> The patch below implements what has been described above. It calls part of
> nohz_idle_balance when a cpu becomes idle and kick a ilb if it takes too much
> time. This removes part of ilb that are kicked on an idle cpu for updating
> the blocked load but the ratio really depends on when the tick happens compared
> to a cpu becoming idle and the 32ms boundary. I have an additionnal patch that
> enables to update the blocked loads when a cpu becomes idle 1 period before
> kicking an ilb and there is far less ilb because we give more chance to the
> newly idle case (time_after is replaced by time_after_eq in idle_balance()).
>
> The patch also uses a function cfs_rq_has_blocked, which only checks the
> util/load_avg, instead of the cfs_rq_is_decayed which check *_sum too. This
> reduce significantly the number of update of blocked load.
> the *_avg will be fully decayed in around 300~400ms but it's far longer for
> the *_sum which have a higher resolution and we can easily reach almost
> seconds. But only the *_avg are used to make decision so keeping some blocked
> *_sum is acceptable.
>
> ---
>  kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 92 insertions(+), 29 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 898785d..ed90303 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>  	return true;
>  }
>  
> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
> +{
> +	if (cfs_rq->avg.load_avg)
> +		return true;
> +
> +	if (cfs_rq->avg.util_avg)
> +		return true;
> +
> +	return false;
> +}
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>  
>  static void update_blocked_averages(int cpu)
> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>  		 */
>  		if (cfs_rq_is_decayed(cfs_rq))
>  			list_del_leaf_cfs_rq(cfs_rq);
> -		else
> +
> +		/* Don't need periodic decay once load/util_avg are null */
> +		if (cfs_rq_has_blocked(cfs_rq))
>  			done = false;
>  	}
>  
> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>  	update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>  #ifdef CONFIG_NO_HZ_COMMON
>  	rq->last_blocked_load_update_tick = jiffies;
> -	if (cfs_rq_is_decayed(cfs_rq))
> +	if (cfs_rq_has_blocked(cfs_rq))
>  		rq->has_blocked_load = 0;
>  #endif
>  	rq_unlock_irqrestore(rq, &rf);
> @@ -8818,6 +8831,7 @@ update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
>  		*next_balance = next;
>  }
>  
> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle);
>  static void kick_ilb(unsigned int flags);
>  
>  /*
> @@ -8861,7 +8875,14 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			update_next_balance(sd, &next_balance);
>  		rcu_read_unlock();
>  
> -		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state))
> +		/*
> +		 * Update blocked idle load if it has not been done for a
> +		 * while. Try to do it locally before entering idle but kick a
> +		 * ilb if it takes too much time and might delay next local
> +		 * wake up
> +		 */
> +		if (time_after(jiffies, next) && atomic_read(&nohz.stats_state) &&
> +		    !_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE))
>  			kick_ilb(NOHZ_STATS_KICK);
>  
>  		goto out;
> @@ -9237,6 +9258,7 @@ void nohz_balance_enter_idle(int cpu)
>  	if (!housekeeping_cpu(cpu, HK_FLAG_SCHED))
>  		return;
>  
> +	rq->has_blocked_load = 1;
>  	if (rq->nohz_tick_stopped)
>  		return;
>  
> @@ -9247,7 +9269,6 @@ void nohz_balance_enter_idle(int cpu)
>  		return;
>  
>  	rq->nohz_tick_stopped = 1;
> -	rq->has_blocked_load = 1;
>  
>  	cpumask_set_cpu(cpu, nohz.idle_cpus_mask);
>  	atomic_inc(&nohz.nr_cpus);
> @@ -9259,7 +9280,6 @@ void nohz_balance_enter_idle(int cpu)
>  	 * enable the periodic update of the load of idle cpus
>  	 */
>  	atomic_set(&nohz.stats_state, 1);
> -
>  }
>  #else
>  static inline void nohz_balancer_kick(struct rq *rq) { }
> @@ -9385,10 +9405,13 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  
>  #ifdef CONFIG_NO_HZ_COMMON
>  /*
> - * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
> - * rebalancing for all the cpus for whom scheduler ticks are stopped.
> + * Internal function that runs load balance for all idle cpus. The load balance
> + * can be a simple update of blocked load or a complete load balance with
> + * tasks movement depending of flags.
> + * For newly idle mode, we abort the loop if it takes too much time and return
> + * false to notify that the loop has not be completed and a ilb shoud be kick.
>   */
> -static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> +static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, enum cpu_idle_type idle)
>  {
>  	/* Earliest time when we have to do rebalance again */
>  	unsigned long now = jiffies;
> @@ -9396,24 +9419,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	bool has_blocked_load = false;
>  	int update_next_balance = 0;
>  	int this_cpu = this_rq->cpu;
> -	unsigned int flags;
>  	int balance_cpu;
> +	int ret = false;
>  	struct rq *rq;
> -
> -	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> -		return false;
> -
> -	if (idle != CPU_IDLE) {
> -		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> -		return false;
> -	}
> -
> -	/*
> -	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
> -	 */
> -	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> -	if (!(flags & NOHZ_KICK_MASK))
> -		return false;
> +	u64 curr_cost = 0;
>  
>  	SCHED_WARN_ON((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
>  
> @@ -9428,6 +9437,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	atomic_set(&nohz.stats_state, 0);
>  
>  	for_each_cpu(balance_cpu, nohz.idle_cpus_mask) {
> +		u64 t0, domain_cost;
> +
> +		t0 = sched_clock_cpu(this_cpu);
> +
>  		if (balance_cpu == this_cpu || !idle_cpu(balance_cpu))
>  			continue;
>  
> @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  		 */
>  		if (need_resched()) {
>  			has_blocked_load = true;
> -			break;
> +			goto abort;
> +		}
> +
> +		/*
> +		 * If the update is done while CPU becomes idle, we abort
> +		 * the update when its cost is higher than the average idle
> +		 * time in orde to not delay a possible wake up.
> +		 */
> +		if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
> +			has_blocked_load = true;
> +			goto abort;
>  		}
>  
>  		rq = cpu_rq(balance_cpu);
> @@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  		if (time_after_eq(jiffies, rq->next_balance)) {
>  			struct rq_flags rf;
>  
> -			rq_lock_irq(rq, &rf);
> +			rq_lock_irqsave(rq, &rf);
>  			update_rq_clock(rq);
>  			cpu_load_update_idle(rq);
> -			rq_unlock_irq(rq, &rf);
> +			rq_unlock_irqrestore(rq, &rf);
>  
>  			if (flags & NOHZ_BALANCE_KICK)
>  				rebalance_domains(rq, CPU_IDLE);
> @@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  			next_balance = rq->next_balance;
>  			update_next_balance = 1;
>  		}
> +
> +		domain_cost = sched_clock_cpu(this_cpu) - t0;
> +		curr_cost += domain_cost;
> +
>  	}
>  
> -	update_blocked_averages(this_cpu);
> -	has_blocked_load |= this_rq->has_blocked_load;
> +	/* Newly idle CPU doesn't need an update */
> +	if (idle != CPU_NEWLY_IDLE) {
> +		update_blocked_averages(this_cpu);
> +		has_blocked_load |= this_rq->has_blocked_load;
> +	}
>  
>  	if (flags & NOHZ_BALANCE_KICK)
>  		rebalance_domains(this_rq, CPU_IDLE);
> @@ -9477,6 +9507,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	WRITE_ONCE(nohz.next_stats,
>  		now + msecs_to_jiffies(LOAD_AVG_PERIOD));
>  
> +	/* The full idle balance loop has been done */
> +	ret = true;
> +
> +abort:
>  	/* There is still blocked load, enable periodic update */
>  	if (has_blocked_load)
>  		atomic_set(&nohz.stats_state, 1);
> @@ -9489,6 +9523,35 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	if (likely(update_next_balance))
>  		nohz.next_balance = next_balance;
>  
> +	return ret;
> +}
> +
> +/*
> + * In CONFIG_NO_HZ_COMMON case, the idle balance kickee will do the
> + * rebalancing for all the cpus for whom scheduler ticks are stopped.
> + */
> +static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> +{
> +	int this_cpu = this_rq->cpu;
> +	unsigned int flags;
> +
> +	if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
> +		return false;
> +
> +	if (idle != CPU_IDLE) {
> +		atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> +		return false;
> +	}
> +
> +	/*
> +	 * barrier, pairs with nohz_balance_enter_idle(), ensures ...
> +	 */
> +	flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> +	if (!(flags & NOHZ_KICK_MASK))
> +		return false;
> +
> +	_nohz_idle_balance(this_rq, flags, idle);
> +
>  	return true;
>  }
>  #else