From: Vincent Guittot
Date: Tue, 30 Jan 2018 09:00:34 +0100
Subject: Re: [RFC PATCH 2/5] sched: Add NOHZ_STATS_KICK
To: Dietmar Eggemann
Cc: Peter Zijlstra, Morten Rasmussen, Ingo Molnar, linux-kernel,
    Brendan Jackman, Morten Rasmussen
X-Mailing-List: linux-kernel@vger.kernel.org

On 29 January 2018 at 19:43, Dietmar Eggemann wrote:
> On 01/24/2018 09:25 AM, Vincent Guittot wrote:
>>
>> Hi,
>>
>> Le Thursday 18 Jan 2018 à 10:38:07 (+0000), Morten Rasmussen a écrit :
>>>
>>> On Mon, Jan 15, 2018 at 09:26:09AM +0100, Vincent Guittot wrote:
>>>>
>>>> Le Wednesday 03 Jan 2018 à 10:16:00 (+0100), Vincent Guittot a écrit :
>
>
> [...]
>
>
>>>>
>>>> Hi Peter,
>>>>
>>>> With the patch below on top of your branch, the blocked loads are
>>>> updated and decayed regularly. The main differences are:
>>>> - It doesn't use a timer to trigger the ilb but relies on the tick and
>>>>   on a cpu becoming idle. The main drawback of this solution is that the
>>>>   load stays blocked when the system is fully idle, with the advantage
>>>>   of not waking up a fully idle system. We have to wait for the next
>>>>   tick or a newly-idle event to update the blocked load when the system
>>>>   leaves the idle state, which can be up to a tick long. If this is too
>>>>   long, we can check for kicking an ilb when a task wakes up so the
>>>>   blocked load will be updated as soon as the system leaves the idle
>>>>   state. The main advantage is that we don't wake up a fully idle
>>>>   system every 32ms to update blocked load that will not be used.
>>>> - I'm working on one more improvement to use nohz_idle_balance in the
>>>>   newly-idle case when the system is not overloaded and
>>>>   (this_rq->avg_idle > sysctl_sched_migration_cost). In this case, we
>>>>   can try to use nohz_idle_balance with NOHZ_STATS_KICK and abort as
>>>>   soon as it exceeds this_rq->avg_idle. This will remove some calls to
>>>>   kick_ilb and some wake-ups of idle cpus.
>>>
>>>
>>> This sounds like what I meant in my other reply :-)
>>>
>>> It seems pointless to have a timer to update PELT if the system is
>>> completely idle, and when it isn't we can piggy-back other events to
>>> make the updates happen.
>>
>>
>> The patch below implements what has been described above. It calls part of
>> nohz_idle_balance when a cpu becomes idle and kicks an ilb if that takes
>> too much time. This removes part of the ilbs that are kicked on an idle
>> cpu to update the blocked load, but the ratio really depends on when the
>> tick happens compared to a cpu becoming idle and the 32ms boundary. I have
>> an additional patch that updates the blocked loads when a cpu becomes idle
>> one period before kicking an ilb, and there are far fewer ilbs because we
>> give more chances to the newly-idle case (time_after is replaced by
>> time_after_eq in idle_balance()).
>>
>> The patch also uses a function cfs_rq_has_blocked, which only checks the
>> util/load_avg, instead of cfs_rq_is_decayed which checks the *_sum too.
>> This significantly reduces the number of updates of blocked load. The
>> *_avg will be fully decayed in around 300-400ms, but it takes far longer
>> for the *_sum, which have a higher resolution, and we can easily reach
>> almost a second. But only the *_avg are used to make decisions, so keeping
>> some blocked *_sum is acceptable.
>>
>> ---
>>  kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++-------------
>>  1 file changed, 92 insertions(+), 29 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 898785d..ed90303 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7356,6 +7356,17 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
>>         return true;
>>  }
>> +
>> +static inline bool cfs_rq_has_blocked(struct cfs_rq *cfs_rq)
>> +{
>> +       if (cfs_rq->avg.load_avg)
>> +               return true;
>> +
>> +       if (cfs_rq->avg.util_avg)
>> +               return true;
>> +
>> +       return false;
>> +}
>> +
>
>
> Can we not change cfs_rq_is_decayed() to use avg.foo_avg instead of
> avg.foo_sum ?

I don't think so, because the *_sum are used to keep coherency between the
cfs_rq and the groups when tasks migrate and are enqueued/dequeued, so we
can't remove it until the *_sum are null; otherwise the cfs_rq and the groups
will be out of sync.

>
>>  #ifdef CONFIG_FAIR_GROUP_SCHED
>>  static void update_blocked_averages(int cpu)
>> @@ -7393,7 +7404,9 @@ static void update_blocked_averages(int cpu)
>>                  */
>>                 if (cfs_rq_is_decayed(cfs_rq))
>>                         list_del_leaf_cfs_rq(cfs_rq);
>> -               else
>> +
>> +               /* Don't need periodic decay once load/util_avg are null */
>> +               if (cfs_rq_has_blocked(cfs_rq))
>>                         done = false;
>>         }
>> @@ -7463,7 +7476,7 @@ static inline void update_blocked_averages(int cpu)
>>         update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
>>  #ifdef CONFIG_NO_HZ_COMMON
>>         rq->last_blocked_load_update_tick = jiffies;
>> -       if (cfs_rq_is_decayed(cfs_rq))
>> +       if (cfs_rq_has_blocked(cfs_rq))
>
>
> Shouldn't this be !cfs_rq_has_blocked(cfs_rq) ?

Yes. I copy/pasted too quickly from the FAIR_GROUP_SCHED version of
update_blocked_averages() into the non-FAIR_GROUP_SCHED one.

>
>>                 rq->has_blocked_load = 0;
>>  #endif
>>         rq_unlock_irqrestore(rq, &rf);
>
>
> [...]
>
>
>> @@ -9438,7 +9451,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>                  */
>>                 if (need_resched()) {
>>                         has_blocked_load = true;
>> -                       break;
>> +                       goto abort;
>> +               }
>> +
>> +               /*
>> +                * If the update is done while the CPU becomes idle, we abort
>> +                * the update when its cost is higher than the average idle
>> +                * time in order to not delay a possible wake up.
>> +                */
>> +               if (idle == CPU_NEWLY_IDLE && this_rq->avg_idle < curr_cost) {
>> +                       has_blocked_load = true;
>> +                       goto abort;
>>                 }
>>                 rq = cpu_rq(balance_cpu);
>> @@ -9453,10 +9476,10 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>                 if (time_after_eq(jiffies, rq->next_balance)) {
>>                         struct rq_flags rf;
>> -                       rq_lock_irq(rq, &rf);
>> +                       rq_lock_irqsave(rq, &rf);
>>                         update_rq_clock(rq);
>>                         cpu_load_update_idle(rq);
>> -                       rq_unlock_irq(rq, &rf);
>> +                       rq_unlock_irqrestore(rq, &rf);
>>                         if (flags & NOHZ_BALANCE_KICK)
>>                                 rebalance_domains(rq, CPU_IDLE);
>> @@ -9466,10 +9489,17 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>                         next_balance = rq->next_balance;
>>                         update_next_balance = 1;
>>                 }
>
>
> Why do you do this cpu_load_update_idle(rq) even when this was called with
> CPU_NEWLY_IDLE? Wouldn't it be sufficient to jump to the curr_cost
> calculation in this case?
Just to keep things similar to what happens with kick_ilb, and it's also an
occasion to update the cpu_load.

>
>> +
>> +               domain_cost = sched_clock_cpu(this_cpu) - t0;
>> +               curr_cost += domain_cost;
>> +
>>         }
>> -       update_blocked_averages(this_cpu);
>> -       has_blocked_load |= this_rq->has_blocked_load;
>> +       /* Newly idle CPU doesn't need an update */
>> +       if (idle != CPU_NEWLY_IDLE) {
>> +               update_blocked_averages(this_cpu);
>> +               has_blocked_load |= this_rq->has_blocked_load;
>> +       }
>>         if (flags & NOHZ_BALANCE_KICK)
>>                 rebalance_domains(this_rq, CPU_IDLE);
>
>
> [...]
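
As a reference point for the cfs_rq_is_decayed() discussion above, here is
roughly what that helper checks in kernels of this vintage (a sketch from
memory, so the exact field list may differ slightly between versions): it only
reports a cfs_rq as fully decayed once the *_sum accumulators are null, which
is the condition relied on before removing the cfs_rq from the leaf list,
while the new cfs_rq_has_blocked() only looks at the faster-decaying *_avg.

static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
	/* Still has queued weight: clearly not decayed. */
	if (cfs_rq->load.weight)
		return false;

	/*
	 * The *_sum fields keep the full-resolution history that keeps
	 * migrating tasks coherent with their group hierarchy; they reach
	 * zero much later than the corresponding *_avg.
	 */
	if (cfs_rq->avg.load_sum)
		return false;

	if (cfs_rq->avg.util_sum)
		return false;

	/* (the in-tree helper also checks the runnable load sum) */
	return true;
}

Since the *_avg reach zero in roughly 300-400ms while the *_sum can take close
to a second, gating the periodic update on cfs_rq_has_blocked() alone avoids a
large share of the blocked-load updates without touching the leaf-list removal
logic.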