Date: Wed, 25 Apr 2018 20:01:40 +0200
From: Peter Zijlstra
To: Subhra Mazumdar
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com, daniel.lezcano@linaro.org,
	steven.sistare@oracle.com, dhaval.giani@oracle.com, rohit.k.jain@oracle.com
Subject: Re: [PATCH 3/3] sched: limit cpu search and rotate search window for scalability
Message-ID: <20180425180140.GU4129@hirez.programming.kicks-ass.net>
References: <20180424004116.28151-1-subhra.mazumdar@oracle.com>
 <20180424004116.28151-4-subhra.mazumdar@oracle.com>
 <20180424125349.GU4082@hirez.programming.kicks-ass.net>
 <20180425153600.GA4043@hirez.programming.kicks-ass.net>
In-Reply-To: <20180425153600.GA4043@hirez.programming.kicks-ass.net>

On Wed, Apr 25, 2018 at 05:36:00PM +0200, Peter Zijlstra wrote:
> On Tue, Apr 24, 2018 at 05:10:34PM -0700, Subhra Mazumdar wrote:
> > On 04/24/2018 05:53 AM, Peter Zijlstra wrote:
> 
> > > Why do you need to put a max on? Why isn't the proportional thing
> > > working as is? (is the average no good because of big variance or what)
> 
> > Firstly the choosing of 512 seems arbitrary.
> 
> It is; it is a crud attempt to deal with big variance. The comment says
> as much.
> 
> > Secondly the logic here is that the enqueuing cpu should search up to
> > time it can get work itself. Why is that the optimal amount to
> > search?
> 
> 1/512-th of the time in fact, per the above random number, but yes.
> Because searching for longer than we're expecting to be idle for is
> clearly bad, at that point we're inhibiting doing useful work.
> 
> But while thinking about all this, I think I've spotted a few more
> issues, aside from the variance:
> 
> Firstly, while avg_idle estimates the average duration for _when_ we go
> idle, it doesn't give a good measure when we do not in fact go idle. So
> when we get wakeups while fully busy, avg_idle is a poor measure.
> 
> Secondly, the number of wakeups performed is also important. If we have
> a lot of wakeups, we need to look at aggregate wakeup time over a
> period. Not just single wakeup time.
> 
> And thirdly, we're sharing the idle duration with newidle balance.
> 
> And I think the 512 is a result of me not having recognised these
> additional issues when looking at the traces, I saw variance and left it
> there.
> 
> This leaves me thinking we need a better estimator for wakeups. Because
> if there really is significant idle time, not looking for idle CPUs to
> run on is bad. Placing that upper limit, especially such a low one, is
> just an indication of failure.
> 
> I'll see if I can come up with something.

Something like so _could_ work. Again, completely untested.

We give idle time to wake_avg, we subtract select_idle_sibling 'runtime'
from wake_avg, such that when there's lots of wakeups we don't use more
time than there was reported idle time.

And we age wake_avg, such that if there hasn't been idle for a number of
ticks (we've been real busy) we also stop scanning wide.

But it could eat your granny and set your cat on fire :-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e10aaeebfcc..bc910e5776cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1671,6 +1671,9 @@ static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
 		if (rq->avg_idle > max)
 			rq->avg_idle = max;
 
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle / 2;
+
 		rq->idle_stamp = 0;
 	}
 #endif
@@ -6072,6 +6075,8 @@ void __init sched_init(void)
 		rq->online = 0;
 		rq->idle_stamp = 0;
 		rq->avg_idle = 2*sysctl_sched_migration_cost;
+		rq->wake_stamp = jiffies;
+		rq->wake_avg = rq->avg_idle;
 		rq->max_idle_balance_cost = sysctl_sched_migration_cost;
 
 		INIT_LIST_HEAD(&rq->cfs_tasks);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..fee31dbe15ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6369,7 +6369,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *this_sd;
+	unsigned long now = jiffies;
 	u64 avg_cost, avg_idle;
+	struct rq *this_rq;
 	u64 time, cost;
 	s64 delta;
 	int cpu, nr = INT_MAX;
@@ -6378,11 +6380,18 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	if (!this_sd)
 		return -1;
 
-	/*
-	 * Due to large variance we need a large fuzz factor; hackbench in
-	 * particularly is sensitive here.
-	 */
-	avg_idle = this_rq()->avg_idle / 512;
+	this_rq = this_rq();
+	if (sched_feat(SIS_NEW)) {
+		/* age the remaining idle time */
+		if (unlikely(this_rq->wake_stamp < now)) {
+			while (this_rq->wake_stamp < now && this_rq->wake_avg)
+				this_rq->wake_avg >>= 1;
+		}
+		avg_idle = this_rq->wake_avg;
+	} else {
+		avg_idle = this_rq->avg_idle / 512;
+	}
+
 	avg_cost = this_sd->avg_scan_cost + 1;
 
 	if (sched_feat(SIS_AVG_CPU) && avg_idle < avg_cost)
@@ -6412,6 +6421,12 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
 	delta = (s64)(time - cost) / 8;
 	this_sd->avg_scan_cost += delta;
 
+	/* you can only spend the time once */
+	if (this_rq->wake_avg > time)
+		this_rq->wake_avg -= time;
+	else
+		this_rq->wake_avg = 0;
+
 	return cpu;
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 85ae8488039c..f5f86a59aac4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -57,6 +57,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_AVG_CPU, false)
 SCHED_FEAT(SIS_PROP, true)
+SCHED_FEAT(SIS_NEW, false)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 15750c222ca2..c4d6ddf907b5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -831,6 +831,9 @@ struct rq {
 	u64			idle_stamp;
 	u64			avg_idle;
 
+	unsigned long		wake_stamp;
+	u64			wake_avg;
+
 	/* This is used to determine avg_idle's max value */
 	u64			max_idle_balance_cost;
 #endif
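
For illustration only, here is a tiny standalone userspace sketch of the
bookkeeping the patch is going for: idle time funds a scan budget, each
wakeup scan spends from it, and the budget decays while we stay busy. The
struct and helper names are made up to mirror the patch, and the per-tick
halving is one reading of the aging described above; this is not the
kernel code itself.

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-in for the rq fields added by the patch. */
struct fake_rq {
	uint64_t wake_avg;		/* remaining scan budget, in ns */
	unsigned long wake_stamp;	/* tick at which the budget was last refilled */
};

/* On going idle: refill the budget with half the observed idle period. */
static void go_idle(struct fake_rq *rq, uint64_t idle_ns, unsigned long now)
{
	rq->wake_avg = idle_ns / 2;
	rq->wake_stamp = now;
}

/*
 * Age the budget: halve it once per tick that passed without a refill
 * (one reading of the aging above), then report what is left.
 */
static uint64_t scan_budget(struct fake_rq *rq, unsigned long now)
{
	while (rq->wake_stamp < now && rq->wake_avg) {
		rq->wake_stamp++;
		rq->wake_avg >>= 1;
	}
	return rq->wake_avg;
}

/* A wakeup scan spends 'cost' ns of the budget; never go negative. */
static void charge_scan(struct fake_rq *rq, uint64_t cost)
{
	if (rq->wake_avg > cost)
		rq->wake_avg -= cost;
	else
		rq->wake_avg = 0;
}

int main(void)
{
	struct fake_rq rq = { 0, 0 };
	unsigned long now = 0;

	go_idle(&rq, 1000000, now);	/* observed 1ms of idle time */
	printf("budget after idle:    %llu ns\n",
	       (unsigned long long)scan_budget(&rq, now));	/* 500000 */

	charge_scan(&rq, 150000);	/* two back-to-back wakeup scans */
	charge_scan(&rq, 150000);
	printf("budget after 2 scans: %llu ns\n",
	       (unsigned long long)scan_budget(&rq, now));	/* 200000 */

	now += 3;			/* three busy ticks, no new idle time */
	printf("budget after aging:   %llu ns\n",
	       (unsigned long long)scan_budget(&rq, now));	/* 25000 */

	return 0;
}

The point of the example is the asymmetry: many wakeups in one idle
period keep drawing down the same budget instead of each getting the full
avg_idle, and a stretch of busy ticks shrinks the budget toward zero so we
stop scanning wide.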