Subject: Re: [PATCH 1/1] sched/eas: introduce system-wide overutil indicator
From: Dietmar Eggemann
To: YT Chang, Peter Zijlstra, Matthias Brugger
Cc: wsd_upstream@mediatek.com, linux-kernel@vger.kernel.org,
    linux-arm-kernel@lists.infradead.org, linux-mediatek@lists.infradead.org
Date: Mon, 23 Sep 2019 10:05:25 +0200
Message-ID: <17c5f3bf-b739-b041-c71a-3d568be6f46d@arm.com>
In-Reply-To: <1568877622-28073-1-git-send-email-yt.chang@mediatek.com>
References: <1568877622-28073-1-git-send-email-yt.chang@mediatek.com>

On 9/19/19 9:20 AM, YT Chang wrote:
> When the system is overutilization, the load-balance crossing
> clusters will be triggered and scheduler will not use energy
> aware scheduling to choose CPUs.

We're currently transitioning from traditional big.LITTLE (where the CPUs
of one cluster, all having the same CPU (original) capacity, represent one
DIE Sched Domain (SD) level Sched Group (SG)) to DynamIQ systems. The
latter can have CPUs with different CPU (original) capacity in one
cluster.

In Linux mainline, today's DynamIQ systems have only 1 cluster, i.e. 1 MC
SD level SG. For those systems the current approach is much more
applicable. Or do you apply the out-of-tree Phantom Domain concept, which
creates n (n=2 or 3 ((huge,) big, little)) DIE SGs on your 1-cluster
DynamIQ system?

> The overutilization means the loading of ANY CPUs
> exceeds threshold (80%).
>
> However, only 1 heavy task or while-1 program will run on highest
> capacity CPUs and it still result to trigger overutilization. So
> the system will not use Energy Aware scheduling.

The patch header of commit 2802bf3cd936 ("sched/fair: Add
over-utilization/tipping point indicator") explains why the current
approach is defined so conservatively.

> To avoid it, a system-wide over-utilization indicator to trigger
> load-balance cross clusters.
>
> The policy is:
> 	The loading of "ALL CPUs in the highest capacity"
> 		exceeds threshold(80%) or
> 	The loading of "Any CPUs not in the highest capacity"
> 		exceed threshold(80%)

Up to v2 of the Energy Aware Scheduling patch set in 2018, we experimented
with a per-SD overutilized (tipping point) indicator from Thara Gopinath
(Linaro), already mentioned by Vincent, but we couldn't find any advantage
in using it over the one you now find in mainline.

https://lore.kernel.org/r/20180406153607.17815-4-dietmar.eggemann@arm.com

Maybe you can have a look at this patch and see whether it gives you an
advantage with your use cases and system topology layout?

The 'system-wide' in the name of the patch is misleading. The current
approach is also system-wide: the overutilized information already lives
on the root domain ('system' here stands for root domain). What you change
is the detection mechanism, from per-CPU to a mixed-mode detection
(per-CPU and per-SG).
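
For anyone following along, the per-CPU check behind the "80%" wording
above reduces to a simple capacity-margin comparison. Below is an
illustrative, self-contained userspace sketch of that arithmetic (it
assumes capacity_margin = 1280, i.e. 1024/1280 ~ 80%, as in this kernel
generation); it paraphrases the idea and is not the kernel source:

    #include <stdbool.h>
    #include <stdio.h>

    #define CAPACITY_MARGIN 1280  /* assumed kernel capacity_margin: ~80% */

    /* Per-CPU test: utilization exceeds ~80% of the CPU's capacity. */
    static bool cpu_overutilized(unsigned long util, unsigned long capacity)
    {
            return capacity * 1024 < util * CAPACITY_MARGIN;
    }

    int main(void)
    {
            /* LITTLE CPU, capacity 512, util 430 (~84%) -> overutilized */
            printf("%d\n", cpu_overutilized(430, 512));
            /* big CPU, capacity 1024, same util 430 (~42%) -> not overutilized */
            printf("%d\n", cpu_overutilized(430, 1024));
            return 0;
    }

It also shows why a single while-1 task on the biggest CPU trips the
per-CPU check there (util ~ capacity), which is the behaviour the patch
tries to avoid.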

> Signed-off-by: YT Chang
> ---
>  kernel/sched/fair.c | 76 +++++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 65 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 036be95..f4c3d70 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5182,10 +5182,71 @@ static inline bool cpu_overutilized(int cpu)
>  static inline void update_overutilized_status(struct rq *rq)
>  {
>  	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu)) {
> -		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> -		trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> +		if (capacity_orig_of(cpu_of(rq)) < rq->rd->max_cpu_capacity) {
> +			WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
> +			trace_sched_overutilized_tp(rq->rd, SG_OVERUTILIZED);
> +		}
>  	}
>  }
> +
> +static
> +void update_system_overutilized(struct sched_domain *sd, struct cpumask *cpus)
> +{
> +	unsigned long group_util;
> +	bool intra_overutil = false;
> +	unsigned long max_capacity;
> +	struct sched_group *group = sd->groups;
> +	struct root_domain *rd;
> +	int this_cpu;
> +	bool overutilized;
> +	int i;
> +
> +	this_cpu = smp_processor_id();
> +	rd = cpu_rq(this_cpu)->rd;
> +	overutilized = READ_ONCE(rd->overutilized);
> +	max_capacity = rd->max_cpu_capacity;
> +
> +	do {
> +		group_util = 0;
> +		for_each_cpu_and(i, sched_group_span(group), cpus) {
> +			group_util += cpu_util(i);
> +			if (cpu_overutilized(i)) {
> +				if (capacity_orig_of(i) < max_capacity) {
> +					intra_overutil = true;
> +					break;
> +				}
> +			}
> +		}
> +
> +		/*
> +		 * A capacity base hint for over-utilization.
> +		 * Not to trigger system overutiled if heavy tasks
> +		 * in Big.cluster, so
> +		 * add the free room(20%) of Big.cluster is impacted which means
> +		 * system-wide over-utilization,
> +		 * that considers whole cluster not single cpu
> +		 */
> +		if (group->group_weight > 1 && (group->sgc->capacity * 1024 <
> +						group_util * capacity_margin)) {

Why 'group->group_weight > 1'? Do you have some out-of-tree code which
lets SGs with 1 CPU survive?

[...]
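
To make the per-SG condition above easier to parse: with capacity_margin =
1280 it fires once the summed utilization of the group exceeds ~80% of the
group's capacity, regardless of what any individual CPU does. A short
userspace sketch of that arithmetic, mirroring the earlier one (again an
assumption-labelled paraphrase, not the patch itself):

    #include <stdbool.h>
    #include <stdio.h>

    #define CAPACITY_MARGIN 1280  /* assumed kernel capacity_margin: ~80% */

    /* Per-SG test from the patch: group util exceeds ~80% of group capacity. */
    static bool group_overutilized(unsigned long group_capacity,
                                   unsigned long group_util)
    {
            return group_capacity * 1024 < group_util * CAPACITY_MARGIN;
    }

    int main(void)
    {
            /*
             * 4 big CPUs, each of capacity 1024. A single while-1 task pins
             * one CPU at ~1024 util; the group is only ~25% utilized, so the
             * group-level check does not fire (unlike the per-CPU check).
             */
            printf("%d\n", group_overutilized(4 * 1024, 1024));
            /* All four CPUs around 88% busy: the group-level check fires. */
            printf("%d\n", group_overutilized(4 * 1024, 4 * 900));
            return 0;
    }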