Subject: Re: [PATCH v2 1/2] sched/fair: Take thermal pressure into account while estimating energy
To: Vincent Guittot
Cc: linux-kernel, "open list:THERMAL", Peter Zijlstra, "Rafael J.
Wysocki", Viresh Kumar, Quentin Perret, Dietmar Eggemann, Vincent Donnefort, Beata Michalska, Ingo Molnar, Juri Lelli, Steven Rostedt, segall@google.com, Mel Gorman, Daniel Bristot de Oliveira
References: <20210604080954.13915-1-lukasz.luba@arm.com> <20210604080954.13915-2-lukasz.luba@arm.com> <8f4156a7-46ca-361d-bcb7-1cbdc860ef37@arm.com>
From: Lukasz Luba
Message-ID: <488e4982-cca4-a655-b527-76f69bd37069@arm.com>
Date: Thu, 10 Jun 2021 10:52:21 +0100
X-Mailing-List: linux-kernel@vger.kernel.org

On 6/10/21 10:41 AM, Vincent Guittot wrote:
> On Thu, 10 Jun 2021 at 11:36, Lukasz Luba wrote:
>>
>> On 6/10/21 10:11 AM, Vincent Guittot wrote:
>>> On Thu, 10 Jun 2021 at 10:42, Lukasz Luba wrote:
>>>>
>>>> On 6/10/21 8:59 AM, Vincent Guittot wrote:
>>>>> On Fri, 4 Jun 2021 at 10:10, Lukasz Luba wrote:
>>>>>>
>>>>>> Energy Aware Scheduling (EAS) needs to be able to predict the frequency
>>>>>> requests made by the SchedUtil governor to properly estimate energy used
>>>>>> in the future. It has to take into account CPUs utilization and forecast
>>>>>> Performance Domain (PD) frequency. There is a corner case when the max
>>>>>> allowed frequency might be reduced due to thermal. SchedUtil is aware of
>>>>>> that reduced frequency, so it should be taken into account also in EAS
>>>>>> estimations.
>>>>>>
>>>>>> SchedUtil, as a CPUFreq governor, knows the maximum allowed frequency of
>>>>>> a CPU, thanks to cpufreq_driver_resolve_freq() and internal clamping
>>>>>> to 'policy::max'. SchedUtil is responsible to respect that upper limit
>>>>>> while setting the frequency through CPUFreq drivers. This effective
>>>>>> frequency is stored internally in 'sugov_policy::next_freq' and EAS has
>>>>>> to predict that value.
>>>>>>
>>>>>> In the existing code the raw value of arch_scale_cpu_capacity() is used
>>>>>> for clamping the returned CPU utilization from effective_cpu_util().
>>>>>> This patch fixes issue with too big single CPU utilization, by introducing
>>>>>> clamping to the allowed CPU capacity. The allowed CPU capacity is a CPU
>>>>>> capacity reduced by thermal pressure signal. We rely on this load avg
>>>>>> geometric series in similar way as other mechanisms in the scheduler.
>>>>>>
>>>>>> Thanks to knowledge about allowed CPU capacity, we don't get too big value
>>>>>> for a single CPU utilization, which is then added to the util sum. The
>>>>>> util sum is used as a source of information for estimating whole PD energy.
>>>>>> To avoid wrong energy estimation in EAS (due to capped frequency), make
>>>>>> sure that the calculation of util sum is aware of allowed CPU capacity.
>>>>>>
>>>>>> Signed-off-by: Lukasz Luba
>>>>>> ---
>>>>>>  kernel/sched/fair.c | 17 ++++++++++++++---
>>>>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>>> index 161b92aa1c79..1aeddecabc20 100644
>>>>>> --- a/kernel/sched/fair.c
>>>>>> +++ b/kernel/sched/fair.c
>>>>>> @@ -6527,6 +6527,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>>>>>         struct cpumask *pd_mask = perf_domain_span(pd);
>>>>>>         unsigned long cpu_cap = arch_scale_cpu_capacity(cpumask_first(pd_mask));
>>>>>>         unsigned long max_util = 0, sum_util = 0;
>>>>>> +       unsigned long _cpu_cap = cpu_cap;
>>>>>>         int cpu;
>>>>>>
>>>>>>         /*
>>>>>> @@ -6558,14 +6559,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>>>>>>                                 cpu_util_next(cpu, p, -1) + task_util_est(p);
>>>>>>                 }
>>>>>>
>>>>>> +               /*
>>>>>> +                * Take the thermal pressure from non-idle CPUs. They have
>>>>>> +                * most up-to-date information. For idle CPUs thermal pressure
>>>>>> +                * signal is not updated so often.
>>>>>
>>>>> What do you mean by "not updated so often" ? Do you have a value ?
>>>>>
>>>>> Thermal pressure is updated at the same rate as other PELT values of
>>>>> an idle CPU. Why is it a problem there ?
>>>>>
>>>>
>>>> For idle CPU the value is updated 'remotely' by some other CPU
>>>> running nohz_idle_balance(). That goes into
>>>> update_blocked_averages() if the flags and checks are OK inside
>>>> update_nohz_stats(). Sometimes this is not called
>>>> because other_have_blocked() returned false. It can happen for a long
>>>
>>> So i miss that you were in a loop and the below was called for each
>>> cpu and _cpu_cap was overwritten
>>>
>>> +               if (!idle_cpu(cpu))
>>> +                       _cpu_cap = cpu_cap - thermal_load_avg(cpu_rq(cpu));
>>>
>>> But that also means that if the 1st cpus of the pd are idle, they will
>>> use original capacity whereas the other ones will remove the thermal
>>> pressure. Isn't this a problem ? You don't use the same capacity for
>>> all cpus in the performance domain regarding the thermal pressure?
>>
>> True, but in the experiments for idle CPUs I haven't
>> observed that they still have some big util (bigger than _cpu_cap).
>> It decayed already, so it's not a problem for idle CPUs.
>
> But it's a problem because there is a random behavior : some idle cpu
> will use original capacity whereas others will use the capped value
> set by non idle CPUs. You must have consistent behavior across all
> idle cpus.
>
> Then, if it's not a problem why adding the if (!idle_cpu(cpu))

To capture the signal value from a running CPU, which I then pass into
em_cpu_energy() in patch 2/2.

My apologies for the confusion; this can be just a local variable for
patch 1/2. I can create _cpu_cap as a local variable inside this loop,
just for this patch. Then in patch 2/2 I will remove it and define it
above the loop, so that it is available for the call to em_cpu_energy().
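
For readers following the thread, the ordering concern raised above can be illustrated with a small user-space toy model. This is only a sketch and not kernel code: struct toy_cpu, sum_util_carry(), sum_util_fixed() and all numbers are hypothetical, and the min() clamp is assumed from the commit message rather than taken from the full patch. The first helper mimics a loop where the capped capacity is refreshed only by non-idle CPUs and carried over to the following ones; the second applies one capped capacity to every CPU in the PD.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-CPU scheduler state; not a kernel type. */
struct toy_cpu {
	bool idle;
	unsigned long util;    /* CPU utilization */
	unsigned long thermal; /* thermal pressure seen by this CPU */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Assumed model of the loop under discussion: 'cap' starts at the raw
 * capacity and is overwritten only when a non-idle CPU is visited, so
 * the cap applied to an idle CPU depends on its position in the PD.
 */
unsigned long sum_util_carry(const struct toy_cpu *cpus, int n,
			     unsigned long cpu_cap)
{
	unsigned long cap = cpu_cap, sum = 0;

	for (int i = 0; i < n; i++) {
		if (!cpus[i].idle) {
			assert(cpus[i].thermal <= cpu_cap); /* no underflow */
			cap = cpu_cap - cpus[i].thermal;
		}
		sum += min_ul(cpus[i].util, cap);
	}
	return sum;
}

/* Alternative: one capped capacity, computed once, used for every CPU. */
unsigned long sum_util_fixed(const struct toy_cpu *cpus, int n,
			     unsigned long cpu_cap, unsigned long pd_thermal)
{
	unsigned long cap, sum = 0;

	assert(pd_thermal <= cpu_cap); /* no underflow */
	cap = cpu_cap - pd_thermal;

	for (int i = 0; i < n; i++)
		sum += min_ul(cpus[i].util, cap);
	return sum;
}

/* Same two CPUs, visited in a different order (made-up values). */
const struct toy_cpu idle_first[] = {
	{ .idle = true,  .util = 900,  .thermal = 224 },
	{ .idle = false, .util = 1000, .thermal = 224 },
};
const struct toy_cpu busy_first[] = {
	{ .idle = false, .util = 1000, .thermal = 224 },
	{ .idle = true,  .util = 900,  .thermal = 224 },
};
```

With a raw capacity of 1024, sum_util_carry() gives 1700 for idle_first and 1600 for busy_first, while sum_util_fixed() with a PD-wide pressure of 224 gives 1600 for both. The toy idle CPU deliberately keeps a large utilization to make the effect visible; as noted in the thread, in the experiments the idle CPUs' utilization had already decayed below the capped capacity.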