Received: by 2002:a05:6602:2086:0:0:0:0 with SMTP id a6csp4436612ioa; Wed, 27 Apr 2022 03:59:39 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyEjeXemgmw5rFy5MVMK6jxCPmN0gFPv7Ac5oZQHlcDKrGb0VxCoVpo1QKy7IDUzDCfyEhD X-Received: by 2002:a17:902:b613:b0:156:7d82:c09b with SMTP id b19-20020a170902b61300b001567d82c09bmr27881179pls.80.1651057179161; Wed, 27 Apr 2022 03:59:39 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1651057179; cv=none; d=google.com; s=arc-20160816; b=a20+7dakc6G8864Hx/7/nnu6mtf/rPWNXgGYAgfUQujqZZDH/VnveAeuKP5pMIsrIR L5VNLmQvP0a6s1Gy0xODsy+jXIzNTTJrq0w9SdD9sRZCFkiHRpAdajTOsB/tp15w4dOm fomL5UH8dX3D9fU1/m4YUDzou5doIQfKONG1DjFTSrQvcNsRaE3AgsCkni5dj3hWNkJc 0U5kvOeYm6Sq5ucpCC86QQgUoQdpuWxzO19f9oJTuAGZTD1S9GEYg65i9jbkiGA6yLRz dQcRgDmBcbgUvwIAIHdZSF1UCbLoe46D66c3t1Us7Dfu4zbUnOXZc3Qom58FAE/G4BB8 zMvw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=nOYTOypfiQ3w4AlL2zzjSfFLp05SmWU2IisDPhyrvgw=; b=tow3HIoYMX0Q3pUM4B8b5mA3xNpVF407Gf03Q4ciO5fC75QugmzNi1EcWxg7sKZftj v4sf2tMG9Kr9WIweBszFOPx1xltGRxnwzejpaFbm9kfJjKOenET8cZJCfB3RUME3ZO2G DSQlb+HZjrWcDebDgOln0ZQ1AmycS2kJzMSkLJHR8n1dJjNdrmNkSiOLOeDsfOE2SmcN jOuYR5kASZr6YoHkq0LXZAr12g9pSGSV5Up+9G/uY4t4h17TOseXYvJKUFSPtvOgAJv0 Sfqpqg64QfCqKwRk9M6wqYFz75IIjj8zVC0RyYnRezwhabnxiMW5rQy1EQf6PBCezt6z kkWw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [2620:137:e000::1:18]) by mx.google.com with ESMTPS id t3-20020a170902bc4300b00155f0855954si1139152plz.524.2022.04.27.03.59.38 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 27 Apr 2022 03:59:39 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) client-ip=2620:137:e000::1:18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 61AE32FCE2B; Wed, 27 Apr 2022 03:07:19 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1345193AbiDZKCv (ORCPT + 99 others); Tue, 26 Apr 2022 06:02:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58942 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229742AbiDZKCl (ORCPT ); Tue, 26 Apr 2022 06:02:41 -0400 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id A62EFC8 for ; Tue, 26 Apr 2022 02:21:53 -0700 (PDT) Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 614BF23A; Tue, 26 Apr 2022 02:21:53 -0700 (PDT) Received: from airbuntu (unknown [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id C91533F73B; Tue, 26 Apr 2022 02:21:51 -0700 (PDT) Date: Tue, 26 Apr 2022 10:21:42 +0100 From: Qais Yousef To: Xuewen Yan Cc: Xuewen Yan , dietmar.eggemann@arm.com, lukasz.luba@arm.com, rafael@kernel.org, viresh.kumar@linaro.org, mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org, rostedt@goodmis.org, linux-kernel@vger.kernel.org, di.shen@unisoc.com Subject: Re: [PATCH] sched: Take thermal pressure into account when determine rt fits capacity Message-ID: <20220426092142.lppfj5eqgt3d24nb@airbuntu> References: <20220407051932.4071-1-xuewen.yan@unisoc.com> <20220420135127.o7ttm5tddwvwrp2a@airbuntu> <20220421161509.asz25zmh25eurgrk@airbuntu> <20220425161209.ydugtrs3b7gyy3kk@airbuntu> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RDNS_NONE, SPF_HELO_NONE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 04/26/22 10:07, Xuewen Yan wrote: > On Tue, Apr 26, 2022 at 12:12 AM Qais Yousef wrote: > > > > On 04/25/22 09:31, Xuewen Yan wrote: > > > On Fri, Apr 22, 2022 at 12:15 AM Qais Yousef wrote: > > > > Is it okay to share what the capacities of the littles, mediums and bigs on > > > > your system? And how they change under worst case scenario thermal pressure? > > > > Only IF you have these numbers handy :-) > > > > > > Okay, the little/mid/big cpu scale capacity is 350/930/1024, but the > > > cpu frequency point is discrete, the big core's high freq point may is > > > just a few more than the mid core's highest. > > > In this case, once the thermal decrease the scaling_max_freq, the > > > maximum frequency of the large core is easily lower than that of the > > > medium core. > > > Of course, the corner case is due to the frequency design of the soc > > > and our thermal algorithm. > > > > Okay, thanks for the info! > > > > > > > > > > > > > Is it actually an indication of a potential other problem if you swing into > > > > capacity inversion in the bigs that often? I've seen a lot of systems where the > > > > difference between the meds and bigs is small. But frequent inversion could be > > > > suspicious still. > > > > > > > > Do the littles and the mediums experience any significant thermal pressure too? > > > > > > In our platform, it's not. > > > > Good. > > > > > > It doesn't seem it'll cause a significant error, but still it seems to me this > > > > function wants the original capacity passed to it. > > > > > > > > There are similar questions to be asked since you modify sg_cpu->max. Every > > > > user needs to be audited if they're fine with this change or not. > > > > > > > > I'm not sure still what we are achieving here. You want to force schedutil not > > > > to request higher frequencies if thermal pressure is high? Should schedutil > > > > actually care? Shouldn't the cpufreq driver reject this request and pick the > > > > next best thing if it can't satisfy it? I could be missing something, I haven't > > > > looked that hard tbh :-) > > > > > > I changed this just want to make it more responsive to the real > > > capacity of the cpu, if it will cause other problems, maybe it would > > > be better not to change it.:) > > > > There are others who can give you a better opinion. But AFAICS we're not fixing > > anything but risking breaking other things. So I vote for not to change it :) > > > > > > It depends on the severity of the problem. The simplest thing I can suggest is > > > > to check if the cpu is in capacity inversion state, and if it is, then make > > > > rt_task_fits_capacity() return false always. > > > > > > > > If we need a generic solution to handle thermal pressure omitting OPPs, then > > > > the search needs to become more complex. The proposal in this patch is not > > > > adequate because tasks that want to run at capacity_orig_of(cpu) will wrongly > > > > omit some cpus because of any tiny thermal pressure. For example if the > > > > capacity_orig_of(medium_cpu) = 700, and uclamp_min for RT is set to 700, then > > > > any small thermal pressure on mediums will cause these tasks to run on big cpus > > > > only, which is not what we want. Especially if these big cpus can end up in > > > > capacity inversion later ;-) > > > > > > > > So if we want to handle this case, then we need to ensure the search returns > > > > false only if > > > > > > > > 1. Thermal pressure results in real OPP to be omitted. > > > > 2. Another CPU that can provide this performance level is available. > > > > > > > > Otherwise we should still fit it on this CPU because it'll give us the closest > > > > thing to what was requested. > > > > > > > > I can think of 2 ways to implement this, but none of them seem particularly > > > > pretty :-/ > > > > > > Maybe as Lukasz Luba said: > > > > > > https://lore.kernel.org/all/ae98a861-8945-e630-8d4c-8112723d1007@arm.com/ > > > > > > > Let's meet in the middle: > > > > 1) use the thermal PELT signal in RT: > > > > capacity = capacity_orig_of(cpu) - thermal_load_avg(cpu_rq(cpu)) > > > > 2) introduce a more configurable thermal_pressure shifter instead > > > > 'sched_thermal_decay_shift', which would allow not only to make the > > > > decaying longer, but also shorter when the platform already might do > > > > that, to not cause too much traffic. > > > > > > But even if this is changed, there will still be the same problem, I > > > look forward to Lukasz's patch:) > > > > This will not address my concern unless I missed something. > > > > The best (simplest) way forward IMHO is to introduce a new function > > > > bool cpu_in_capacity_inversion(int cpu); > > > > (feel free to pick another name) which will detect the scenario you're in. You > > can use this function then in rt_task_fits_capacity() > > > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > > index a32c46889af8..d48811a7e956 100644 > > --- a/kernel/sched/rt.c > > +++ b/kernel/sched/rt.c > > @@ -462,6 +462,9 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu) > > if (!static_branch_unlikely(&sched_asym_cpucapacity)) > > return true; > > > > + if (cpu_in_capacity_inversion(cpu)) > > + return false; > > + > > min_cap = uclamp_eff_value(p, UCLAMP_MIN); > > max_cap = uclamp_eff_value(p, UCLAMP_MAX); > > > > You'll probably need to do something similar in dl_task_fits_capacity(). > > > > This might be a bit aggressive though as we'll steer away all RT tasks from > > this CPU (as long as there's another CPU that can fit it). I need to think more > > about it. But we could do something like this too > > > > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c > > index a32c46889af8..f2a34946a7ab 100644 > > --- a/kernel/sched/rt.c > > +++ b/kernel/sched/rt.c > > @@ -462,11 +462,14 @@ static inline bool rt_task_fits_capacity(struct task_struct *p, int cpu) > > if (!static_branch_unlikely(&sched_asym_cpucapacity)) > > return true; > > > > + cpu_cap = capacity_orig_of(cpu); > > + > > + if (cpu_in_capacity_inversion(cpu)) > > It's a good idea, but as you said, in mainline, the > sysctl_sched_uclamp_util_min_rt_default is always 1024, > Maybe it's better to add it to the judgment? I don't think so. If we want to handle finding the next best thing, we need to make the search more complex than that. This is no worse than having 2 RT tasks waking up at the same time while there's only a single big CPU. One of them will end up on a medium or a little and we don't provide better guarantees here. Basically we need to improve our fallback mechanism to try to pick the next biggest cpu. Which is something we can do. We just need to be wary not to increase the wake up latency by making our search more expensive. > > + if (sysctl_sched_uclamp_util_min_rt_default != > SCHED_CAPACITY_SCALE && cpu_in_capacity_inversion(cpu)) > > > + cpu_cap -= thermal_load_avg(cpu_rq(cpu)); > > Why use thermal_load_avg? If thermal is always in effect,the > thermal_load_avg would get bigger and bigger, as a result, the cpu_cap > maybe smaller than (capacity_orig - thermal_pressure). I just copied the suggestion from Lukasz to be honest without thinking much about it. Thanks -- Qais Yousef