From: Francisco Jerez
To: "Rafael J. Wysocki"
Cc: "Rafael J. Wysocki", "Rafael J.
Wysocki", Linux PM, Linux Documentation, LKML, Peter Zijlstra, Srinivas Pandruvada, Giovanni Gherdovich, Doug Smythies
Subject: Re: [PATCH] cpufreq: intel_pstate: Implement passive mode with HWP enabled
References: <3955470.QvD6XneCf3@kreacher> <87r1tdiqpu.fsf@riseup.net> <87imeoihqs.fsf@riseup.net>
Date: Thu, 16 Jul 2020 17:21:53 -0700
Message-ID: <875zanhty6.fsf@riseup.net>
X-Mailing-List: linux-kernel@vger.kernel.org

"Rafael J. Wysocki" writes:

> On Wed, Jul 15, 2020 at 11:35 PM Francisco Jerez wrote:
>>
>> "Rafael J. Wysocki" writes:
>>
>> > On Wed, Jul 15, 2020 at 2:09 AM Francisco Jerez wrote:
>> >>
>> >> "Rafael J. Wysocki" writes:
>> >>
>> >> > From: Rafael J. Wysocki
>> >> >
>> >> > Allow intel_pstate to work in the passive mode with HWP enabled and
>> >> > make it set the HWP minimum performance limit (HWP floor) to the
>> >> > P-state value given by the target frequency supplied by the cpufreq
>> >> > governor, so as to prevent the HWP algorithm and the CPU scheduler
>> >> > from working against each other, at least when the schedutil governor
>> >> > is in use, and update the intel_pstate documentation accordingly.
>> >> >
>> >> > Among other things, this allows utilization clamps to be taken
>> >> > into account, at least to a certain extent, when intel_pstate is
>> >> > in use and makes it more likely that sufficient capacity for
>> >> > deadline tasks will be provided.
>> >> >
>> >> > After this change, the resulting behavior of an HWP system with
>> >> > intel_pstate in the passive mode should be close to the behavior
>> >> > of the analogous non-HWP system with intel_pstate in the passive
>> >> > mode, except that in the frequency range below the base frequency
>> >> > (ie. the frequency returned by the base_frequency cpufreq attribute
>> >> > in sysfs on HWP systems) the HWP algorithm is allowed to go above
>> >> > the floor P-state set by intel_pstate with or without hardware
>> >> > coordination of P-states among CPUs in the same package.
>> >> >
>> >> > Also note that the setting of the HWP floor may not be taken into
>> >> > account by the processor in the following cases:
>> >> >
>> >> >  * For the HWP floor in the range of P-states above the base
>> >> >    frequency, referred to as the turbo range, the processor has a
>> >> >    license to choose any P-state from that range, either below or
>> >> >    above the HWP floor, just like a non-HWP processor in the case
>> >> >    when the target P-state falls into the turbo range.
>> >> >
>> >> >  * If P-states of the CPUs in the same package are coordinated
>> >> >    at the hardware level, the processor may choose a P-state
>> >> >    above the HWP floor, just like a non-HWP processor in the
>> >> >    analogous case.
>> >> >
>> >> > With this change applied, intel_pstate in the passive mode
>> >> > assumes complete control over the HWP request MSR and concurrent
>> >> > changes of that MSR (eg. via the direct MSR access interface) are
>> >> > overridden by it.
>> >> >
>> >> > Signed-off-by: Rafael J. Wysocki
>> >> > ---
>> >> >
>> >> > This basically unifies the passive mode behavior of intel_pstate for systems
>> >> > with and without HWP enabled.  The only case in which there is a difference
>> >> > between the two (after this patch) is below the turbo range, where the HWP
>> >> > algorithm can go above the floor regardless of whether or not P-states are
>> >> > coordinated package-wide (this means the systems with per-core P-states
>> >> > mostly is where the difference can be somewhat visible).
>> >> >
>> >> > Since the passive mode hasn't worked with HWP at all, and it is not going to
>> >> > be the default for HWP systems anyway, I don't see any drawbacks related to
>> >> > making this change, so I would consider this as 5.9 material unless there are
>> >> > any serious objections.
>> >> >
>> >> > Thanks!
>> >> >
>> >> > ---
>> >> >  Documentation/admin-guide/pm/intel_pstate.rst |   89 +++++++---------
>> >> >  drivers/cpufreq/intel_pstate.c                |  141 ++++++++++++++++++++------
>> >> >  2 files changed, 152 insertions(+), 78 deletions(-)
>> >> >
>> >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c
>> >> > ===================================================================
>> >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c
>> >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c
>> >> > @@ -36,6 +36,7 @@
>> >> >  #define INTEL_PSTATE_SAMPLING_INTERVAL	(10 * NSEC_PER_MSEC)
>> >> >
>> >> >  #define INTEL_CPUFREQ_TRANSITION_LATENCY	20000
>> >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP	5000
>> >> >  #define INTEL_CPUFREQ_TRANSITION_DELAY	500
>> >> >
>> >> >  #ifdef CONFIG_ACPI
>> >> > @@ -222,6 +223,7 @@ struct global_params {
>> >> >   *			preference/bias
>> >> >   * @epp_saved:		Saved EPP/EPB during system suspend or CPU offline
>> >> >   *			operation
>> >> > + * @epp_cached		Cached HWP energy-performance preference value
>> >> >   * @hwp_req_cached:	Cached value of the last HWP Request MSR
>> >> >   * @hwp_cap_cached:	Cached value of the last HWP Capabilities MSR
>> >> >   * @last_io_update:	Last time when IO wake flag was set
>> >> > @@ -259,6 +261,7 @@ struct cpudata {
>> >> >  	s16 epp_policy;
>> >> >  	s16 epp_default;
>> >> >  	s16 epp_saved;
>> >> > +	s16 epp_cached;
>> >> >  	u64 hwp_req_cached;
>> >> >  	u64 hwp_cap_cached;
>> >> >  	u64 last_io_update;
>> >> > @@ -676,6 +679,8 @@ static int intel_pstate_set_energy_pref_
>> >> >
>> >> >  		value |= (u64)epp << 24;
>> >> >  		ret = wrmsrl_on_cpu(cpu_data->cpu, MSR_HWP_REQUEST, value);
>> >> > +
>> >> > +		WRITE_ONCE(cpu_data->epp_cached, epp);
>> >>
>> >> Why introduce a new EPP cache variable if there is already
>> >> hwp_req_cached?  If intel_pstate_set_energy_pref_index() is failing to
>> >> update hwp_req_cached maybe we should fix that instead.  That will save
>> >> you a little bit of work in intel_cpufreq_adjust_hwp().
>> >
>> > Yes, it would, but then we'd need to explicitly synchronize
>> > intel_pstate_set_energy_pref_index() with the scheduler context which
>> > I'd rather avoid.
>> >
>>
>> How does using a differently named variable save you from doing that?
>
> It is a separate variable.
>
> The only updater of epp_cached, except for the initialization, is
> intel_pstate_set_energy_pref_index() and it cannot race with another
> instance of itself, so there are no concurrent writes to epp_cached.
>
> In the passive mode the only updater of hwp_req_cached, except for the
> initialization, is intel_cpufreq_adjust_hwp() (or there is a bug in
> the patch that I have missed) and it cannot race with another instance
> of itself for the same CPU, so there are no concurrent writes to
> hwp_req_cached.
>
> If intel_pstate_set_energy_pref_index() updated hwp_req_cached
> directly, however, it might be updated in two places concurrently and
> so explicit synchronization would be necessary.
>

That's fair, but we may need to add such synchronization anyway due to
the bug I pointed out above, so it might be simpler to avoid introducing
additional state and simply stick to hwp_req_cached with proper
synchronization.

>> And won't the EPP setting programmed by intel_pstate_set_energy_pref_index()
>> be lost if intel_pstate_hwp_boost_up() or some other user of
>> hwp_req_cached is executed afterwards with the current approach?
>
> The value written to the register by it may be overwritten by a
> concurrent intel_cpufreq_adjust_hwp(), but that is not a problem,
> because next time intel_cpufreq_adjust_hwp() runs for the target CPU,
> it will pick up the updated epp_cached value which will be written to
> the register.

However intel_cpufreq_adjust_hwp() may never be executed afterwards if
intel_pstate is in active mode, in which case the overwritten value may
remain there forever potentially.

> So there may be a short time window after the
> intel_pstate_set_energy_pref_index() invocation in which the new EPP
> value may not be in effect, but in general there is no guarantee that
> the new EPP will take effect immediately after updating the MSR
> anyway, so that race doesn't matter.
>
> That said, that race is avoidable, but I was thinking that trying to
> avoid it might not be worth it.  Now I see a better way to avoid it,
> though, so I'm going to update the patch to that end.
>
>> Seems like a bug to me.
>
> It is racy, but not every race is a bug.
>

Still seems like there is a bug in intel_pstate_set_energy_pref_index()
AFAICT.

>> >> >  	} else {
>> >> >  		if (epp == -EINVAL)
>> >> >  			epp = (pref_index - 1) << 2;
>> >> > @@ -2047,6 +2052,7 @@ static int intel_pstate_init_cpu(unsigne
>> >> >  		cpu->epp_default = -EINVAL;
>> >> >  		cpu->epp_powersave = -EINVAL;
>> >> >  		cpu->epp_saved = -EINVAL;
>> >> > +		WRITE_ONCE(cpu->epp_cached, -EINVAL);
>> >> >  	}
>> >> >
>> >> >  	cpu = all_cpu_data[cpunum];
>> >> > @@ -2245,7 +2251,10 @@ static int intel_pstate_verify_policy(st
>> >> >
>> >> >  static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy)
>> >> >  {
>> >> > -	intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> >> > +	if (hwp_active)
>> >> > +		intel_pstate_hwp_force_min_perf(policy->cpu);
>> >> > +	else
>> >> > +		intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]);
>> >> >  }
>> >> >
>> >> >  static void intel_pstate_stop_cpu(struct cpufreq_policy *policy)
>> >> > @@ -2253,12 +2262,10 @@ static void intel_pstate_stop_cpu(struct
>> >> >  	pr_debug("CPU %d exiting\n", policy->cpu);
>> >> >
>> >> >  	intel_pstate_clear_update_util_hook(policy->cpu);
>> >> > -	if (hwp_active) {
>> >> > +	if (hwp_active)
>> >> >  		intel_pstate_hwp_save_state(policy);
>> >> > -		intel_pstate_hwp_force_min_perf(policy->cpu);
>> >> > -	} else {
>> >> > -		intel_cpufreq_stop_cpu(policy);
>> >> > -	}
>> >> > +
>> >> > +	intel_cpufreq_stop_cpu(policy);
>> >> >  }
>> >> >
>> >> >  static int intel_pstate_cpu_exit(struct cpufreq_policy *policy)
>> >> > @@ -2388,13 +2395,82 @@ static void intel_cpufreq_trace(struct c
>> >> >  		fp_toint(cpu->iowait_boost * 100));
>> >> >  }
>> >> >
>> >> > +static void intel_cpufreq_adjust_hwp(struct cpudata *cpu, u32 target_pstate,
>> >> > +				     bool fast_switch)
>> >> > +{
>> >> > +	u64 prev = READ_ONCE(cpu->hwp_req_cached), value = prev;
>> >> > +	s16 epp;
>> >> > +
>> >> > +	value &= ~HWP_MIN_PERF(~0L);
>> >> > +	value |= HWP_MIN_PERF(target_pstate);
>> >> > +
>> >> > +	/*
>> >> > +	 * The entire MSR needs to be updated in order to update the HWP min
>> >> > +	 * field in it, so opportunistically update the max too if needed.
>> >> > +	 */
>> >> > +	value &= ~HWP_MAX_PERF(~0L);
>> >> > +	value |= HWP_MAX_PERF(cpu->max_perf_ratio);
>> >> > +
>> >> > +	/*
>> >> > +	 * In case the EPP has been adjusted via sysfs, write the last cached
>> >> > +	 * value of it to the MSR as well.
>> >> > +	 */
>> >> > +	epp = READ_ONCE(cpu->epp_cached);
>> >> > +	if (epp >= 0) {
>> >> > +		value &= ~GENMASK_ULL(31, 24);
>> >> > +		value |= (u64)epp << 24;
>> >> > +	}
>> >> > +
>> >> > +	if (value == prev)
>> >> > +		return;
>> >> > +
>> >> > +	WRITE_ONCE(cpu->hwp_req_cached, value);
>> >> > +	if (fast_switch)
>> >> > +		wrmsrl(MSR_HWP_REQUEST, value);
>> >> > +	else
>> >> > +		wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value);
>> >> > +}
>> >>
>> >> I've asked this question already but you may have missed it: Given that
>> >> you are of the opinion that [1] should be implemented in schedutil
>> >> instead with intel_pstate in HWP passive mode, what's your plan for
>> >> exposing the HWP_MAX_PERF knob to the governor in addition to
>> >> HWP_MIN_PERF, since the interface implemented here only allows the
>> >> governor to provide a single frequency?
>> >>
>> >> [1] https://lwn.net/ml/linux-pm/20200428032258.2518-1-currojerez@riseup.net/
>> >
>> > This is not just about the schedutil governor, but about cpufreq
>> > governors in general (someone may still want to use the performance
>> > governor on top of intel_pstate, for example).
>> >
>> > And while governors can only provide one frequency, the policy limits
>> > in the cpufreq framework are based on QoS lists now and so it is
>> > possible to add a max limit request, say from a driver, to the max QoS
>> > list, and update it as needed, causing the max policy limit to be
>> > adjusted.
>> >
>> > That said I'm not exactly sure how useful the max limit generally is
>> > in practice on HWP systems, given that setting it above the base
>> > frequency causes it to be ignored, effectively, and the turbo range
>> > may be wider than the range of P-states below the base frequency.
>> >
>>
>> I don't think that's accurate.  I've looked at hundreds of traces while
>> my series [1] was in control of HWP_REQ_MAX and I've never seen an
>> excursion above the maximum HWP_REQ_MAX control specified by it within a
>> given P-state domain, even while that maximum specified was well into
>> the turbo range.
>
> I'm not going to argue with your experience. :-)
>
> What I'm saying is that there is no guarantee that the processor will
> always select P-states below HWP_REQ_MAX in the turbo range.  That may
> not happen in practice, but it is not precluded AFAICS.
>
> Also while HWP_REQ_MAX can work in practice most of the time with HWP
> enabled, without HWP there is no easy way to limit the max frequency
> if the current request falls into the turbo range.  The HWP case is
> more important nowadays, but there still are systems without it and
> ideally they should be covered as well.
>

In the non-HWP case we have a single P-state control so the question of
how to plumb an extra P-state control from the CPUFREQ governor seems
largely irrelevant.  The current interface seems fine as-is for such
systems.

>> So, yeah, I agree that HWP_REQ_MAX is nothing like a
>> hard limit, particularly when multiple threads are running on the same
>> clock domain, but the processor will still make its best effort to limit
>> the clock frequency to the maximum of the requested maximums, even if it
>> happens to be within the turbo range.  That doesn't make it useless.
>
> I haven't used the word "useless" anywhere in my previous message.
>
> Using the max frequency to control power has merit, but how much of it
> is there depends on some factors that may change from one system to
> another.
>
> The alternative power control methods may be more reliable in general.
>

That's precisely what I've been calling into question.  IIRC the
alternative power control methods we have discussed in the past are:

 - EPP: The semantics of this control are largely unspecified beyond
   higher values being more energy-efficient than lower values.  The set
   of energy efficiency optimizations controlled by it and the
   thresholds at which they become active are fully platform-specific.
   I guess you didn't mean this one as example of a more reliable and
   less platform-specific control.

 - RAPL: The semantics of this control are indeed well-defined, it's
   able to set an absolute average power constraint to the involved
   power planes.  However, the relationship between the information
   available to the kernel about a given workload (e.g. from CPU
   performance counters) and the optimal value of the RAPL constraints
   is highly platform-specific, requiring multiple iterations of
   adjustments and performance monitoring in order to approach the
   optimal value (unlike HWP_REQ_MAX since there is a simple,
   platform-independent relationship between observed frequency
   and... frequency -- More on that below).

 - P-code mailbox interface: Available to the graphics driver when GuC
   submission is in use, which is not available currently on any
   production platform.  It won't allow the energy efficiency
   optimization I'm proposing to be taken advantage of by discrete
   graphics nor IO devices other than the GPU.  Like HWP_REQ_MAX it sets
   a constraint on the CPU P-states so most caveats of HWP_REQ_MAX would
   apply to it too.  But unlike HWP_REQ_MAX it has global effect on the
   system limiting its usefulness in a multitasking environment.
   Requires a governor to run in a GPU microcontroller with more limited
   information than CPUFREQ.

So I'm either missing some alternative power control method or I
strongly disagree that there is a more reliable and platform-independent
alternative to HWP_REQ_MAX.

>> The exact same thing can be said about controlling HWP_REQ_MIN as you're
>> doing now in this revision of your patch, BTW.
>
> Which has been clearly stated in the changelog I believe.
>

Right, which is why I found it surprising to hear the same point as a
counterargument against HWP_REQ_MAX.

> The point here is that this is as good as using the perf control
> register to ask for a given P-state without HWP while trying to drive
> the max too is added complexity.
>
>> If you don't believe me here is the turbostat sample with maximum
>> Bzy_MHz I get on the computer I'm sitting on right now while compiling a
>> kernel on CPU0 if I set HWP_REQ_MAX to 0x1c (within the turbo range):
>>
>> | Core CPU Avg_MHz Busy%  Bzy_MHz HWP_REQ            PkgWatt CorWatt
>> | -    -   757     27.03  2800    0x0000000000000000 7.13    4.90
>> | 0    0   2794    99.77  2800    0x0000000080001c04 7.13    4.90
>> | 0    2   83      2.98   2800    0x0000000080001c04
>> | 1    1   73      2.60   2800    0x0000000080001c04
>> | 1    3   78      2.79   2800    0x0000000080001c04
>>
>> With the default HWP_REQUEST:
>>
>> | Core CPU Avg_MHz Busy%  Bzy_MHz HWP_REQ            PkgWatt CorWatt
>> | -    -   814     27.00  3015    0x0000000000000000 8.49    6.18
>> | 0    0   2968    98.24  3021    0x0000000080001f04 8.49    6.18
>> | 0    2   84      2.81   2982    0x0000000080001f04
>> | 1    1   99      3.34   2961    0x0000000080001f04
>> | 1    3   105     3.60   2921    0x0000000080001f04
>>
>> > Generally, I'm not quite convinced that limiting the max frequency is
>> > really the right choice for controlling the processor's power draw on
>> > the systems in question.  There are other ways to do that, which in
>> > theory should be more effective.  I mentioned RAPL somewhere in this
>> > context and there's the GUC firmware too.
>>
>> I feel like we've had that conversation before and it's somewhat
>> off-topic so I'll keep it short: Yes, in theory RAPL is more effective
>> than HWP_REQ_MAX as a mechanism to limit the absolute power consumption
>> of the processor package, but that's not the purpose of [1], its purpose
>> is setting a lower limit to the energy efficiency of the processor when
>> the maximum usable CPU frequency is known (due to the existence of an IO
>> device bottleneck)
>
> Whether or not that frequency is actually known seems quite
> questionable to me, but that's aside.
>

It's not actually known, but it can be approximated easily under a
widely-applicable assumption -- More on that below.

> More important, it is unclear to me what you mean by "a lower limit to
> the energy efficiency of the processor".
>

If we define the instantaneous energy efficiency of a CPU (eta) to be
the ratio between its instantaneous frequency (f) and power consumption
(P), I want to be able to set a lower limit to that ratio in cases where
I can determine that doing so won't impact the performance of the
application:

|  eta_min <= eta = f / P

Setting such a lower limit to the instantaneous energy efficiency of the
processor can only lower the total amount of energy consumed by the
processor in order to perform a given amount of work (If you don't
believe me on that feel free to express it as the integral of P over
time, with P recovered from the expression above), therefore it can only
improve the average energy efficiency of the workload in the long run.

Because of the convex relationship between P and f above a certain
inflection point (AKA maximum efficiency ratio, AKA min_pstate in
intel_pstate.c), eta is monotonically decreasing with respect to
frequency above that point, therefore setting a lower limit to the
energy efficiency of the processor is equivalent to setting an upper
limit to its frequency within that range.

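
The integral argument above can be spelled out as follows. This is only a
sketch under stated assumptions: work W is taken to be the number of cycles
executed over the interval [0, T], P(f) is assumed positive and twice
differentiable, and f^* is a symbol introduced here for the
maximum-efficiency frequency. None of this notation comes from the thread
beyond eta, f and P.

```latex
\begin{align}
  W &= \int_0^T f(t)\,dt, \qquad E = \int_0^T P(t)\,dt, \qquad
  \eta(t) = \frac{f(t)}{P(t)} \\
  \eta(t) \ge \eta_{\min}
  &\;\Longrightarrow\; P(t) = \frac{f(t)}{\eta(t)} \le \frac{f(t)}{\eta_{\min}}
  \;\Longrightarrow\; E \le \frac{1}{\eta_{\min}} \int_0^T f(t)\,dt
  = \frac{W}{\eta_{\min}} \\
  \frac{d\eta}{df} &= \frac{P(f) - f\,P'(f)}{P(f)^2} \le 0
  \quad\text{whenever}\quad g(f) := f\,P'(f) - P(f) \ge 0 \\
  g'(f) &= f\,P''(f) \ge 0 \quad\text{for convex } P,
  \quad\text{so } g(f^*) = 0 \;\Longrightarrow\; g(f) \ge 0
  \text{ for all } f \ge f^*
\end{align}
```

The second line bounds the energy spent on a fixed amount of work by
W / eta_min. The last two lines show that for a convex P, eta does not
increase with frequency above f^*, so a floor on eta within that range
corresponds to a ceiling on f.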
> I guess what you mean is that the processor might decide to go for a
> more energy-efficient configuration by increasing its frequency in a
> "race to idle" fashion (in response to a perceived utilization spike)
> and you want to prevent that from occurring.
>

No, a race to idle response to a utilization spike would only be more
energy efficient than the performance-equivalent constant-frequency
response in cases where the latter constant frequency is in the
concavity region of the system's power curve (below the inflection
point).  I certainly don't want to prevent that from occurring when it's
the most energy-efficient thing to do.

> Or, generally speaking, that the CPU performance scaling logic, either
> in the kernel or in the processor itself, might select a higher
> operating frequency of a CPU in response to a perceived utilization
> spike, but that may be a mistake in the presence of another data
> processing device sharing the power budget with the processor, so you
> want to prevent that from taking place.
>

Yes, I do.

> In both cases, I wouldn't call that limiting the energy-efficiency of
> the processor.  Effectively, this means putting a limit on the
> processor's power budget, which is exactly what RAPL is for.
>

No, limiting the processor frequency also imposes a limit to its energy
efficiency due to the reason explained above.

>> -- And if the maximum usable CPU frequency is the
>> information we have at hand,
>
> How so?
>
> How can you tell what that frequency is?
>

In the general case it would take a crystal ball to know the amount of
work the CPU is going to have to do in the future, however as soon as
the system has reached a steady state (which amounts to a large fraction
of the time and energy consumption of many workloads, therefore it's an
interesting case to optimize for) its previous behavior can be taken as
proxy for its future behavior (by definition of steady state), therefore
we can measure the performance delivered by the processor in the
immediate past and make sure that the governor's response doesn't
prevent it from achieving the same performance (plus some margin in
order to account for potential fluctuations in the workload).

That's, yes, an essentially heuristic assumption, but one that underlies
every other CPU frequency governor in the Linux kernel tree to a greater
or lesser extent.

>> controlling the maximum CPU frequency
>> directly is optimal, rather than trying to find the RAPL constraint that
>> achieves the same average frequency by trial and error.  Also, in theory,
>> even if you had an oracle to tell you what the appropriate RAPL
>> constraint is, the result would necessarily be more energy-inefficient
>> than controlling the maximum CPU frequency directly, since you're giving
>> the processor additional freedom to run at frequencies above the one you
>> want to average, which is guaranteed to be more energy-inefficient than
>> running at that fixed frequency, assuming we are in the region of
>> convexity of the processor's power curve.
>
> So the reason why you want to limit the processor's max frequency in
> the first place is because it is sharing the power budget with
> something else.

No, my ultimate goal is to optimize the energy efficiency of the CPU in
cases where the system has a bottleneck elsewhere.

> If there's no sharing of the power budget or thermal constraints,
> there is no need to limit the CPU frequency other than for the sake of
> saving energy.
>

Yes!

> What you can achieve by limiting the max CPU frequency is to make the
> processor draw less power (and cause it to use either less or more
> energy, depending on the energy-efficiency curve).

Yes, in order to make sure that limiting the maximum CPU frequency
doesn't lead to increased energy usage the response of the governor is
clamped to the convexity range of the CPU power curve (which yes, I'm
aware is only an approximation to the convexity range of the
whole-system power curve).

> You don't know how much less power it will draw then, however.
>

I don't see any need to care how much less power is drawn in absolute
terms, as long as the energy efficiency of the system is improved *and*
its performance is at least the same as it was before.

> You seem to be saying "I know exactly what the maximum frequency of
> the CPU can be, so why don't I set it as the upper limit", but I'm
> questioning the source of that knowledge.  Does it not come from
> knowing the power budget you want to give to the processor?
>

No, it comes from CPU performance counters -- More on that above.

>> Anyway, if you still have some disagreement on the theoretical details
>> you're more than welcome to bring up the matter on the other thread [1],
>> or accept the invitation for a presentation I sent you months ago... ;)
>
> Why don't we continue the discussion here instead?
>
> I think we are getting to the bottom of things here.

Up to you.