Received: by 2002:a25:2c96:0:0:0:0:0 with SMTP id s144csp1705623ybs; Tue, 26 May 2020 01:23:11 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzLO1FLxIKC2cV5lHGhtusXIBNi2YLocT3TMRpaXSP94ToD6blTr9Y04S7IRnavaiTcrYZE X-Received: by 2002:a50:bb6f:: with SMTP id y102mr7384521ede.152.1590481391261; Tue, 26 May 2020 01:23:11 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1590481391; cv=none; d=google.com; s=arc-20160816; b=uxS4vBlWnccOYHwhrUOgJhYNwJRPva6Jn61+XXHJ0PTwqtDFGa/wSFU/MdgSjG9Kkc RuxVq3jRae78qzINGnUNXX4DEcQL2kuss14n6G8s5cxPzViX62U43AeBbEl4b8TGGgvb cuSv74k82NeM946XuS5MW+OtdMPLXC5RAdCIz0aoPuoeZNLRtaP1Q+jT8nvarz1UJQDm P0T74MtABvyCQ7896HWUYKZTuVw3GhIqQZywHRMeT/i7SSt3XLXkEQJtaQKQfaWUhPYB 6t/y3bXaoaFlxkFIwik33PsZU549TOwdI8GQrvwiCbXFrJI3dnJhq6HVUz77iV6i1XrY aw1g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version; bh=KsvQXeXpFRPNar5wQqzlPMtmhRYyKZVsuupY3jGdM1E=; b=Xffc53t2smxBYI4bk6lZ1lrC/c3gkJ7p5GEpU+hHE9ZQTAy1t6RWdr8HeHEMcRo3OZ QtyZyuxQn0DUGZKeKR6iO8EmUZ5Jm85izHCRrOVtEn6RJGe414NR6dQO63AXPt+84Z+L kyv5xbkgDlmcnJ/YAQQDIOqQLMq1D6qLH1XGIr329W/qS1XomnyXliDbysXNfrQGaonw gbdYg19Nkr7aulucC9B/EUAyPfCoci7WGzmHpJvtgAWJZ+TR/J5N/33NxBhWXVJZjxpW 1uLlKotqm7zv5j0MBCtQEjw2DXlBq405Aoo8bnPTWeKbXFoUT5xzFB7wNJ6BoP5iHwmO /rZg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id k2si8340500eds.297.2020.05.26.01.22.47; Tue, 26 May 2020 01:23:11 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731358AbgEZITf (ORCPT + 99 others); Tue, 26 May 2020 04:19:35 -0400 Received: from mail-ot1-f67.google.com ([209.85.210.67]:41917 "EHLO mail-ot1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726926AbgEZITf (ORCPT ); Tue, 26 May 2020 04:19:35 -0400 Received: by mail-ot1-f67.google.com with SMTP id 63so15606273oto.8; Tue, 26 May 2020 01:19:34 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=KsvQXeXpFRPNar5wQqzlPMtmhRYyKZVsuupY3jGdM1E=; b=QLUi9LjCShLQP3J/tKOvkuSW7pRQPUpfxZ1oczHkI7rmgU7aE6aLqBHkZexQTn5dkr 2jDS8OP/gihyO6pWLKoGjajwXplGAKLV7Z/YAe2i0W3yGRvK9V/3uRqccpQUrRj8Srm0 KKCW7woSzoP0HpnQZC/9agcVgt6OrU2x0Zw0zttcxFtE516jv1TufIqtHgkn8f2Pk9h9 2w1Hlf4sceXnDm8YYCdj/iELJLCnleiigfx3hkfuuUIla+uk4kZtSfZw/RGl3eCmT94a vYeuuX39+Ces5UGRsW04RX0D8O4HRf7lUul97fvfJL1NDQQNVTDAeE6GOgpOEN7ZVf7f OZcQ== X-Gm-Message-State: AOAM532umeZCwfVWi0vDKge4MkFHDFXqSHc8oAzfxzQcXDvPaz5tpDig WmOKGsTNzO8HTfHnCeGvdktMoqfy1NlYNBN+GNo= X-Received: by 2002:a9d:4713:: with SMTP id a19mr36157otf.167.1590481173929; Tue, 26 May 2020 01:19:33 -0700 (PDT) MIME-Version: 1.0 References: <3169564.ZRsPWhXyMD@kreacher> <87mu5wre1v.fsf@intel.com> <87a71vraus.fsf@intel.com> In-Reply-To: <87a71vraus.fsf@intel.com> From: "Rafael J. Wysocki" Date: Tue, 26 May 2020 10:19:21 +0200 Message-ID: Subject: Re: [RFC/RFT][PATCH] cpufreq: intel_pstate: Work in passive mode with HWP enabled To: Francisco Jerez Cc: "Rafael J. Wysocki" , "Rafael J. Wysocki" , Linux PM , LKML , Len Brown , Srinivas Pandruvada , Peter Zijlstra , Giovanni Gherdovich , Doug Smythies Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org to On Mon, May 25, 2020 at 10:57 PM Francisco Jerez wrote: > > "Rafael J. Wysocki" writes: > > > On Mon, May 25, 2020 at 3:39 AM Francisco Jerez > > wrote: > >> > >> "Rafael J. Wysocki" writes: > >> > >> > From: Rafael J. Wysocki > >> > > >> > Allow intel_pstate to work in the passive mode with HWP enabled and > >> > make it translate the target frequency supplied by the cpufreq > >> > governor in use into an EPP value to be written to the HWP request > >> > MSR (high frequencies are mapped to low EPP values that mean more > >> > performance-oriented HWP operation) as a hint for the HWP algorithm > >> > in the processor, so as to prevent it and the CPU scheduler from > >> > working against each other at least when the schedutil governor is > >> > in use. > >> > > >> > Signed-off-by: Rafael J. Wysocki > >> > --- > >> > > >> > This is a prototype not intended for production use (based on linux-next). > >> > > >> > Please test it if you can (on HWP systems, of course) and let me know the > >> > results. > >> > > >> > The INTEL_CPUFREQ_TRANSITION_DELAY_HWP value has been guessed and it very well > >> > may turn out to be either too high or too low for the general use, which is one > >> > reason why getting as much testing coverage as possible is key here. > >> > > >> > If you can play with different INTEL_CPUFREQ_TRANSITION_DELAY_HWP values, > >> > please do so and let me know the conclusions. > >> > > >> > Cheers, > >> > Rafael > >> > > >> > --- > >> > drivers/cpufreq/intel_pstate.c | 169 +++++++++++++++++++++++++++++++---------- > >> > 1 file changed, 131 insertions(+), 38 deletions(-) > >> > > >> > Index: linux-pm/drivers/cpufreq/intel_pstate.c > >> > =================================================================== > >> > --- linux-pm.orig/drivers/cpufreq/intel_pstate.c > >> > +++ linux-pm/drivers/cpufreq/intel_pstate.c > >> > @@ -36,6 +36,7 @@ > >> > #define INTEL_PSTATE_SAMPLING_INTERVAL (10 * NSEC_PER_MSEC) > >> > > >> > #define INTEL_CPUFREQ_TRANSITION_LATENCY 20000 > >> > +#define INTEL_CPUFREQ_TRANSITION_DELAY_HWP 5000 > >> > #define INTEL_CPUFREQ_TRANSITION_DELAY 500 > >> > > >> > #ifdef CONFIG_ACPI > >> > @@ -95,6 +96,8 @@ static inline int32_t percent_ext_fp(int > >> > return div_ext_fp(percent, 100); > >> > } > >> > > >> > +#define HWP_EPP_TO_BYTE(x) (((u64)x >> 24) & 0xFF) > >> > + > >> > /** > >> > * struct sample - Store performance sample > >> > * @core_avg_perf: Ratio of APERF/MPERF which is the actual average > >> > @@ -2175,7 +2178,10 @@ static int intel_pstate_verify_policy(st > >> > > >> > static void intel_cpufreq_stop_cpu(struct cpufreq_policy *policy) > >> > { > >> > - intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]); > >> > + if (hwp_active) > >> > + intel_pstate_hwp_force_min_perf(policy->cpu); > >> > + else > >> > + intel_pstate_set_min_pstate(all_cpu_data[policy->cpu]); > >> > } > >> > > >> > static void intel_pstate_stop_cpu(struct cpufreq_policy *policy) > >> > @@ -2183,12 +2189,10 @@ static void intel_pstate_stop_cpu(struct > >> > pr_debug("CPU %d exiting\n", policy->cpu); > >> > > >> > intel_pstate_clear_update_util_hook(policy->cpu); > >> > - if (hwp_active) { > >> > + if (hwp_active) > >> > intel_pstate_hwp_save_state(policy); > >> > - intel_pstate_hwp_force_min_perf(policy->cpu); > >> > - } else { > >> > - intel_cpufreq_stop_cpu(policy); > >> > - } > >> > + > >> > + intel_cpufreq_stop_cpu(policy); > >> > } > >> > > >> > static int intel_pstate_cpu_exit(struct cpufreq_policy *policy) > >> > @@ -2296,7 +2300,8 @@ static int intel_cpufreq_verify_policy(s > >> > #define INTEL_PSTATE_TRACE_TARGET 10 > >> > #define INTEL_PSTATE_TRACE_FAST_SWITCH 90 > >> > > >> > -static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, int old_pstate) > >> > +static void intel_cpufreq_trace(struct cpudata *cpu, unsigned int trace_type, > >> > + int from, int to) > >> > { > >> > struct sample *sample; > >> > > >> > @@ -2309,8 +2314,8 @@ static void intel_cpufreq_trace(struct c > >> > sample = &cpu->sample; > >> > trace_pstate_sample(trace_type, > >> > 0, > >> > - old_pstate, > >> > - cpu->pstate.current_pstate, > >> > + from, > >> > + to, > >> > sample->mperf, > >> > sample->aperf, > >> > sample->tsc, > >> > @@ -2318,40 +2323,110 @@ static void intel_cpufreq_trace(struct c > >> > fp_toint(cpu->iowait_boost * 100)); > >> > } > >> > > >> > -static int intel_cpufreq_target(struct cpufreq_policy *policy, > >> > - unsigned int target_freq, > >> > - unsigned int relation) > >> > +static void intel_cpufreq_update_hwp_request(struct cpudata *cpu, u8 new_epp) > >> > { > >> > - struct cpudata *cpu = all_cpu_data[policy->cpu]; > >> > - struct cpufreq_freqs freqs; > >> > - int target_pstate, old_pstate; > >> > + u64 value, prev; > >> > > >> > - update_turbo_state(); > >> > + prev = READ_ONCE(cpu->hwp_req_cached); > >> > + value = prev; > >> > > >> > - freqs.old = policy->cur; > >> > - freqs.new = target_freq; > >> > + /* > >> > + * The entire MSR needs to be updated in order to update the EPP field > >> > + * in it, so opportunistically update the min and max too if needed. > >> > + */ > >> > + value &= ~HWP_MIN_PERF(~0L); > >> > + value |= HWP_MIN_PERF(cpu->min_perf_ratio); > >> > + > >> > + value &= ~HWP_MAX_PERF(~0L); > >> > + value |= HWP_MAX_PERF(cpu->max_perf_ratio); > >> > + > >> > + if (boot_cpu_has(X86_FEATURE_HWP_EPP)) { > >> > + intel_cpufreq_trace(cpu, INTEL_PSTATE_TRACE_TARGET, > >> > + HWP_EPP_TO_BYTE(prev), new_epp); > >> > + > >> > + value &= ~GENMASK_ULL(31, 24); > >> > + value |= HWP_ENERGY_PERF_PREFERENCE(new_epp); > >> > + } > >> > + > >> > + if (value != prev) { > >> > + WRITE_ONCE(cpu->hwp_req_cached, value); > >> > + wrmsrl_on_cpu(cpu->cpu, MSR_HWP_REQUEST, value); > >> > + } > >> > +} > >> > + > >> > +/** > >> > + * intel_cpufreq_adjust_hwp_request - Adjust the HWP reuqest register. > >> > + * @cpu: Target CPU. > >> > + * @max_freq: Maximum frequency to consider. > >> > + * @target_freq: Target frequency selected by the governor. > >> > + * > >> > + * Translate the target frequency into a new EPP value to be written into the > >> > + * HWP request MSR of @cpu as a hint for the HW-driven P-state selection. > >> > + * > >> > + * The purpose of this is to avoid situations in which the kernel and the HWP > >> > + * algorithm work against each other by giving a hint about the expectations of > >> > + * the former to the latter. > >> > + * > >> > + * The mapping betweeen the target frequencies and the hint values need not be > >> > + * exact, but it must be monotonic, so that higher target frequencies always > >> > + * indicate more performance-oriented P-state selection. > >> > + */ > >> > +static void intel_cpufreq_adjust_hwp_request(struct cpudata *cpu, s64 max_freq, > >> > + unsigned int target_freq) > >> > +{ > >> > + s64 epp_fp = div_fp(255 * (max_freq - target_freq), max_freq); > >> > + > >> > + intel_cpufreq_update_hwp_request(cpu, fp_toint(epp_fp)); > >> > +} > >> > + > >> > >> Hey Rafael, I'm building a kernel with this in order to give it a try on > >> my system, but I'm skeptical that translating the target frequency to an > >> EPP value will work reliably. AFAIA the EPP value only has an indirect > >> influence on the processor's performance by adjusting the trade-off > >> between its responsiveness (rather than the actual clock frequency which > >> it will sustain in the long run) and its energy usage, in a largely > >> unspecified and non-linear way (non-linear like the effect of switching > >> CPU energy optimization features on and off, or like its effect on the > >> energy balancing behavior of the processor which can have a paradoxical > >> effect on performance). > >> > >> I doubt that the scheduling-based CPU utilization is a good predictor > >> for the optimal trade-off between those two variables. > > > > While I agree that this is not perfect, there barely is anything else > > that can be used for this purpose. > > > > Using the desired field or trying to adjust the limits relatively > > often would basically cause the P-state selection to be driven > > entirely by the kernel which simply doesn't know certain things only > > known to the P-unit, so it is reasonable to leave some level of > > control to the latter, so as to allow it to use the information known > > to it only. > > > > However, if it is allowed to do whatever it likes without any hints > > from the kernel, it may very well go against the scheduler's decisions > > which is not going to be optimal. > > > > I'm simply not sure if there is any other way to give such hints to it > > that through the EPP. > > > > Why not HWP_MIN_PERF? That would leave the HWP quite some room for > maneuvering (the whole HWP_MIN_PERF-HWP_MAX_PERF P-state range, it's not > like P-state selection would be entirely driven by the kernel), while > avoiding every failure scenario I was describing in my previous reply. Actually, I have been thinking about the HWP min as an alternative that may be worth evaluating. However, I would rather set the HWP min to something like 80% if the cpufreq request. Let me cut an alternative patch.