Message-ID: <1570019274.22393.2.camel@suse.cz>
Subject: Re: [PATCH 1/2] x86,sched: Add support for frequency invariance
From: Giovanni Gherdovich
To: Peter Zijlstra
Cc: Quentin Perret, srinivas.pandruvada@linux.intel.com,
    tglx@linutronix.de, mingo@redhat.com, bp@suse.de, lenb@kernel.org,
    rjw@rjwysocki.net, x86@kernel.org, linux-pm@vger.kernel.org,
    linux-kernel@vger.kernel.org, mgorman@techsingularity.net,
    matt@codeblueprint.co.uk, viresh.kumar@linaro.org,
    juri.lelli@redhat.com, pjt@google.com, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com
Date: Wed, 02 Oct 2019 14:27:54 +0200
In-Reply-To: <20190924160022.GB2386@hirez.programming.kicks-ass.net>
References: <20190909024216.5942-1-ggherdovich@suse.cz>
    <20190909024216.5942-2-ggherdovich@suse.cz>
    <20190914105708.GA12877@qperret.net>
    <1568730466.3329.4.camel@suse.cz>
    <20190924140332.GL2369@hirez.programming.kicks-ass.net>
    <20190924160022.GB2386@hirez.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 2019-09-24 at 18:00 +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 04:03:32PM +0200, Peter Zijlstra wrote:
>
> > > I'll check what's the cost of static_cpu_has() and if it's non-negligible I'll
> > > do what you suggest (x86-specific version of arch_scale_freq_invariant().
> >
> > static_cpu_has() is an alternative and ends up being a static branch
> > (similar to static_key) once the alternative patching runs.
>
> That said; I think you want a static key anyway, because if we can't
> tell the max_freq we don't want to use the invariant stuff.
>
> Something a little like so on top perhaps.
>
> Also, the below fixes that silly tick_disable stuff.

Thanks for this patch, I'll add this change in v2.

Can you elaborate on what you don't like in the tick_disable mechanism?
After reading your comments I realized there is a problem, but I'm not sure
it is the same one you're addressing.

More on this below, under your edit of the function
x86_arch_scale_freq_tick_disable().

>
> ---
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -196,20 +196,24 @@ static inline void sched_clear_itmt_supp
>  #ifdef CONFIG_SMP
>  #include
>
> -#define arch_scale_freq_tick arch_scale_freq_tick
> -#define arch_scale_freq_capacity arch_scale_freq_capacity
> +DECLARE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
> +#define arch_scale_freq_invariant() static_branch_likely(&arch_scale_freq_key)

This confused me for a second, but then I realized that this #define comes
before the one in kernel/sched/sched.h, where arch_scale_freq_invariant() is
defined again but guarded against previous definitions. So it all falls into
place: code from schedutil will see this one.
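For readers following along, here is a minimal user-space sketch of the override pattern I'm describing: the arch header defines the macro first, and the generic header only supplies a fallback when nothing is defined yet. The names demo_freq_key and demo_is_invariant() are hypothetical, the plain bool merely stands in for the static key, and the fallback value is an assumption made for illustration; none of this is the actual kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-in for the arch header (asm/topology.h in the patch).
 * demo_freq_key is a hypothetical plain bool replacing the static key.
 */
static bool demo_freq_key = true;
#define arch_scale_freq_invariant() (demo_freq_key)

/*
 * Stand-in for the generic header (kernel/sched/sched.h in the patch).
 * The fallback is only used when no earlier definition exists; its
 * value here is an assumption chosen for this sketch.
 */
#ifndef arch_scale_freq_invariant
#define arch_scale_freq_invariant() (false)
#endif

/* Hypothetical caller, playing the role of generic scheduler code. */
bool demo_is_invariant(void)
{
	/* Expands to the arch definition, because it came first. */
	return arch_scale_freq_invariant();
}
```

Since the arch definition is seen first, the guarded fallback is skipped and demo_is_invariant() simply follows demo_freq_key.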
>
>  DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
>
>  static inline long arch_scale_freq_capacity(int cpu)
>  {
> -	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +	if (arch_scale_freq_invariant())
>  		return per_cpu(arch_cpu_freq, cpu);
>
>  	return 1024 /* SCHED_CAPACITY_SCALE */;
>  }
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
>
>  extern void arch_scale_freq_tick(void);
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +
>  extern void x86_arch_scale_freq_tick_enable(void);
>  extern void x86_arch_scale_freq_tick_disable(void);
>  #else
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1799,6 +1799,8 @@ void native_play_dead(void)
>   * freq_curr / freq_max to eventually grow >1, in which case we clip it to 1.
>   */
>
> +DEFINE_STATIC_KEY_FALSE(arch_scale_freq_key);
> +
>  static DEFINE_PER_CPU(u64, arch_prev_aperf);
>  static DEFINE_PER_CPU(u64, arch_prev_mperf);
>  static u64 arch_max_freq = SCHED_CAPACITY_SCALE;
> @@ -1860,6 +1862,8 @@ static void core_set_cpu_max_freq(void)
>  	turbo_ratio = (turbo_ratio >> 24) & 0xFF; /* 4C turbo ratio */
>
>  	arch_max_freq = div_u64(turbo_ratio * SCHED_CAPACITY_SCALE, ratio);
> +
> +	static_key_enable(&arch_scale_freq_key);
>  }
>
>  static void intel_set_cpu_max_freq(void)
> @@ -1876,10 +1880,19 @@ static void intel_set_cpu_max_freq(void)
>  	core_set_cpu_max_freq();
>  }
>
> -static void set_cpu_max_freq(void)
> +static void init_scale_freq(void *arg)
>  {
>  	u64 aperf, mperf;
>
> +	rdmsrl(MSR_IA32_APERF, aperf);
> +	rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +}
> +
> +static void set_cpu_max_freq(void)
> +{
>  	if (!boot_cpu_has(X86_FEATURE_APERFMPERF))
>  		return;
>
> @@ -1891,11 +1904,7 @@ static void set_cpu_max_freq(void)
>  		break;
>  	}
>
> -	rdmsrl(MSR_IA32_APERF, aperf);
> -	rdmsrl(MSR_IA32_MPERF, mperf);
> -
> -	this_cpu_write(arch_prev_aperf, aperf);
> -	this_cpu_write(arch_prev_mperf, mperf);
> +	init_scale_freq(NULL);
>  }
>
>  DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> @@ -1908,7 +1917,7 @@ void arch_scale_freq_tick(void)
>  	u64 aperf, mperf;
>  	u64 acnt, mcnt;
>
> -	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
> +	if (!arch_scale_freq_invariant() || tick_disable)
>  		return;
>
>  	rdmsrl(MSR_IA32_APERF, aperf);
> @@ -1940,5 +1949,6 @@ void x86_arch_scale_freq_tick_enable(voi
>
>  void x86_arch_scale_freq_tick_disable(void)
>  {
> +	on_each_cpu(init_scale_freq, NULL, 1);
>  	tick_disable = true;

I don't see why the call to init_scale_freq() is needed here; why would I
care about what's in arch_prev_[am]perf at this point?
arch_scale_freq_tick() will see that tick_disable == true and exit early,
before reading arch_prev_[am]perf.

The problem IMO emerges in the following configuration, which is a bug in
the patch I sent:

  * arch_scale_freq_invariant() is true (because we have APERF/MPERF)
  * arch_scale_freq_capacity() is non-trivial (reads arch_cpu_freq)
  * tick calculations are disabled

In this case arch_scale_freq_capacity() feeds stale data to the function
update_rq_clock_pelt() in kernel/sched/pelt.h. I initially missed this
problem because I forgot that PELT signals have more users than just the
schedutil governor (the load balancer, etc.). This is exactly the situation
produced by patch 2/2, which disables the tick calculations for
intel_cpufreq (aka intel_pstate=passive).

I think the fix for this is to set arch_cpu_freq (each per-cpu instance of
the variable) to SCHED_CAPACITY_SCALE here in
x86_arch_scale_freq_tick_disable(). That would render the scaling factor for
invariance moot (always 1), just as it is without scale invariance.

I'm sending v2 with all your amendments except this last one.


Giovanni
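
P.S. To make the proposed fix concrete, here is a rough user-space model of it, not the code that will go in v2. A plain array stands in for the per-cpu variable, and every name with a demo suffix (plus NR_CPUS_DEMO) is hypothetical. The point it illustrates: once tick calculations are disabled, each CPU's scale factor is reset to SCHED_CAPACITY_SCALE, so readers no longer see a stale value.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_DEMO		4	/* hypothetical CPU count for the model */
#define SCHED_CAPACITY_SCALE	1024UL

/* Plain array standing in for the per-cpu variable arch_cpu_freq. */
static unsigned long arch_cpu_freq_demo[NR_CPUS_DEMO] = {
	SCHED_CAPACITY_SCALE, SCHED_CAPACITY_SCALE,
	SCHED_CAPACITY_SCALE, SCHED_CAPACITY_SCALE,
};
static bool tick_disable_demo;

/* Models the tick path: it stores a new scale factor unless disabled. */
void scale_freq_tick_demo(int cpu, unsigned long scale)
{
	assert(cpu >= 0 && cpu < NR_CPUS_DEMO);
	if (tick_disable_demo)
		return;
	arch_cpu_freq_demo[cpu] = scale;
}

/* Models the reader, e.g. what PELT would consume for a given CPU. */
unsigned long scale_freq_capacity_demo(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS_DEMO);
	return arch_cpu_freq_demo[cpu];
}

/*
 * Models the proposed fix: reset every per-cpu value before disabling
 * the tick calculations, so the scale factor is always 1 afterwards.
 */
void scale_freq_tick_disable_demo(void)
{
	for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
		arch_cpu_freq_demo[cpu] = SCHED_CAPACITY_SCALE;
	tick_disable_demo = true;
}
```

Without the reset loop, a value written by the last tick before the disable would keep being returned to readers indefinitely, which is the stale-data case described above.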