Date: Wed, 16 May 2018 11:56:48 +0200
From: Peter Zijlstra
To: Srinivas Pandruvada
Cc: tglx@linutronix.de, mingo@redhat.com, bp@suse.de, lenb@kernel.org,
	rjw@rjwysocki.net, mgorman@techsingularity.net, x86@kernel.org,
	linux-pm@vger.kernel.org, viresh.kumar@linaro.org, juri.lelli@arm.com,
	linux-kernel@vger.kernel.org, Suravee Suthikulpanit,
	"Rafael J. Wysocki", Vincent Guittot, Morten Rasmussen,
	Dietmar Eggemann, Sudeep Holla
Subject: Re: [RFC/RFT] [PATCH 01/10] x86,sched: Add support for frequency invariance
Message-ID: <20180516095648.GB12217@hirez.programming.kicks-ass.net>
References: <20180516044911.28797-1-srinivas.pandruvada@linux.intel.com>
 <20180516044911.28797-2-srinivas.pandruvada@linux.intel.com>
In-Reply-To: <20180516044911.28797-2-srinivas.pandruvada@linux.intel.com>

Thanks for posting this one; I meant to start a thread on this for a
while but never got around to doing so.

I left the 'important' parts of the patch for context but removed all
the arch fiddling to find the max freq, as that is not so important
here.

On Tue, May 15, 2018 at 09:49:02PM -0700, Srinivas Pandruvada wrote:
> From: Peter Zijlstra
> 
> Implement arch_scale_freq_capacity() for 'modern' x86. This function
> is used by the scheduler to correctly account usage in the face of
> DVFS.
> 
> For example; suppose a CPU has two frequencies: 500 and 1000 MHz. When
> running a task that would consume 1/3rd of a CPU at 1000 MHz, it would
> appear to consume 2/3rd (or 66.6%) when running at 500 MHz, giving the
> false impression this CPU is almost at capacity, even though it can go
> faster [*].
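[Editorial aside: the arithmetic in the example above can be sketched as a
tiny user-space helper. The function name and the fixed-point scale usage
here are illustrative only, not the kernel API.]

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/*
 * Hypothetical helper: scale a raw, busy-time based utilization by the
 * current/max frequency ratio so that a task's demand reads the same at
 * any frequency. Utilization is fixed point: 1024 == fully busy.
 */
static unsigned long freq_invariant_util(unsigned long raw_util,
					 unsigned long cur_mhz,
					 unsigned long max_mhz)
{
	return raw_util * cur_mhz / max_mhz;
}
```

With the numbers from the example: a task busy 2/3 of the time at 500 MHz
(raw 682/1024) scales down to ~341/1024, i.e. the 1/3 it would consume at
1000 MHz.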
> 
> Since modern x86 has hardware control over the actual frequency we run
> at (because, amongst other things, Turbo-Mode), we cannot simply use
> the frequency as requested through cpufreq.
> 
> Instead we use the APERF/MPERF MSRs to compute the effective frequency
> over the recent past. Also, because reading MSRs is expensive, don't
> do so every time we need the value, but amortize the cost by doing it
> every tick.
> 
> [*] this assumes a linear frequency/performance relation; which
> everybody knows to be false, but given realities it's the best
> approximation we can make.
> 
> Cc: Thomas Gleixner
> Cc: Suravee Suthikulpanit
> Cc: "Rafael J. Wysocki"
> Signed-off-by: Peter Zijlstra (Intel)
> Signed-off-by: Srinivas Pandruvada
> ---
> diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
> index c1d2a98..3fb5346 100644
> --- a/arch/x86/include/asm/topology.h
> +++ b/arch/x86/include/asm/topology.h
> @@ -172,4 +172,33 @@ static inline void sched_clear_itmt_support(void)
>  }
>  #endif /* CONFIG_SCHED_MC_PRIO */
>  
> +#ifdef CONFIG_SMP
> +#include
> +
> +#define arch_scale_freq_tick arch_scale_freq_tick
> +#define arch_scale_freq_capacity arch_scale_freq_capacity
> +
> +DECLARE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static inline long arch_scale_freq_capacity(int cpu)
> +{
> +	if (static_cpu_has(X86_FEATURE_APERFMPERF))
> +		return per_cpu(arch_cpu_freq, cpu);
> +
> +	return 1024 /* SCHED_CAPACITY_SCALE */;
> +}
> +
> +extern void arch_scale_freq_tick(void);
> +extern void x86_arch_scale_freq_tick_enable(void);
> +extern void x86_arch_scale_freq_tick_disable(void);
> +#else
> +static inline void x86_arch_scale_freq_tick_enable(void)
> +{
> +}
> +
> +static inline void x86_arch_scale_freq_tick_disable(void)
> +{
> +}
> +#endif
> +
>  #endif /* _ASM_X86_TOPOLOGY_H */
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 0f1cbb0..9e2cb82 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -1676,3 +1680,193 @@ void native_play_dead(void)
>  }
>  
>  #endif
> +
> +/*
> + * APERF/MPERF frequency ratio computation.
> + *
> + * The scheduler wants to do frequency invariant accounting and needs a <1
> + * ratio to account for the 'current' frequency.
> + *
> + * Since the frequency on x86 is controlled by micro-controller and our P-state
> + * setting is little more than a request/hint, we need to observe the effective
> + * frequency. We do this with APERF/MPERF.
> + *
> + * One complication is that the APERF/MPERF ratio can be >1, specifically
> + * APERF/MPERF gives the ratio relative to the max non-turbo P-state. Therefore
> + * we need to re-normalize the ratio.
> + *
> + * We do this by tracking the max APERF/MPERF ratio previously observed and
> + * scaling our MPERF delta with that. Every time our ratio goes over 1, we
> + * proportionally scale up our old max.

One very important point however is that I wrote this patch in the
context of Vincent's new scale invariance proposal:

  https://lkml.kernel.org/r/1493389435-2525-1-git-send-email-vincent.guittot@linaro.org

The reason is that while this 'works' with the current scale
invariance, the way turbo is handled is not optimal for it.

At OSPM we briefly touched upon this subject; since ARM will also need
something like this for some of their chips, this is of general
interest.

The problem with turbo of course is that our max frequency is
variable; but we really rather would like a unit value for scaling.
Returning a >1 value results in weird things (think of running for
1.5ms in 1ms wall-time, for example).

This implementation simply finds the absolute max observed and scales
that as 1, with the result that when we're busy we'll always run at <1
because we cannot sustain turbo. This might result in the scheduler
thinking we're not fully busy, when in fact we are.

At OSPM it was suggested to instead track an average max, or set 1 at
the sustainable freq and clip overshoot.
The problem with that is that it is actually hard to track an average
max if you cannot tell what max even is. The problem with clipping of
course is that we'll end up biasing the frequencies higher than
required -- which might be OK if the overshoot is 'small' as it would
typically be for an average max thing, but not when we set 1 at the
sustainable frequency.

I think the new scale invariance solves a bunch of these problems by
always saturating, irrespective of the actual frequency we run at. Of
course, IIRC it had other issues...

> + * The down-side to this runtime max search is that you have to trigger the
> + * actual max frequency before your scale is right. Therefore allow
> + * architectures to initialize the max ratio on CPU bringup.
> + */
> +
> +static DEFINE_PER_CPU(u64, arch_prev_aperf);
> +static DEFINE_PER_CPU(u64, arch_prev_mperf);
> +static DEFINE_PER_CPU(u64, arch_prev_max_freq) = SCHED_CAPACITY_SCALE;
> +
> +DEFINE_PER_CPU(unsigned long, arch_cpu_freq);
> +
> +static bool tick_disable;
> +
> +void arch_scale_freq_tick(void)
> +{
> +	u64 freq, max_freq = this_cpu_read(arch_prev_max_freq);
> +	u64 aperf, mperf;
> +	u64 acnt, mcnt;
> +
> +	if (!static_cpu_has(X86_FEATURE_APERFMPERF) || tick_disable)
> +		return;
> +
> +	rdmsrl(MSR_IA32_APERF, aperf);
> +	rdmsrl(MSR_IA32_MPERF, mperf);
> +
> +	acnt = aperf - this_cpu_read(arch_prev_aperf);
> +	mcnt = mperf - this_cpu_read(arch_prev_mperf);
> +	if (!mcnt)
> +		return;
> +
> +	this_cpu_write(arch_prev_aperf, aperf);
> +	this_cpu_write(arch_prev_mperf, mperf);
> +
> +	acnt <<= 2*SCHED_CAPACITY_SHIFT;
> +	mcnt *= max_freq;
> +
> +	freq = div64_u64(acnt, mcnt);
> +
> +	if (unlikely(freq > SCHED_CAPACITY_SCALE)) {
> +		max_freq *= freq;
> +		max_freq >>= SCHED_CAPACITY_SHIFT;
> +
> +		this_cpu_write(arch_prev_max_freq, max_freq);
> +
> +		freq = SCHED_CAPACITY_SCALE;
> +	}
> +
> +	this_cpu_write(arch_cpu_freq, freq);
> +}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 092f7c4..2bdef36 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3076,6 +3076,7 @@ void scheduler_tick(void)
>  	struct task_struct *curr = rq->curr;
>  	struct rq_flags rf;
>  
> +	arch_scale_freq_tick();
>  	sched_clock_tick();
>  
>  	rq_lock(rq, &rf);
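[Editorial aside: the renormalization arithmetic from the quoted
arch_scale_freq_tick() can be modelled in user space as below. The helper
name and the test ratios are mine; the real code reads the APERF/MPERF MSRs
and keeps its state in per-CPU variables.]

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1ULL << SCHED_CAPACITY_SHIFT)

/*
 * User-space model of the renormalization in arch_scale_freq_tick():
 * compute freq = (acnt / mcnt) / max_ratio in fixed point, and whenever
 * the result exceeds 1 (SCHED_CAPACITY_SCALE), scale the stored max up
 * proportionally so that this observation maps to exactly 1 from then on.
 * max_ratio starts at SCHED_CAPACITY_SCALE, i.e. ratio 1.
 */
static uint64_t scale_freq(uint64_t acnt, uint64_t mcnt, uint64_t *max_ratio)
{
	uint64_t freq;

	if (!mcnt)
		return SCHED_CAPACITY_SCALE;

	/* Mirrors: acnt <<= 2*SHIFT; mcnt *= max_freq; freq = acnt/mcnt */
	freq = (acnt << (2 * SCHED_CAPACITY_SHIFT)) / (mcnt * *max_ratio);

	if (freq > SCHED_CAPACITY_SCALE) {
		/* New max observed: grow max_ratio so this reading == 1. */
		*max_ratio = (*max_ratio * freq) >> SCHED_CAPACITY_SHIFT;
		freq = SCHED_CAPACITY_SCALE;
	}

	return freq;
}
```

Feeding it a previously unseen turbo ratio (e.g. APERF/MPERF = 1.25) clips
the result to 1 and bumps the stored max; a subsequent tick at the base
ratio then reads as roughly 0.8 -- exactly the "busy but <1" behaviour
discussed above.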