Date: Thu, 23 Jul 2015 12:06:26 +0100
From: Morten Rasmussen
To: Leo Yan
Cc: peterz@infradead.org, mingo@redhat.com, vincent.guittot@linaro.org,
    daniel.lezcano@linaro.org, Dietmar Eggemann, yuyang.du@intel.com,
    mturquette@baylibre.com, rjw@rjwysocki.net, Juri Lelli,
    sgurrappadi@nvidia.com, pang.xunlei@zte.com.cn,
    linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org, Russell King
Subject: Re: [RFCv5, 01/46] arm: Frequency invariant scheduler load-tracking support
Message-ID: <20150723110626.GC21785@e105550-lin.cambridge.arm.com>
References: <1436293469-25707-2-git-send-email-morten.rasmussen@arm.com>
    <20150721154145.GA23852@leoy-linaro>
    <20150722133103.GA21785@e105550-lin.cambridge.arm.com>
    <20150722145904.GA18354@leoy-linaro>
In-Reply-To: <20150722145904.GA18354@leoy-linaro>

On Wed, Jul 22, 2015 at 10:59:04PM +0800, Leo Yan wrote:
> On Wed, Jul 22, 2015 at 02:31:04PM +0100, Morten Rasmussen wrote:
> > On Tue, Jul 21, 2015 at 11:41:45PM +0800, Leo Yan wrote:
> > > Hi Morten,
> > >
> > > On Tue, Jul 07, 2015 at 07:23:44PM +0100, Morten Rasmussen wrote:
> > > > From: Morten Rasmussen
> > > >
> > > > Implements arch-specific function to provide the scheduler with a
> > > > frequency scaling correction factor for more accurate load-tracking.
> > > > The factor is:
> > > >
> > > >   current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> > > >
> > > > This implementation only provides frequency invariance. No cpu
> > > > invariance yet.
> > > >
> > > > Cc: Russell King
> > > >
> > > > Signed-off-by: Morten Rasmussen
> > > >
> > > > ---
> > > >  arch/arm/include/asm/topology.h |  7 +++++
> > > >  arch/arm/kernel/smp.c           | 57 +++++++++++++++++++++++++++++++++++++++--
> > > >  arch/arm/kernel/topology.c      | 17 ++++++++++++
> > > >  3 files changed, 79 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> > > > index 370f7a7..c31096f 100644
> > > > --- a/arch/arm/include/asm/topology.h
> > > > +++ b/arch/arm/include/asm/topology.h
> > > > @@ -24,6 +24,13 @@ void init_cpu_topology(void);
> > > >  void store_cpu_topology(unsigned int cpuid);
> > > >  const struct cpumask *cpu_coregroup_mask(int cpu);
> > > >
> > > > +#define arch_scale_freq_capacity arm_arch_scale_freq_capacity
> > > > +struct sched_domain;
> > > > +extern
> > > > +unsigned long arm_arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
> > > > +
> > > > +DECLARE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > > +
> > > >  #else
> > > >
> > > >  static inline void init_cpu_topology(void) { }
> > > > diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
> > > > index cca5b87..a32539c 100644
> > > > --- a/arch/arm/kernel/smp.c
> > > > +++ b/arch/arm/kernel/smp.c
> > > > @@ -677,12 +677,34 @@ static DEFINE_PER_CPU(unsigned long, l_p_j_ref);
> > > >  static DEFINE_PER_CPU(unsigned long, l_p_j_ref_freq);
> > > >  static unsigned long global_l_p_j_ref;
> > > >  static unsigned long global_l_p_j_ref_freq;
> > > > +static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
> > > > +DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);
> > > > +
> > > > +/*
> > > > + * Scheduler load-tracking scale-invariance
> > > > + *
> > > > + * Provides the scheduler with a scale-invariance correction factor that
> > > > + * compensates for frequency scaling through arch_scale_freq_capacity()
> > > > + * (implemented in topology.c).
> > > > + */
> > > > +static inline
> > > > +void scale_freq_capacity(int cpu, unsigned long curr, unsigned long max)
> > > > +{
> > > > +	unsigned long capacity;
> > > > +
> > > > +	if (!max)
> > > > +		return;
> > > > +
> > > > +	capacity = (curr << SCHED_CAPACITY_SHIFT) / max;
> > > > +	atomic_long_set(&per_cpu(cpu_freq_capacity, cpu), capacity);
> > > > +}
> > > >
> > > >  static int cpufreq_callback(struct notifier_block *nb,
> > > >  					unsigned long val, void *data)
> > > >  {
> > > >  	struct cpufreq_freqs *freq = data;
> > > >  	int cpu = freq->cpu;
> > > > +	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> > > >
> > > >  	if (freq->flags & CPUFREQ_CONST_LOOPS)
> > > >  		return NOTIFY_OK;
> > > > @@ -707,6 +729,10 @@ static int cpufreq_callback(struct notifier_block *nb,
> > > >  					per_cpu(l_p_j_ref_freq, cpu),
> > > >  					freq->new);
> > > >  	}
> > > > +
> > > > +	if (val == CPUFREQ_PRECHANGE)
> > > > +		scale_freq_capacity(cpu, freq->new, max);
> > > > +
> > > >  	return NOTIFY_OK;
> > > >  }
> > > >
> > > > @@ -714,11 +740,38 @@ static struct notifier_block cpufreq_notifier = {
> > > >  	.notifier_call = cpufreq_callback,
> > > >  };
> > > >
> > > > +static int cpufreq_policy_callback(struct notifier_block *nb,
> > > > +					unsigned long val, void *data)
> > > > +{
> > > > +	struct cpufreq_policy *policy = data;
> > > > +	int i;
> > > > +
> > > > +	if (val != CPUFREQ_NOTIFY)
> > > > +		return NOTIFY_OK;
> > > > +
> > > > +	for_each_cpu(i, policy->cpus) {
> > > > +		scale_freq_capacity(i, policy->cur, policy->max);
> > > > +		atomic_long_set(&per_cpu(cpu_max_freq, i), policy->max);
> > > > +	}
> > > > +
> > > > +	return NOTIFY_OK;
> > > > +}
> > > > +
> > > > +static struct notifier_block cpufreq_policy_notifier = {
> > > > +	.notifier_call = cpufreq_policy_callback,
> > > > +};
> > > > +
> > > >  static int __init register_cpufreq_notifier(void)
> > > >  {
> > > > -	return cpufreq_register_notifier(&cpufreq_notifier,
> > > > +	int ret;
> > > > +
> > > > +	ret = cpufreq_register_notifier(&cpufreq_notifier,
> > > >  						CPUFREQ_TRANSITION_NOTIFIER);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	return cpufreq_register_notifier(&cpufreq_policy_notifier,
> > > > +						CPUFREQ_POLICY_NOTIFIER);
> > > >  }
> > > >  core_initcall(register_cpufreq_notifier);
> > >
> > > For "cpu_freq_capacity" structure, could move it into driver/cpufreq
> > > so that it can be shared by all architectures? Otherwise, every
> > > architecture's smp.c need register notifier for themselves.
> >
> > We could, but I put it in arch/arm/* as not all architectures might want
> > this notifier. The frequency scaling factor could be provided based on
> > architecture specific performance counters instead. AFAIK, the Intel
> > p-state driver does not even fire the notifiers so the notifier
> > solution would be redundant code for those platforms.
>
> When i tried to enable EAS on Hikey, i found it's absent related code
> for arm64; actually this code section can also be reused by arm64,
> so just brought up this question.

Yes. We have patches for arm64 if you are interested. We are using them
for the Juno platforms.

> Just now roughly went through the driver
> "drivers/cpufreq/intel_pstate.c"; that's true it has different
> implementation comparing to usual ARM SoCs. So i'd like to ask this
> question with another way: should cpufreq framework provides helper
> functions for getting related cpu frequency scaling info? If the
> architecture has specific performance counters then it can ignore
> these helper functions.

That is the idea with the notifiers. If the architecture code for a
specific architecture wants to be poked by cpufreq when the frequency
is changed, it should have a way to subscribe to those.
Another way of implementing it is to let the architecture code call a
helper function in cpufreq every time the scheduler calls into the
architecture code to get the scaling factor
(arch_scale_freq_capacity()). We actually did it that way a couple of
versions back using weak functions. It wasn't as clean as using the
notifiers, but if we make the necessary changes to cpufreq to let the
architecture code call into cpufreq that could be even better.

>
> > That said, the above solution is not handling changes to policy->max
> > very well. Basically, we don't inform the scheduler if it has changed
> > which means that the OPP represented by "100%" might change. We need
> > cpufreq to keep track of the true max frequency when policy->max is
> > changed to work out the correct scaling factor instead of having it
> > relative to policy->max.
>
> i'm not sure understand correctly here. For example, when thermal
> framework limits the cpu frequency, it will update the value for
> policy->max, so scheduler will get the correct scaling factor, right?
> So i don't know what's the issue at here.
>
> Further more, i noticed in the later patches for
> arch_scale_cpu_capacity(); the cpu capacity is calculated by the
> property passed by DT, so it's a static value. In some cases, system
> may constraint the maximum frequency for CPUs, so in this case, will
> scheduler get misknowledge from arch_scale_cpu_capacity after system
> has imposed constraint for maximum frequency?

The issue is first of all to define what 100% means. Is it
policy->cur/policy->max or policy->cur/uncapped_max, where uncapped_max
is the max frequency supported by the hardware when not capped in any
way by governors or the thermal framework?

If we choose the first definition then we have to recalculate the cpu
capacity scaling factor (arch_scale_cpu_capacity()) too whenever
policy->max changes, such that capacity_orig is updated appropriately.
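
To make the two definitions concrete, here is a rough userspace sketch
(not kernel code). The helper names freq_scale() and show_definitions()
and the example frequencies are made up for illustration, and returning
0 for a zero reference frequency is a choice of this sketch, not what
the patch does:

```c
#include <assert.h>
#include <stdio.h>

#define SCHED_CAPACITY_SHIFT 10 /* 1 << 10 == 1024 represents "100%" */

/*
 * Hypothetical helper: frequency scaling factor of cur_khz relative to
 * ref_khz on the 1024 scale. Returns 0 when ref_khz is 0 (sketch-only
 * convention).
 */
static unsigned long freq_scale(unsigned long cur_khz, unsigned long ref_khz)
{
	if (!ref_khz)
		return 0;
	return (cur_khz << SCHED_CAPACITY_SHIFT) / ref_khz;
}

/* Print the factor under both candidate definitions of "100%". */
static void show_definitions(unsigned long cur_khz, unsigned long policy_max_khz,
			     unsigned long uncapped_max_khz)
{
	/* A cap can only lower the maximum, never raise it. */
	assert(policy_max_khz <= uncapped_max_khz);

	printf("cur/policy->max:  %lu\n", freq_scale(cur_khz, policy_max_khz));
	printf("cur/uncapped_max: %lu\n", freq_scale(cur_khz, uncapped_max_khz));
}
```

Calling show_definitions(1000000, 1000000, 2000000), i.e. a cpu running
at a 1 GHz cap on hardware that can do 2 GHz, gives 1024 under the first
definition and 512 under the second.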
The scale-invariance code in the scheduler assumes:

  arch_scale_cpu_capacity()*arch_scale_freq_capacity() = current capacity

...and that capacity_orig = arch_scale_cpu_capacity() is the max
available capacity. If we cap the frequency to, say, 50% by setting
policy->max then we have to reduce arch_scale_cpu_capacity() to 50% to
still get the right current capacity using the expression above.

Using the second definition, arch_scale_cpu_capacity() can be a static
value and arch_scale_freq_capacity() is always relative to uncapped_max.
It seems simpler, but capacity_orig could then be an unavailable
capacity and hence we would need to introduce a third capacity to track
the current max capacity and use that for scheduling decisions.

As you have already discovered, the current code is a combination of
both, which is broken when policy->max is reduced.

Thinking more about it, I would suggest going with the first definition.
The scheduler doesn't need to know about currently unavailable compute
capacity; it should balance based on the current situation, so it seems
to make sense to let capacity_orig reflect the current max capacity.

I would suggest that we fix arch_scale_cpu_capacity() to take
policy->max changes into account. We need to know the uncapped max
frequency somehow to do that. I haven't looked into whether we can get
that from cpufreq. Also, we need to make sure that no load-balance code
assumes that cpus have a capacity of 1024.

> Sorry if these questions have been discussed before :)

No problem. I don't think we have discussed it in this detail before,
and these are very valid points.

Thanks,
Morten