Quoting Morten Rasmussen (2014-09-22 09:24:02)
> Architectures that don't have any other means for tracking cpu frequency
> changes need a callback from cpufreq to implement a scaling factor to
> enable scale-invariant per-entity load-tracking in the scheduler.
>
> To compute the scale invariance correction factor the architecture would
> need to know both the max frequency and the current frequency. This
> patch defines weak functions for setting both from cpufreq.
>
> Related architecture specific functions use weak function definitions.
> The same approach is followed here.
>
> These callbacks can be used to implement frequency scaling of cpu
> capacity later.
>
> Signed-off-by: Morten Rasmussen <[email protected]>
> ---
> drivers/cpufreq/cpufreq.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index d9fdedd..e911f6b 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
> }
> #endif
>
> +void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
> +
> +void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
Hi Morten,
This approach assumes a single implementation for an architecture, which
probably will not scale across the myriad platforms we have merged in
mainline today. For ARM there could be any number of methods for
determining a cpus capacity, including use of CPUfreq or some other
platform-specific method.
I am vaguely aware of Intel-based platforms that do not implement ACPI,
so for that architecture I assume that an arch hook that assumes ACPI is
similarly restrictive.
I'll reply to this thread with a pair of patches that try to generalize
the functions you created in patches #2-4 of this series. I'm currently
hacking on a Chromebook 2 using the arm_big_little CPUfreq driver, so
you'll notice that is where I decided to implement the ops. Of course
those could be implemented in arch code, or some other driver.
Regards,
Mike
> +
> static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> struct cpufreq_freqs *freqs, unsigned int state)
> {
> @@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
> pr_debug("FREQ: %lu - CPU: %lu\n",
> (unsigned long)freqs->new, (unsigned long)freqs->cpu);
> trace_cpu_frequency(freqs->new, freqs->cpu);
> + arch_scale_set_curr_freq(freqs->cpu, freqs->new);
> srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
> CPUFREQ_POSTCHANGE, freqs);
> if (likely(policy) && likely(policy->cpu == freqs->cpu))
> @@ -2135,7 +2140,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> struct cpufreq_policy *new_policy)
> {
> struct cpufreq_governor *old_gov;
> - int ret;
> + int ret, cpu;
>
> pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
> new_policy->cpu, new_policy->min, new_policy->max);
> @@ -2173,6 +2178,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
> policy->min = new_policy->min;
> policy->max = new_policy->max;
>
> + for_each_cpu(cpu, policy->cpus)
> + arch_scale_set_max_freq(cpu, policy->max);
> +
> pr_debug("new min and max freqs are %u - %u kHz\n",
> policy->min, policy->max);
>
> --
> 1.7.9.5
>
>
The fair scheduler needs a method to retrieve the capacity of a cpu,
which may be derived from several platform-specific factors including
micro-architectural differences (e.g. big.LITTLE cpus), cpus with
different transistor types or process node properties (e.g. Nvidia
Tegra30 LP cpu) and cpu frequency (for cpus that dynamically scale clock
speed).
This info is inherently machine-specific and the first patch in this
series implements a new callback, .get_capacity as part of struct
capacity_ops. The default simply returns SCHED_CAPACITY_SCALE.
The second patch populates that callback with CPUfreq-based method for
machines using the arm_big_little CPUfreq driver. This can likely be
abstracted out a bit more to be generally useful to more CPUfreq drivers
but I wanted to gather feedback on the approach before going any
further.
Mike Turquette (2):
sched: cfs: introduce capacity_ops
cpufreq: arm_big_little: provide cpu capacity
arch/arm/include/asm/topology.h | 2 ++
arch/arm/kernel/topology.c | 42 ++-------------------------------
drivers/cpufreq/arm_big_little.c | 51 ++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 28 ++++++++++++++++++++++
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++-----
5 files changed, 118 insertions(+), 46 deletions(-)
--
1.8.3.2
The scheduler needs to know the current capacity of a cpu taking into
account micro-architectural differences as well as current cpu
frequency. The method for determining this may vary not only from
architecture to architecture, but also within differnt platforms of the
same architectures.
struct capacity_ops allows for a machine-specific backend to provide
this data. Access to the ops is protected by rcu_read_lock(). This is to
prevent a loadable module that provides capacity_ops callbacks from
pulling the rug out from under us while the scheduler is still using the
function.
A weak arch function used to be responsible for this, but that approach
is too limiting. For example various ARM SoCs may have wildly different
methods for determining the current cpu capacity. capacity_ops allows
many methods to be compiled into the kernel and then selects one at
run-time.
The default ops require no knowledge of hardware and do nothing. This
patch only includes .get_capacity, but future ops for updating and
setting the capacity in the works.
Signed-off-by: Mike Turquette <[email protected]>
---
Note that struct capacity_ops should have other members in it in the
future. I have an additional patch that I plan to post soon which adds
.eval_capacity as a way to for the scheduler to initiate a change in cpu
capacity (e.g. scale cpu frequency).
include/linux/sched.h | 28 ++++++++++++++++++++++++++++
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++------
2 files changed, 63 insertions(+), 6 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fa0b121..4b69000 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,6 +846,34 @@ enum cpu_idle_type {
CPU_MAX_IDLE_TYPES
};
+/**
+ * capacity_ops - helpers for understanding and changing scalable cpu capacity
+ * @get_capacity: returns current capacity of cpu, accounting for
+ * micro-architecture and frequency variability
+ *
+ * capacity_ops contains machine-specific callbacks for retreiving
+ * power-adjusted capacity and updating capacity on a set of cpus.
+ *
+ * The default ops do not interact with hardware.
+ *
+ * Using these ops requires holding rcu_read_lock() across the function call to
+ * protect against function pointers disappearing during use. This can happen
+ * if a loadable module provides the callbacks and is unloaded during execution
+ * of the callback.
+ *
+ * Updates to the ops (such as implementations based on a CPUfreq backend)
+ * requires acquring capacity_ops.lock during the change, followed by a call to
+ * synchronize_rcu().
+ */
+struct capacity_ops {
+ unsigned long (*get_capacity)(int cpu);
+ spinlock_t lock;
+};
+
+extern struct capacity_ops cfs_capacity_ops;
+
+unsigned long default_scale_load_capacity(int cpu);
+
/*
* Increase resolution of cpu_capacity calculations
*/
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 025bf3c..8124c7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2264,8 +2264,6 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
}
-unsigned long arch_scale_load_capacity(int cpu);
-
/*
* We can represent the historical contribution to runnable average as the
* coefficients of a geometric series. To do this we sub-divide our runnable
@@ -2302,7 +2300,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
u64 delta, periods;
u32 runnable_contrib;
int delta_w, decayed = 0;
- u32 scale_cap = arch_scale_load_capacity(cpu);
+ u32 scale_cap = cfs_get_capacity(cpu);
delta = now - sa->last_runnable_update;
/*
@@ -5795,14 +5793,44 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
return default_scale_smt_capacity(sd, cpu);
}
-static unsigned long default_scale_load_capacity(int cpu)
+unsigned long default_scale_load_capacity(int cpu)
{
return SCHED_CAPACITY_SCALE;
}
-unsigned long __weak arch_scale_load_capacity(int cpu)
+struct capacity_ops cfs_capacity_ops = {
+ .get_capacity = default_scale_load_capacity,
+};
+
+static unsigned long cfs_get_capacity(int cpu)
{
- return default_scale_load_capacity(cpu);
+ unsigned long ret;
+
+ rcu_read_lock();
+ ret = cfs_capacity_ops.get_capacity(cpu);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * set_default_capacity_ops - reset capacity ops to their default
+ * @eops - capacity_ops we are reseting
+ *
+ * Useful for loadable modules that supply custom capacity_ops callbacks. When
+ * unloading these modules need to restore the originals before the custom
+ * callbacks disappear.
+ *
+ * FIXME - belongs in kernel/sched/core.c?
+ */
+void set_default_capacity_ops(struct capacity_ops *eops)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&eops->lock, flags);
+ eops->get_capacity = default_scale_load_capacity;
+ spin_unlock_irqrestore(&eops->lock, flags);
+ synchronize_rcu();
}
static unsigned long scale_rt_capacity(int cpu)
@@ -8026,6 +8054,7 @@ void print_cfs_stats(struct seq_file *m, int cpu)
__init void init_sched_fair_class(void)
{
#ifdef CONFIG_SMP
+ spin_lock_init(&cfs_capacity_ops.lock);
open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
#ifdef CONFIG_NO_HZ_COMMON
--
1.8.3.2
Move the cpu capacity bits out of arch/arm/ and into the CPUfreq driver.
Not all ARM devices will use CPUfreq and it is unsafe to assume as such
in topology.c.
Instead, use the new capacity_ops introduced into CFS. If this code is
generic enough then it could be factored and shared via a header to make
it easier for other CPUfreq drivers to take advantage of it.
Signed-off-by: Mike Turquette <[email protected]>
---
This approach simply builds on top of Morten's series. I am not sure
that the per-cpu method is the best way to go in the future. And if so I
imagine that the CPUfreq core could provide everything except for the
cpu_eff part.
In general I think that the overlap between CPUfreq drivers and
arch/arm/kernel/topology.c is something that needs to addresssed soon,
as both pieces of code are re-inventing parts of each other.
arch/arm/include/asm/topology.h | 2 ++
arch/arm/kernel/topology.c | 42 ++-------------------------------
drivers/cpufreq/arm_big_little.c | 51 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 55 insertions(+), 40 deletions(-)
diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..3951232 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -14,6 +14,8 @@ struct cputopo_arm {
};
extern struct cputopo_arm cpu_topology[NR_CPUS];
+extern unsigned long max_raw_capacity;
+DECLARE_PER_CPU(unsigned long, cpu_raw_capacity);
#define topology_physical_package_id(cpu) (cpu_topology[cpu].socket_id)
#define topology_core_id(cpu) (cpu_topology[cpu].core_id)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 5f049ec..a2c9b5f 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -79,8 +79,8 @@ static unsigned long *__cpu_capacity;
static unsigned long middle_capacity = 1;
-static unsigned long max_raw_capacity = 1;
-static DEFINE_PER_CPU(unsigned long, cpu_raw_capacity);
+unsigned long max_raw_capacity = 1;
+DEFINE_PER_CPU(unsigned long, cpu_raw_capacity);
/*
* Iterate all CPUs' descriptor in DT and compute the efficiency
@@ -175,44 +175,6 @@ static void update_cpu_capacity(unsigned int cpu)
cpu, arch_scale_freq_capacity(NULL, cpu));
}
-/*
- * Scheduler load-tracking scale-invariance
- *
- * Provides the scheduler with a scale-invariance correction factor that
- * compensates for frequency scaling and micro-architecture differences between
- * cpus.
- */
-
-static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
-static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
-
-/* cpufreq callback function setting current cpu frequency */
-void arch_scale_set_curr_freq(int cpu, unsigned long freq)
-{
- atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
-}
-
-/* cpufreq callback function setting max cpu frequency */
-void arch_scale_set_max_freq(int cpu, unsigned long freq)
-{
- atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
-}
-
-unsigned long arch_scale_load_capacity(int cpu)
-{
- unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
- unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
- unsigned long ret;
-
- if (!max || !per_cpu(cpu_raw_capacity, cpu))
- return SCHED_CAPACITY_SCALE;
-
- ret = (curr * SCHED_CAPACITY_SCALE) / max;
- ret = (ret * per_cpu(cpu_raw_capacity, cpu)) / max_raw_capacity;
-
- return ret;
-}
-
#else
static inline void parse_dt_topology(void) {}
static inline void update_cpu_capacity(unsigned int cpuid) {}
diff --git a/drivers/cpufreq/arm_big_little.c b/drivers/cpufreq/arm_big_little.c
index a46c223..5baffbd 100644
--- a/drivers/cpufreq/arm_big_little.c
+++ b/drivers/cpufreq/arm_big_little.c
@@ -31,7 +31,10 @@
#include <linux/slab.h>
#include <linux/topology.h>
#include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
#include <asm/bL_switcher.h>
+#include <asm/topology.h>
#include "arm_big_little.h"
@@ -533,9 +536,52 @@ static struct notifier_block bL_switcher_notifier = {
.notifier_call = bL_cpufreq_switcher_notifier,
};
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling and micro-architecture differences between
+ * cpus.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+ atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+/*
+ * scale_load_capacity returns the current capacity for a given cpu, adjusted
+ * for micro-architectural differences and taking into accout cpu frequency
+ */
+unsigned long scale_load_capacity(int cpu)
+{
+ unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
+ unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+ unsigned long ret;
+
+ if (!max || !per_cpu(cpu_raw_capacity, cpu))
+ return SCHED_CAPACITY_SCALE;
+
+ ret = (curr * SCHED_CAPACITY_SCALE) / max;
+ ret = (ret * per_cpu(cpu_raw_capacity, cpu)) / max_raw_capacity;
+
+ return ret;
+}
+
int bL_cpufreq_register(struct cpufreq_arm_bL_ops *ops)
{
int ret, i;
+ unsigned long flags;
if (arm_bL_ops) {
pr_debug("%s: Already registered: %s, exiting\n", __func__,
@@ -550,6 +596,11 @@ int bL_cpufreq_register(struct cpufreq_arm_bL_ops *ops)
arm_bL_ops = ops;
+ spin_lock_irqsave(&cfs_capacity_ops.lock, flags);
+ cfs_capacity_ops.get_capacity = scale_load_capacity;
+ spin_unlock_irqrestore(&cfs_capacity_ops.lock, flags);
+ synchronize_rcu();
+
ret = bL_switcher_get_enabled();
set_switching_enabled(ret);
--
1.8.3.2
On Tue, Oct 07, 2014 at 11:26:11PM -0700, Mike Turquette wrote:
> +struct capacity_ops {
> + unsigned long (*get_capacity)(int cpu);
> + spinlock_t lock;
> +};
Yeah, fail there. Ops vectors should not contain serialization, that
simply doesn't work. It means you cannot switch the entire vector out.
Secondly, I dislike this because indirect function calls are more
expensive than direct calls.
> +static unsigned long cfs_get_capacity(int cpu)
> {
> - return default_scale_load_capacity(cpu);
> + unsigned long ret;
> +
> + rcu_read_lock();
> + ret = cfs_capacity_ops.get_capacity(cpu);
> + rcu_read_unlock();
> +
> + return ret;
> +}
So ideally we'd never change these things, so rcu or whatnot should not
be required.
> +/**
> + * set_default_capacity_ops - reset capacity ops to their default
> + * @eops - capacity_ops we are reseting
> + *
> + * Useful for loadable modules that supply custom capacity_ops callbacks. When
> + * unloading these modules need to restore the originals before the custom
> + * callbacks disappear.
Yeah, like hell no. We do not want modules to affect scheduler
behaviour.
On Wed, Oct 08, 2014 at 07:26:12AM +0100, Mike Turquette wrote:
> Move the cpu capacity bits out of arch/arm/ and into the CPUfreq driver.
> Not all ARM devices will use CPUfreq and it is unsafe to assume as such
> in topology.c.
>
> Instead, use the new capacity_ops introduced into CFS. If this code is
> generic enough then it could be factored and shared via a header to make
> it easier for other CPUfreq drivers to take advantage of it.
>
> Signed-off-by: Mike Turquette <[email protected]>
> ---
> This approach simply builds on top of Morten's series. I am not sure
> that the per-cpu method is the best way to go in the future. And if so I
> imagine that the CPUfreq core could provide everything except for the
> cpu_eff part.
>
> In general I think that the overlap between CPUfreq drivers and
> arch/arm/kernel/topology.c is something that needs to addresssed soon,
> as both pieces of code are re-inventing parts of each other.
I think it would be easier to deal with the capacity scaling if we split
it into two scaling factors (like Vincent proposed by using the existing
arch_scale_{cpu,freq}_capacity() functions). Then whatever handles
frequency scaling doesn't have to coordinate with uarch scaling. Both
would be factors in the range 0..1024. We would just multiply the two
scaling factors in fair.c when needed (and shift as necessary).
Wouldn't that make it simple enough that we can implement it in a
generic way in arch/*/topology.c with a bit of help from the cpufreq
framework without involving any drivers directly? Or is that just
wishful thinking? :)
On Wed, Oct 08, 2014 at 04:28:36PM -0700, Mike Turquette wrote:
> > Yeah, like hell no. We do not want modules to affect scheduler
> > behaviour.
>
> If a CPUfreq driver is the best place to know how frequency affects the
> capacity of a CPU for a given system, are you suggesting that we must
> compile that code into the kernel image instead of it being a loadable
> module?
Ideally we'll end up doing away with the cpufreq policy modules and
integrate the entire thing into the scheduler.
On Wed, Oct 08, 2014 at 03:37:32PM -0700, Mike Turquette wrote:
> It creates a dependency such that any ARM platform that wants to have
> frequency-invariant load must use CPUfreq. I don't think we want that
> dependency. CPUfreq is a likely back-end for many devices, but not all.
>
> Consider near-future ARM devices that support ACPI for power management
> with no corresponding CPUfreq driver. For example if the CPPC driver is
> not hooked up to CPUfreq, platforms using CPPC will not jive with the
> ARM arch hook that depends on CPUfreq.
Oh crap no, CPPC will not add yet another interface to cpu frequency
stuff.
On Thu, Oct 09, 2014 at 10:25:13AM -0700, Mike Turquette wrote:
> Quoting Peter Zijlstra (2014-10-09 02:02:52)
> > On Wed, Oct 08, 2014 at 03:37:32PM -0700, Mike Turquette wrote:
> > > It creates a dependency such that any ARM platform that wants to have
> > > frequency-invariant load must use CPUfreq. I don't think we want that
> > > dependency. CPUfreq is a likely back-end for many devices, but not all.
> > >
> > > Consider near-future ARM devices that support ACPI for power management
> > > with no corresponding CPUfreq driver. For example if the CPPC driver is
> > > not hooked up to CPUfreq, platforms using CPPC will not jive with the
> > > ARM arch hook that depends on CPUfreq.
> >
> > Oh crap no, CPPC will not add yet another interface to cpu frequency
> > stuff.
>
> Right.
>
> So let's say the ARM arch hook creates a dependency on CPUfreq to scale
> capacity as a function of cpu frequency (as it does in the ARM scale
> invariance series).
>
> Then let's say that a hypothetical ARM platform named "foo" uses CPPC
> and not CPUfreq to scale frequency. Foo's implementation of CPPC does
> not use any of the full-auto or hw-auto stuff. It allows the OS to
> request minimum performance levels and the like.
>
> In this case, how can foo take advantage of the scale invariant stuff?
>
> Also, feel free to replace "CPPC" with "anything other than CPUfreq".
> The problem is a general one and not specific to CPPC or ACPI.
Well answer #1 is that you simply should not ever bypass cpufreq for
setting cpu frequencies (be this the existing cpufreq or the future
integrated cpufreq).
Answer #2 is that if you were allowed to create a second infrastructure
and you're not calling the right scheduler hooks right along with it,
you're buggy.
In short, your problem, not mine.
On Thu, Oct 09, 2014 at 10:34:33AM -0700, Mike Turquette wrote:
> Quoting Peter Zijlstra (2014-10-09 02:00:24)
> > On Wed, Oct 08, 2014 at 04:28:36PM -0700, Mike Turquette wrote:
> > > > Yeah, like hell no. We do not want modules to affect scheduler
> > > > behaviour.
> > >
> > > If a CPUfreq driver is the best place to know how frequency affects the
> > > capacity of a CPU for a given system, are you suggesting that we must
> > > compile that code into the kernel image instead of it being a loadable
> > > module?
> >
> > Ideally we'll end up doing away with the cpufreq policy modules and
> > integrate the entire thing into the scheduler.
>
> I said "CPUfreq driver", not "CPUfreq governor". Certainly the scheduler
> can pick a capacity state/p-state/frequency and, in doing so, replace
> the CPUfreq policy bits.
>
> My question above was about the necessity to select the right CPUfreq
> driver at compile-time and lose support for those drivers to be loadable
> modules. Sounds like a bad idea.
Drivers should not care one way or another.