LinuxLists.cc - [PATCH 0/8] schedutil enhancements

2016-03-14 05:26:15

Subject: [PATCH 0/8] schedutil enhancements

I'm happy that scheduler-driven cpu frequency selection is getting some
attention. Rafael's recent schedutil governor is a step in the right direction.
This series builds on top of Rafael's schedutil governor, bringing it to parity
with some of the features in the schedfreq series posted by Steve[0], as well
as adding a couple of new things.

Patch 1 removes cpufreq_trigger_update()

Patches 2-4 move the cfs capacity margin out of the governor and into
cfs. This value is made tunable by a sysfs control in schedutil.

Patches 5-6 make cpufreq_update_util() aware of multiple scheduler
classes (cfs, rt & dl), and add storage & summation of these per-class
utilization values into schedutil.

Patches 7-8 introduces Dietmar's generic cpufreq implementation[1] of the
frequency invariance hook and changes the preprocessor magic in sched.h to
favor the cpufreq implementation over arch- or platform-specific ones.

If accepted, this series makes it trivial to port Steve and Juri's fine-grained
frequency selection in cfs and Vincent's rt utilization patch to the schedutil
governor.[2-6]

[0] lkml.kernel.org/r/[email protected]
[1] https://git.linaro.org/people/steve.muckle/kernel.git/commit/1b7e57f89f14f7600e75e6fde42bf22d72927b3d
[2] lkml.kernel.org/r/[email protected]
[3] lkml.kernel.org/r/[email protected]
[4] lkml.kernel.org/r/[email protected]
[5] lkml.kernel.org/r/[email protected]
[6] lkml.kernel.org/r/[email protected]

Dietmar Eggemann (1):
cpufreq: Frequency invariant scheduler load-tracking support

Michael Turquette (7):
sched/cpufreq: remove cpufreq_trigger_update()
sched/fair: add margin to utilization update
sched/cpufreq: new cfs capacity margin helpers
cpufreq/schedutil: sysfs capacity margin tunable
sched/cpufreq: pass sched class into cpufreq_update_util
cpufreq/schedutil: sum per-sched class utilization
sched: prefer cpufreq_scale_freq_capacity

drivers/cpufreq/cpufreq.c | 29 ++++++++++++
drivers/cpufreq/cpufreq_governor.c | 5 +-
drivers/cpufreq/cpufreq_schedutil.c | 70 ++++++++++++++++++++++++----
drivers/cpufreq/intel_pstate.c | 5 +-
include/linux/cpufreq.h | 3 ++
include/linux/sched.h | 19 ++++++--
kernel/sched/cpufreq.c | 92 +++++++++++++++++++++++++------------
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 18 +++++++-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 29 +++++++++---
11 files changed, 219 insertions(+), 55 deletions(-)

--
2.1.4

2016-03-14 05:26:19

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

cpufreq_trigger_update() was introduced in "cpufreq: Rework the
scheduler hooks for triggering updates"[0]. Consensus is that this
helper is not needed and removing it will aid in experimenting with
deadline and rt capacity requests.

Instead of reverting the above patch, which includes useful renaming of
data structures and related functions, simply remove the function,
update affected kerneldoc and change rt.c and deadline.c to use
cpufreq_update_util().

[0] lkml.kernel.org/r/[email protected]

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/cpufreq.c | 28 ++--------------------------
kernel/sched/deadline.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 2 --
4 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index eecaba4..bd012c2 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -20,8 +20,8 @@ static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
*
* Set and publish the freq_update_hook pointer for the given CPU. That pointer
* points to a struct freq_update_hook object containing a callback function
- * to call from cpufreq_trigger_update(). That function will be called from
- * an RCU read-side critical section, so it must not sleep.
+ * to call from cpufreq_update_util(). That function will be called from an
+ * RCU read-side critical section, so it must not sleep.
*
* Callers must use RCU-sched callbacks to free any memory that might be
* accessed via the old update_util_data pointer or invoke synchronize_sched()
@@ -87,27 +87,3 @@ void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
if (hook)
hook->func(hook, time, util, max);
}
-
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis. To facilitate
- * that, cpufreq_update_util() is called by update_load_avg() in CFS when
- * executed for the current CPU's runqueue.
- *
- * However, this isn't sufficient to prevent the CPU from being stuck in a
- * completely inadequate performance level for too long, because the calls
- * from CFS will not be made if RT or deadline tasks are active all the time
- * (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-void cpufreq_trigger_update(u64 time)
-{
- cpufreq_update_util(time, ULONG_MAX, 0);
-}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1a035fa..3fd5bc4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);

/*
* Consumed budget is computed considering the time as
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9dd1c09..53ad077 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -947,7 +947,7 @@ static void update_curr_rt(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);

delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7ae012e..f06dfca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1742,9 +1742,7 @@ static inline u64 irq_time_read(int cpu)

#ifdef CONFIG_CPU_FREQ
void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
-void cpufreq_trigger_update(u64 time);
#else
static inline void cpufreq_update_util(u64 time, unsigned long util,
unsigned long max) {}
-static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */
--
2.1.4

2016-03-14 05:26:23

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers

Introduce helper functions that allow cpufreq governors to change the
value of the capacity margin applied to the cfs_rq->avg.util_avg signal.
This allows for run-time tuning of the margin.

A follow-up patch will update the schedutil governor to use these
helpers.

Signed-off-by: Michael Turquette <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/cpufreq.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1fa9b52..f18a99b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2372,6 +2372,9 @@ void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
void (*func)(struct freq_update_hook *hook, u64 time,
unsigned long util, unsigned long max));
void cpufreq_clear_freq_update_hook(int cpu);
+unsigned long cpufreq_get_cfs_capacity_margin(void);
+void cpufreq_set_cfs_capacity_margin(unsigned long margin);
+void cpufreq_reset_cfs_capacity_margin(void);
#endif

#ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index bd012c2..a126b58 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -61,6 +61,59 @@ void cpufreq_clear_freq_update_hook(int cpu)
EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook);

/**
+ * cpufreq_get_cfs_capacity_margin - Get global cfs enqueue capacity margin
+ *
+ * margin is a percentage of capacity that is applied to the current
+ * utilization when selecting a new capacity state or cpu frequency. The value
+ * should be normalized to the range of [0..SCHED_CAPACITY_SCALE], where
+ * SCHED_CAPACITY_SCALE is 100% of the normalized capacity, or equivalent to
+ * multiplying the utilization by one.
+ *
+ * This function returns the current global cfs enqueue capacity margin
+ */
+unsigned long cpufreq_get_cfs_capacity_margin(void)
+{
+ return cfs_capacity_margin;
+}
+EXPORT_SYMBOL_GPL(cpufreq_get_cfs_capacity_margin);
+
+/**
+ * cpufreq_set_cfs_capacity_margin - Set global cfs enqueue capacity margin
+ * @margin: new capacity margin
+ *
+ * margin is a percentage of capacity that is applied to the current
+ * utilization when selecting a new capacity state or cpu frequency. The value
+ * should be normalized to the range of [0..SCHED_CAPACITY_SCALE], where
+ * SCHED_CAPACITY_SCALE is 100% of the normalized capacity, or equivalent to
+ * multiplying the utilization by one.
+ *
+ * For instance, to add a 25% margin to a utilization, margin should be 1280,
+ * which is 1.25x 1024, the default for SCHED_CAPACITY_SCALE.
+ */
+void cpufreq_set_cfs_capacity_margin(unsigned long margin)
+{
+ cfs_capacity_margin = margin;
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_cfs_capacity_margin);
+
+/**
+ * cpufreq_reset_cfs_capacity_margin - Reset global cfs enqueue cap margin
+ *
+ * margin is a percentage of capacity that is applied to the current
+ * utilization when selecting a new capacity state or cpu frequency. The value
+ * should be normalized to the range of [0..SCHED_CAPACITY_SCALE], where
+ * SCHED_CAPACITY_SCALE is 100% of the normalized capacity, or equivalent to
+ * multiplying the utilization by one.
+ *
+ * This function resets the global margin to its default value.
+ */
+void cpufreq_reset_cfs_capacity_margin(void)
+{
+ cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
+}
+EXPORT_SYMBOL_GPL(cpufreq_reset_cfs_capacity_margin);
+
+/**
* cpufreq_update_util - Take a note about CPU utilization changes.
* @time: Current time.
* @util: CPU utilization.
--
2.1.4

2016-03-14 05:26:30

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

cpufreq_update_util() accepts a single utilization value which does not
account for multiple utilization contributions from the cfs, rt & dl
scheduler classes. Begin fixing this by adding a sched_class argument to
cpufreq_update_util(), all of its call sites and the governor-specific
hooks in intel_pstate.c, cpufreq_schedutil.c and cpufreq_governor.c.

A follow-on patch will add summation of the sched_class contributions to
the schedutil governor.

Signed-off-by: Michael Turquette <[email protected]>
---
drivers/cpufreq/cpufreq_governor.c | 5 +++--
drivers/cpufreq/cpufreq_schedutil.c | 6 ++++--
drivers/cpufreq/intel_pstate.c | 5 +++--
include/linux/sched.h | 16 +++++++++++++---
kernel/sched/cpufreq.c | 11 +++++++----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 8 +++++---
9 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 148576c..4694751 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -248,8 +248,9 @@ static void dbs_irq_work(struct irq_work *irq_work)
schedule_work(&policy_dbs->work);
}

-static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time,
- unsigned long util_not_used,
+static void dbs_freq_update_handler(struct freq_update_hook *hook,
+ enum sched_class_util sc_not_used,
+ u64 time, unsigned long util_not_used,
unsigned long max_not_used)
{
struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook);
diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 12e49b9..18d9ca3 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -106,7 +106,8 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(freq, smp_processor_id());
}

-static void sugov_update_single(struct freq_update_hook *hook, u64 time,
+static void sugov_update_single(struct freq_update_hook *hook,
+ enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
@@ -166,7 +167,8 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
return util * max_f / max;
}

-static void sugov_update_shared(struct freq_update_hook *hook, u64 time,
+static void sugov_update_shared(struct freq_update_hook *hook,
+ enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 20e2bb2..86aa368 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1020,8 +1020,9 @@ static inline void intel_pstate_adjust_busy_pstate(struct cpudata *cpu)
sample->freq);
}

-static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time,
- unsigned long util_not_used,
+static void intel_pstate_freq_update(struct freq_update_hook *hook,
+ enum sched_class_util sc_not_used
+ u64 time, unsigned long util_not_used,
unsigned long max_not_used)
{
struct cpudata *cpu = container_of(hook, struct cpudata, update_hook);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f18a99b..1c7d7bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2362,15 +2362,25 @@ extern u64 scheduler_tick_max_deferment(void);
static inline bool sched_can_stop_tick(void) { return false; }
#endif

+enum sched_class_util {
+ cfs_util,
+ rt_util,
+ dl_util,
+ nr_util_types,
+};
+
#ifdef CONFIG_CPU_FREQ
struct freq_update_hook {
- void (*func)(struct freq_update_hook *hook, u64 time,
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class, u64 time,
unsigned long util, unsigned long max);
};

void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
- void (*func)(struct freq_update_hook *hook, u64 time,
- unsigned long util, unsigned long max));
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class,
+ u64 time, unsigned long util,
+ unsigned long max));
void cpufreq_clear_freq_update_hook(int cpu);
unsigned long cpufreq_get_cfs_capacity_margin(void);
void cpufreq_set_cfs_capacity_margin(unsigned long margin);
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index a126b58..87f99a6 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -39,8 +39,10 @@ static void set_freq_update_hook(int cpu, struct freq_update_hook *hook)
* @func: Callback function to use with the new hook.
*/
void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
- void (*func)(struct freq_update_hook *hook, u64 time,
- unsigned long util, unsigned long max))
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class,
+ u64 time, unsigned long util,
+ unsigned long max))
{
if (WARN_ON(!hook || !func))
return;
@@ -124,7 +126,8 @@ EXPORT_SYMBOL_GPL(cpufreq_reset_cfs_capacity_margin);
*
* It can only be called from RCU-sched read-side critical sections.
*/
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max)
{
struct freq_update_hook *hook;

@@ -138,5 +141,5 @@ void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
* may become NULL after the check below.
*/
if (hook)
- hook->func(hook, time, util, max);
+ hook->func(hook, sc, time, util, max);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3fd5bc4..d88ed3f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
+ cpufreq_update_util(dl_util, rq_clock(rq), ULONG_MAX, 0);

/*
* Consumed budget is computed considering the time as
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29e8bae..6b454bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2867,7 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* thread is a different class (!fair), nor will the utilization
* number include things like RT tasks.
*/
- cpufreq_update_util(rq_clock(rq), min(cap, max), max);
+ cpufreq_update_util(cfs_util, rq_clock(rq), min(cap, max), max);
}
}

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 53ad077..9d9dab4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -947,7 +947,7 @@ static void update_curr_rt(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
+ cpufreq_update_util(rt_util, rq_clock(rq), ULONG_MAX, 0);

delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8c93ed2..469d11d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1744,8 +1744,10 @@ static inline u64 irq_time_read(int cpu)
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */

#ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max);
#else
-static inline void cpufreq_update_util(u64 time, unsigned long util,
- unsigned long max) {}
+static inline void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max)
+{}
#endif /* CONFIG_CPU_FREQ */
--
2.1.4

2016-03-14 05:26:41

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

From: Dietmar Eggemann <[email protected]>

Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking.

The factor is:

current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)

In fact, freq_scale should be a struct cpufreq_policy data member. But
this would require that the scheduler hot path (__update_load_avg()) would
have to grab the cpufreq lock. This can be avoided by using per-cpu data
initialized to SCHED_CAPACITY_SCALE for freq_scale.

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Michael Turquette <[email protected]>
---
I'm not as sure about patches 7 & 8, but I included them since I needed
frequency invariance while testing.

As mentioned by myself in 2014 and Rafael last month, the
arch_scale_freq_capacity hook is awkward, because this behavior may vary
within an architecture.

I re-introduce Dietmar's generic cpufreq implementation of the frequency
invariance hook in this patch, and change the preprocessor magic in
sched.h to favor the cpufreq implementation over arch- or
platform-specific ones in the next patch.

If run-time selection of ops is needed them someone will need to write
that code.

I think that this negates the need for the arm arch hooks[0-2], and
hopefully Morten and Dietmar can weigh in on this.

[0] lkml.kernel.org/r/[email protected]
[1] lkml.kernel.org/r/[email protected]
[2] lkml.kernel.org/r/[email protected]

drivers/cpufreq/cpufreq.c | 29 +++++++++++++++++++++++++++++
include/linux/cpufreq.h | 3 +++
2 files changed, 32 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b1ca9c4..e67584f 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -306,6 +306,31 @@ static void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
#endif
}

+/*********************************************************************
+ * FREQUENCY INVARIANT CPU CAPACITY *
+ *********************************************************************/
+
+static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
+
+static void
+scale_freq_capacity(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs)
+{
+ unsigned long cur = freqs ? freqs->new : policy->cur;
+ unsigned long scale = (cur << SCHED_CAPACITY_SHIFT) / policy->max;
+ int cpu;
+
+ pr_debug("cpus %*pbl cur/cur max freq %lu/%u kHz freq scale %lu\n",
+ cpumask_pr_args(policy->cpus), cur, policy->max, scale);
+
+ for_each_cpu(cpu, policy->cpus)
+ per_cpu(freq_scale, cpu) = scale;
+}
+
+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(freq_scale, cpu);
+}
+
static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state)
{
@@ -409,6 +434,8 @@ wait:

spin_unlock(&policy->transition_lock);

+ scale_freq_capacity(policy, freqs);
+
cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
@@ -2125,6 +2152,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
CPUFREQ_NOTIFY, new_policy);

+ scale_freq_capacity(new_policy, NULL);
+
policy->min = new_policy->min;
policy->max = new_policy->max;

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 0e39499..72833be 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -583,4 +583,7 @@ unsigned int cpufreq_generic_get(unsigned int cpu);
int cpufreq_generic_init(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int transition_latency);
+
+struct sched_domain;
+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu);
#endif /* _LINUX_CPUFREQ_H */
--
2.1.4

2016-03-14 05:26:38

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

arch_scale_freq_capacity is weird. It specifies an arch hook for an
implementation that could easily vary within an architecture or even a
chip family.

This patch helps to mitigate this weirdness by defaulting to the
cpufreq-provided implementation, which should work for all cases where
CONFIG_CPU_FREQ is set.

If CONFIG_CPU_FREQ is not set, then try to use an implementation
provided by the architecture. Failing that, fall back to
SCHED_CAPACITY_SCALE.

It may be desirable for cpufreq drivers to specify their own
implementation of arch_scale_freq_capacity in the future. The same is
true for platform code within an architecture. In both cases an
efficient implementation selector will need to be created and this patch
adds a comment to that effect.

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/sched.h | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 469d11d..37502ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
#ifdef CONFIG_SMP
extern void sched_avg_update(struct rq *rq);

-#ifndef arch_scale_freq_capacity
+/*
+ * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
+ * arch code. We select the cpufreq-provided implementation first. If it
+ * doesn't exist then we default to any other implementation provided from
+ * platform/arch code. If those do not exist then we use the default
+ * SCHED_CAPACITY_SCALE value below.
+ *
+ * Note that if cpufreq drivers or platform/arch code have competing
+ * implementations it is up to those subsystems to select one at runtime with
+ * an efficient solution, as we cannot tolerate the overhead of indirect
+ * functions (e.g. function pointers) in the scheduler fast path
+ */
+#ifdef CONFIG_CPU_FREQ
+#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
+#elif !defined(arch_scale_freq_capacity)
static __always_inline
unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
--
2.1.4

2016-03-14 05:27:30

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

Patch, "sched/cpufreq: pass sched class into cpufreq_update_util" made
it possible for calls of cpufreq_update_util() to specify scheduler
class, particularly cfs, rt & dl.

Update the schedutil governor to store these individual utilizations per
cpu and sum them to create a total utilization contribution.

Signed-off-by: Michael Turquette <[email protected]>
---
drivers/cpufreq/cpufreq_schedutil.c | 39 +++++++++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 18d9ca3..b9234e1 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -46,8 +46,10 @@ struct sugov_cpu {
struct freq_update_hook update_hook;
struct sugov_policy *sg_policy;

+ unsigned long util[nr_util_types];
+ unsigned long total_util;
+
/* The fields below are only needed when sharing a policy. */
- unsigned long util;
unsigned long max;
u64 last_update;
};
@@ -106,6 +108,18 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(freq, smp_processor_id());
}

+static unsigned long sugov_sum_total_util(struct sugov_cpu *sg_cpu)
+{
+ enum sched_class_util sc;
+
+ /* sum the utilization of all sched classes */
+ sg_cpu->total_util = 0;
+ for (sc = 0; sc < nr_util_types; sc++)
+ sg_cpu->total_util += sg_cpu->util[sc];
+
+ return sg_cpu->total_util;
+}
+
static void sugov_update_single(struct freq_update_hook *hook,
enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
@@ -113,12 +127,17 @@ static void sugov_update_single(struct freq_update_hook *hook,
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned int max_f, next_f;
+ unsigned long total_util;

if (!sugov_should_update_freq(sg_policy, time))
return;

+ /* update per-sched_class utilization for this cpu */
+ sg_cpu->util[sc] = util;
+ total_util = sugov_sum_total_util(sg_cpu);
+
max_f = sg_policy->max_freq;
- next_f = util > max ? max_f : util * max_f / max;
+ next_f = total_util > max ? max_f : total_util * max_f / max;
sugov_update_commit(sg_policy, time, next_f);
}

@@ -153,7 +172,7 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
if ((s64)delta_ns > NSEC_PER_SEC / HZ)
continue;

- j_util = j_sg_cpu->util;
+ j_util = j_sg_cpu->total_util;
j_max = j_sg_cpu->max;
if (j_util > j_max)
return max_f;
@@ -174,15 +193,19 @@ static void sugov_update_shared(struct freq_update_hook *hook,
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned int next_f;
+ unsigned long total_util;

raw_spin_lock(&sg_policy->update_lock);

- sg_cpu->util = util;
+ sg_cpu->util[sc] = util;
sg_cpu->max = max;
sg_cpu->last_update = time;

+ /* update per-sched_class utilization for this cpu */
+ total_util = sugov_sum_total_util(sg_cpu);
+
if (sugov_should_update_freq(sg_policy, time)) {
- next_f = sugov_next_freq(sg_policy, util, max);
+ next_f = sugov_next_freq(sg_policy, total_util, max);
sugov_update_commit(sg_policy, time, next_f);
}

@@ -423,6 +446,7 @@ static int sugov_start(struct cpufreq_policy *policy)
{
struct sugov_policy *sg_policy = policy->governor_data;
unsigned int cpu;
+ enum sched_class_util sc;

sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
sg_policy->last_freq_update_time = 0;
@@ -434,8 +458,11 @@ static int sugov_start(struct cpufreq_policy *policy)
struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

sg_cpu->sg_policy = sg_policy;
+ for (sc = 0; sc < nr_util_types; sc++) {
+ sg_cpu->util[sc] = ULONG_MAX;
+ sg_cpu->total_util = ULONG_MAX;
+ }
if (policy_is_shared(policy)) {
- sg_cpu->util = ULONG_MAX;
sg_cpu->max = 0;
sg_cpu->last_update = 0;
cpufreq_set_freq_update_hook(cpu, &sg_cpu->update_hook,
--
2.1.4

2016-03-14 05:27:52

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

With the addition of the global cfs capacity margin helpers in patch,
"sched/cpufreq: new cfs capacity margin helpers", we can now export
sysfs tunables from the schedutil governor. This allows privileged users
to tune the value more easily.

The margin value is global to cfs, not per-policy. As such schedutil
does not store any state about the margin. Schedutil restores the margin
value to its default value when exiting.

Signed-off-by: Michael Turquette <[email protected]>
---
drivers/cpufreq/cpufreq_schedutil.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 5aa26bf..12e49b9 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -246,8 +246,32 @@ static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *bu

static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);

+static ssize_t capacity_margin_show(struct gov_attr_set *not_used,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", cpufreq_get_cfs_capacity_margin());
+}
+
+static ssize_t capacity_margin_store(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
+{
+ unsigned long margin;
+ int ret;
+
+ ret = sscanf(buf, "%lu", &margin);
+ if (ret != 1)
+ return -EINVAL;
+
+ cpufreq_set_cfs_capacity_margin(margin);
+
+ return count;
+}
+
+static struct governor_attr capacity_margin = __ATTR_RW(capacity_margin);
+
static struct attribute *sugov_attributes[] = {
&rate_limit_us.attr,
+ &capacity_margin.attr,
NULL
};

@@ -381,6 +405,7 @@ static int sugov_exit(struct cpufreq_policy *policy)

mutex_lock(&global_tunables_lock);

+ cpufreq_reset_cfs_capacity_margin();
count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
policy->governor_data = NULL;
if (!count)
--
2.1.4

2016-03-14 05:28:14

by Michael Turquette

[permalink] [raw]

Subject: [PATCH 2/8] sched/fair: add margin to utilization update

Utilization contributions to cfs_rq->avg.util_avg are scaled for both
microarchitecture-invariance as well as frequency-invariance. This means
that any given utilization contribution will be scaled against the
current cpu capacity (cpu frequency). Contributions from long running
tasks, whose utilization grows larger over time, will asymptotically
approach the current capacity.

This causes a problem when using this utilization signal to select a
target cpu capacity (cpu frequency), as our signal will never exceed the
current capacity, which would otherwise be our signal to increase
frequency.

Solve this by introducing a default capacity margin that is added to the
utilization signal when requesting a change to capacity (cpu frequency).
The margin is 1280, or 1.25 x SCHED_CAPACITY_SCALE (1024). This is
equivalent to similar margins such as the default 125 value assigned to
struct sched_domain.imbalance_pct for load balancing, and to the 80%
up_threshold used by the legacy cpufreq ondemand governor.

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++--
kernel/sched/sched.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a32f281..29e8bae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -100,6 +100,19 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
*/
unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;

+/*
+ * Add a 25% margin globally to all capacity requests from cfs. This is
+ * equivalent to an 80% up_threshold in legacy governors like ondemand.
+ *
+ * This is required as task utilization increases. The frequency-invariant
+ * utilization will asymptotically approach the current capacity of the cpu and
+ * the additional margin will cross the threshold into the next capacity state.
+ *
+ * XXX someday expand to separate, per-call site margins? e.g. enqueue, fork,
+ * task_tick, load_balance, etc
+ */
+unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
+
#ifdef CONFIG_CFS_BANDWIDTH
/*
* Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
@@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)

if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
unsigned long max = rq->cpu_capacity_orig;
+ unsigned long cap = cfs_rq->avg.util_avg *
+ cfs_capacity_margin / max;

/*
* There are a few boundary cases this might miss but it should
@@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* thread is a different class (!fair), nor will the utilization
* number include things like RT tasks.
*/
- cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+ cpufreq_update_util(rq_clock(rq), min(cap, max), max);
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f06dfca..8c93ed2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -27,6 +27,9 @@ extern __read_mostly int scheduler_running;
extern unsigned long calc_load_update;
extern atomic_long_t calc_load_tasks;

+#define CAPACITY_MARGIN_DEFAULT 1280;
+extern unsigned long cfs_capacity_margin;
+
extern void calc_global_load_tick(struct rq *this_rq);
extern long calc_load_fold_active(struct rq *this_rq);

--
2.1.4

2016-03-15 19:13:52

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Hi Mike,

On 14/03/16 05:22, Michael Turquette wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
> frequency scaling correction factor for more accurate load-tracking.
>
> The factor is:
>
> current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
>
> In fact, freq_scale should be a struct cpufreq_policy data member. But
> this would require that the scheduler hot path (__update_load_avg()) would
> have to grab the cpufreq lock. This can be avoided by using per-cpu data
> initialized to SCHED_CAPACITY_SCALE for freq_scale.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Michael Turquette <[email protected]>
> ---
> I'm not as sure about patches 7 & 8, but I included them since I needed
> frequency invariance while testing.
>
> As mentioned by myself in 2014 and Rafael last month, the
> arch_scale_freq_capacity hook is awkward, because this behavior may vary
> within an architecture.
>
> I re-introduce Dietmar's generic cpufreq implementation of the frequency
> invariance hook in this patch, and change the preprocessor magic in
> sched.h to favor the cpufreq implementation over arch- or
> platform-specific ones in the next patch.

Maybe it is worth mentioning that this patch is from EAS RFC5.2
(linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
(FEI) based on the cpufreq notifier calls (cpufreq_callback,
cpufreq_policy_callback) in the ARM arch code.

> If run-time selection of ops is needed them someone will need to write
> that code.

Right now I see 3 different implementations of the FEI. 1) The X86
aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
ARM64) code.

I guess with sched_util we do need a solution for all platforms
(different archs, x86 w/ and w/o X86_FEATURE_APERFMPERF, ...).

> I think that this negates the need for the arm arch hooks[0-2], and
> hopefully Morten and Dietmar can weigh in on this.

It's true that we tried to get rid of the usage of the cpufreq callbacks
(cpufreq_callback, cpufreq_policy_callback) with this patch. Plus we
didn't want to implement it twice (for ARM and ARM64).

But 2) would have to work for other ARCHs as well. Maybe as a fall-back
for X86 w/o X86_FEATURE_APERFMPERF feature?

[...]

2016-03-15 19:14:06

by Dietmar Eggemann

[permalink] [raw]

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On 14/03/16 05:22, Michael Turquette wrote:
> arch_scale_freq_capacity is weird. It specifies an arch hook for an
> implementation that could easily vary within an architecture or even a
> chip family.
>
> This patch helps to mitigate this weirdness by defaulting to the
> cpufreq-provided implementation, which should work for all cases where
> CONFIG_CPU_FREQ is set.
>
> If CONFIG_CPU_FREQ is not set, then try to use an implementation
> provided by the architecture. Failing that, fall back to
> SCHED_CAPACITY_SCALE.
>
> It may be desirable for cpufreq drivers to specify their own
> implementation of arch_scale_freq_capacity in the future. The same is
> true for platform code within an architecture. In both cases an
> efficient implementation selector will need to be created and this patch
> adds a comment to that effect.

For me this independence of the scheduler code towards the actual
implementation of the Frequency Invariant Engine (FEI) was actually a
feature.

In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
which hasn't been posted to LKML) we establish the link in the ARCH code
(arch/arm64/include/asm/topology.h).

#ifdef CONFIG_CPU_FREQ
#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
...
+#endif

>
> Signed-off-by: Michael Turquette <[email protected]>
> ---
> kernel/sched/sched.h | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 469d11d..37502ea 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> #ifdef CONFIG_SMP
> extern void sched_avg_update(struct rq *rq);
>
> -#ifndef arch_scale_freq_capacity
> +/*
> + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> + * arch code. We select the cpufreq-provided implementation first. If it
> + * doesn't exist then we default to any other implementation provided from
> + * platform/arch code. If those do not exist then we use the default
> + * SCHED_CAPACITY_SCALE value below.
> + *
> + * Note that if cpufreq drivers or platform/arch code have competing
> + * implementations it is up to those subsystems to select one at runtime with
> + * an efficient solution, as we cannot tolerate the overhead of indirect
> + * functions (e.g. function pointers) in the scheduler fast path
> + */
> +#ifdef CONFIG_CPU_FREQ
> +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> +#elif !defined(arch_scale_freq_capacity)
> static __always_inline
> unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> {
>

2016-03-15 20:19:24

by Michael Turquette

[permalink] [raw]

Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Quoting Dietmar Eggemann (2016-03-15 12:13:46)
> Hi Mike,
>
> On 14/03/16 05:22, Michael Turquette wrote:
> > From: Dietmar Eggemann <[email protected]>
> >
> > Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
> > frequency scaling correction factor for more accurate load-tracking.
> >
> > The factor is:
> >
> > current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> >
> > In fact, freq_scale should be a struct cpufreq_policy data member. But
> > this would require that the scheduler hot path (__update_load_avg()) would
> > have to grab the cpufreq lock. This can be avoided by using per-cpu data
> > initialized to SCHED_CAPACITY_SCALE for freq_scale.
> >
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > Signed-off-by: Michael Turquette <[email protected]>
> > ---
> > I'm not as sure about patches 7 & 8, but I included them since I needed
> > frequency invariance while testing.
> >
> > As mentioned by myself in 2014 and Rafael last month, the
> > arch_scale_freq_capacity hook is awkward, because this behavior may vary
> > within an architecture.
> >
> > I re-introduce Dietmar's generic cpufreq implementation of the frequency
> > invariance hook in this patch, and change the preprocessor magic in
> > sched.h to favor the cpufreq implementation over arch- or
> > platform-specific ones in the next patch.
>
> Maybe it is worth mentioning that this patch is from EAS RFC5.2
> (linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
> posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
> (FEI) based on the cpufreq notifier calls (cpufreq_callback,
> cpufreq_policy_callback) in the ARM arch code.

Oops, my apologies. I got a little mixed up while developing these
patches and I should have at least asked you about this one before
posting.

I'm really quite happy to drop #7 and #8 if they are too contentious or
if patch #7 is deemed as not-ready by you.

>
> > If run-time selection of ops is needed them someone will need to write
> > that code.
>
> Right now I see 3 different implementations of the FEI. 1) The X86
> aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
> in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
> ARM64) code.
>
> I guess with sched_util we do need a solution for all platforms
> (different archs, x86 w/ and w/o X86_FEATURE_APERFMPERF, ...).
>
> > I think that this negates the need for the arm arch hooks[0-2], and
> > hopefully Morten and Dietmar can weigh in on this.
>
> It's true that we tried to get rid of the usage of the cpufreq callbacks
> (cpufreq_callback, cpufreq_policy_callback) with this patch. Plus we
> didn't want to implement it twice (for ARM and ARM64).
>
> But 2) would have to work for other ARCHs as well. Maybe as a fall-back
> for X86 w/o X86_FEATURE_APERFMPERF feature?

That's what I had in mind. I guess that some day there will be a need to
select implementations at run-time for both cpufreq (e.g. different
cpufreq drivers might implement arch_scale_freq_capacity) and for the
!CONFIG_CPU_FREQ case (e.g. different platforms might implement
arch_scale_freq_capcity within the same arch).

The cpufreq approach seems the most generic, hence patch #8 to make it
the default.

Regards,
Mike

>
> [...]

2016-03-15 20:46:36

by Michael Turquette

[permalink] [raw]

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Quoting Dietmar Eggemann (2016-03-15 12:13:58)
> On 14/03/16 05:22, Michael Turquette wrote:
> > arch_scale_freq_capacity is weird. It specifies an arch hook for an
> > implementation that could easily vary within an architecture or even a
> > chip family.
> >
> > This patch helps to mitigate this weirdness by defaulting to the
> > cpufreq-provided implementation, which should work for all cases where
> > CONFIG_CPU_FREQ is set.
> >
> > If CONFIG_CPU_FREQ is not set, then try to use an implementation
> > provided by the architecture. Failing that, fall back to
> > SCHED_CAPACITY_SCALE.
> >
> > It may be desirable for cpufreq drivers to specify their own
> > implementation of arch_scale_freq_capacity in the future. The same is
> > true for platform code within an architecture. In both cases an
> > efficient implementation selector will need to be created and this patch
> > adds a comment to that effect.
>
> For me this independence of the scheduler code towards the actual
> implementation of the Frequency Invariant Engine (FEI) was actually a
> feature.

I do not agree that it is a strength; I think it is confusing. My
opinion is that cpufreq drivers should implement
arch_scale_freq_capacity. Having a sane fallback
(cpufreq_scale_freq_capacity) simply means that you can remove the
boilerplate from the arm32 and arm64 code, which is a win.

Furthermore, if we have multiple competing implementations of
arch_scale_freq_invariance, wouldn't it be better for all of them to
live in cpufreq drivers? This means we would only need to implement a
single run-time "selector".

On the other hand, if the implementation lives in arch code and we have
various implementations of arch_scale_freq_capacity within an
architecture, then each arch would need to implement this selector
function. Even worse then if we have a split where some implementations
live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and
others in arch/arm64 ... now we have three selectors.

Note that this has nothing to do with cpu microarch invariance. I'm
happy for that to stay in arch code because we can have heterogeneous
cpus that do not scale frequency, and thus would not enable cpufreq.
But if your platform scales cpu frequency, then really cpufreq should be
in the loop.

>
> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
> which hasn't been posted to LKML) we establish the link in the ARCH code
> (arch/arm64/include/asm/topology.h).

Right, sorry again about preemptively posting the patch. Total brainfart
on my part.

>
> #ifdef CONFIG_CPU_FREQ
> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> ...
> +#endif

The above is no longer necessary with this patch. Same question as
above: why insist on the arch boilerplate?

Regards,
Mike

>
> >
> > Signed-off-by: Michael Turquette <[email protected]>
> > ---
> > kernel/sched/sched.h | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 469d11d..37502ea 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> > #ifdef CONFIG_SMP
> > extern void sched_avg_update(struct rq *rq);
> >
> > -#ifndef arch_scale_freq_capacity
> > +/*
> > + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> > + * arch code. We select the cpufreq-provided implementation first. If it
> > + * doesn't exist then we default to any other implementation provided from
> > + * platform/arch code. If those do not exist then we use the default
> > + * SCHED_CAPACITY_SCALE value below.
> > + *
> > + * Note that if cpufreq drivers or platform/arch code have competing
> > + * implementations it is up to those subsystems to select one at runtime with
> > + * an efficient solution, as we cannot tolerate the overhead of indirect
> > + * functions (e.g. function pointers) in the scheduler fast path
> > + */
> > +#ifdef CONFIG_CPU_FREQ
> > +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> > +#elif !defined(arch_scale_freq_capacity)
> > static __always_inline
> > unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> > {
> >

2016-03-15 21:15:20

Subject: [PATCH 0/8] schedutil enhancements

Subject: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

Subject: [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers

Subject: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Subject: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

Subject: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: [PATCH 2/8] sched/fair: add margin to utilization update

Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

Subject: Re: [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers

Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

Subject: Re: [PATCH 0/8] schedutil enhancements

Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable