2016-03-14 05:26:15

by Michael Turquette

Subject: [PATCH 0/8] schedutil enhancements

I'm happy that scheduler-driven cpu frequency selection is getting some
attention. Rafael's recent schedutil governor is a step in the right direction.
This series builds on top of Rafael's schedutil governor, bringing it to parity
with some of the features in the schedfreq series posted by Steve[0], as well
as adding a couple of new things.

Patch 1 removes cpufreq_trigger_update().

Patches 2-4 move the cfs capacity margin out of the governor and into
cfs. This value is made tunable by a sysfs control in schedutil.

Patches 5-6 make cpufreq_update_util() aware of multiple scheduler
classes (cfs, rt & dl), and add storage & summation of these per-class
utilization values into schedutil.

Patches 7-8 introduce Dietmar's generic cpufreq implementation[1] of the
frequency invariance hook and change the preprocessor magic in sched.h to
favor the cpufreq implementation over arch- or platform-specific ones.

If accepted, this series makes it trivial to port Steve and Juri's fine-grained
frequency selection in cfs and Vincent's rt utilization patch to the schedutil
governor.[2-6]

[0] lkml.kernel.org/r/[email protected]
[1] https://git.linaro.org/people/steve.muckle/kernel.git/commit/1b7e57f89f14f7600e75e6fde42bf22d72927b3d
[2] lkml.kernel.org/r/[email protected]
[3] lkml.kernel.org/r/[email protected]
[4] lkml.kernel.org/r/[email protected]
[5] lkml.kernel.org/r/[email protected]
[6] lkml.kernel.org/r/[email protected]

Dietmar Eggemann (1):
cpufreq: Frequency invariant scheduler load-tracking support

Michael Turquette (7):
sched/cpufreq: remove cpufreq_trigger_update()
sched/fair: add margin to utilization update
sched/cpufreq: new cfs capacity margin helpers
cpufreq/schedutil: sysfs capacity margin tunable
sched/cpufreq: pass sched class into cpufreq_update_util
cpufreq/schedutil: sum per-sched class utilization
sched: prefer cpufreq_scale_freq_capacity

drivers/cpufreq/cpufreq.c | 29 ++++++++++++
drivers/cpufreq/cpufreq_governor.c | 5 +-
drivers/cpufreq/cpufreq_schedutil.c | 70 ++++++++++++++++++++++++----
drivers/cpufreq/intel_pstate.c | 5 +-
include/linux/cpufreq.h | 3 ++
include/linux/sched.h | 19 ++++++--
kernel/sched/cpufreq.c | 92 +++++++++++++++++++++++++------------
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 18 +++++++-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 29 +++++++++---
11 files changed, 219 insertions(+), 55 deletions(-)

--
2.1.4


2016-03-14 05:26:19

by Michael Turquette

Subject: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

cpufreq_trigger_update() was introduced in "cpufreq: Rework the
scheduler hooks for triggering updates"[0]. Consensus is that this
helper is not needed and removing it will aid in experimenting with
deadline and rt capacity requests.

Instead of reverting the above patch, which includes useful renaming of
data structures and related functions, simply remove the function,
update affected kerneldoc and change rt.c and deadline.c to use
cpufreq_update_util().

[0] lkml.kernel.org/r/[email protected]

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/cpufreq.c | 28 ++--------------------------
kernel/sched/deadline.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 2 --
4 files changed, 4 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index eecaba4..bd012c2 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -20,8 +20,8 @@ static DEFINE_PER_CPU(struct freq_update_hook *, cpufreq_freq_update_hook);
*
* Set and publish the freq_update_hook pointer for the given CPU. That pointer
* points to a struct freq_update_hook object containing a callback function
- * to call from cpufreq_trigger_update(). That function will be called from
- * an RCU read-side critical section, so it must not sleep.
+ * to call from cpufreq_update_util(). That function will be called from an
+ * RCU read-side critical section, so it must not sleep.
*
* Callers must use RCU-sched callbacks to free any memory that might be
* accessed via the old update_util_data pointer or invoke synchronize_sched()
@@ -87,27 +87,3 @@ void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
if (hook)
hook->func(hook, time, util, max);
}
-
-/**
- * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
- * @time: Current time.
- *
- * The way cpufreq is currently arranged requires it to evaluate the CPU
- * performance state (frequency/voltage) on a regular basis. To facilitate
- * that, cpufreq_update_util() is called by update_load_avg() in CFS when
- * executed for the current CPU's runqueue.
- *
- * However, this isn't sufficient to prevent the CPU from being stuck in a
- * completely inadequate performance level for too long, because the calls
- * from CFS will not be made if RT or deadline tasks are active all the time
- * (or there are RT and DL tasks only).
- *
- * As a workaround for that issue, this function is called by the RT and DL
- * sched classes to trigger extra cpufreq updates to prevent it from stalling,
- * but that really is a band-aid. Going forward it should be replaced with
- * solutions targeted more specifically at RT and DL tasks.
- */
-void cpufreq_trigger_update(u64 time)
-{
- cpufreq_update_util(time, ULONG_MAX, 0);
-}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1a035fa..3fd5bc4 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);

/*
* Consumed budget is computed considering the time as
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9dd1c09..53ad077 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -947,7 +947,7 @@ static void update_curr_rt(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_trigger_update(rq_clock(rq));
+ cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);

delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7ae012e..f06dfca 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1742,9 +1742,7 @@ static inline u64 irq_time_read(int cpu)

#ifdef CONFIG_CPU_FREQ
void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
-void cpufreq_trigger_update(u64 time);
#else
static inline void cpufreq_update_util(u64 time, unsigned long util,
unsigned long max) {}
-static inline void cpufreq_trigger_update(u64 time) {}
#endif /* CONFIG_CPU_FREQ */
--
2.1.4

2016-03-14 05:26:23

by Michael Turquette

Subject: [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers

Introduce helper functions that allow cpufreq governors to change the
value of the capacity margin applied to the cfs_rq->avg.util_avg signal.
This allows for run-time tuning of the margin.

A follow-up patch will update the schedutil governor to use these
helpers.

Signed-off-by: Michael Turquette <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/cpufreq.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 56 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1fa9b52..f18a99b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2372,6 +2372,9 @@ void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
void (*func)(struct freq_update_hook *hook, u64 time,
unsigned long util, unsigned long max));
void cpufreq_clear_freq_update_hook(int cpu);
+unsigned long cpufreq_get_cfs_capacity_margin(void);
+void cpufreq_set_cfs_capacity_margin(unsigned long margin);
+void cpufreq_reset_cfs_capacity_margin(void);
#endif

#ifdef CONFIG_SCHED_AUTOGROUP
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index bd012c2..a126b58 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -61,6 +61,59 @@ void cpufreq_clear_freq_update_hook(int cpu)
EXPORT_SYMBOL_GPL(cpufreq_clear_freq_update_hook);

/**
+ * cpufreq_get_cfs_capacity_margin - Get global cfs enqueue capacity margin
+ *
+ * The margin is a multiplier applied to the current utilization when
+ * selecting a new capacity state or cpu frequency. The value is normalized
+ * against SCHED_CAPACITY_SCALE, where SCHED_CAPACITY_SCALE represents 100%
+ * of the normalized capacity, i.e. is equivalent to multiplying the
+ * utilization by one.
+ *
+ * This function returns the current global cfs enqueue capacity margin.
+ */
+unsigned long cpufreq_get_cfs_capacity_margin(void)
+{
+ return cfs_capacity_margin;
+}
+EXPORT_SYMBOL_GPL(cpufreq_get_cfs_capacity_margin);
+
+/**
+ * cpufreq_set_cfs_capacity_margin - Set global cfs enqueue capacity margin
+ * @margin: new capacity margin
+ *
+ * The margin is a multiplier applied to the current utilization when
+ * selecting a new capacity state or cpu frequency. The value is normalized
+ * against SCHED_CAPACITY_SCALE, where SCHED_CAPACITY_SCALE represents 100%
+ * of the normalized capacity, i.e. is equivalent to multiplying the
+ * utilization by one.
+ *
+ * For instance, to add a 25% margin to a utilization, margin should be 1280,
+ * which is 1.25 x 1024, the default value of SCHED_CAPACITY_SCALE.
+ */
+void cpufreq_set_cfs_capacity_margin(unsigned long margin)
+{
+ cfs_capacity_margin = margin;
+}
+EXPORT_SYMBOL_GPL(cpufreq_set_cfs_capacity_margin);
+
+/**
+ * cpufreq_reset_cfs_capacity_margin - Reset global cfs enqueue cap margin
+ *
+ * The margin is a multiplier applied to the current utilization when
+ * selecting a new capacity state or cpu frequency. The value is normalized
+ * against SCHED_CAPACITY_SCALE, where SCHED_CAPACITY_SCALE represents 100%
+ * of the normalized capacity, i.e. is equivalent to multiplying the
+ * utilization by one.
+ *
+ * This function resets the global margin to its default value.
+ */
+void cpufreq_reset_cfs_capacity_margin(void)
+{
+ cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
+}
+EXPORT_SYMBOL_GPL(cpufreq_reset_cfs_capacity_margin);
+
+/**
* cpufreq_update_util - Take a note about CPU utilization changes.
* @time: Current time.
* @util: CPU utilization.
--
2.1.4

2016-03-14 05:26:30

by Michael Turquette

Subject: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

cpufreq_update_util() accepts a single utilization value which does not
account for multiple utilization contributions from the cfs, rt & dl
scheduler classes. Begin fixing this by adding a sched_class argument to
cpufreq_update_util(), all of its call sites and the governor-specific
hooks in intel_pstate.c, cpufreq_schedutil.c and cpufreq_governor.c.

A follow-on patch will add summation of the sched_class contributions to
the schedutil governor.

Signed-off-by: Michael Turquette <[email protected]>
---
drivers/cpufreq/cpufreq_governor.c | 5 +++--
drivers/cpufreq/cpufreq_schedutil.c | 6 ++++--
drivers/cpufreq/intel_pstate.c | 5 +++--
include/linux/sched.h | 16 +++++++++++++---
kernel/sched/cpufreq.c | 11 +++++++----
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 8 +++++---
9 files changed, 38 insertions(+), 19 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_governor.c b/drivers/cpufreq/cpufreq_governor.c
index 148576c..4694751 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -248,8 +248,9 @@ static void dbs_irq_work(struct irq_work *irq_work)
schedule_work(&policy_dbs->work);
}

-static void dbs_freq_update_handler(struct freq_update_hook *hook, u64 time,
- unsigned long util_not_used,
+static void dbs_freq_update_handler(struct freq_update_hook *hook,
+ enum sched_class_util sc_not_used,
+ u64 time, unsigned long util_not_used,
unsigned long max_not_used)
{
struct cpu_dbs_info *cdbs = container_of(hook, struct cpu_dbs_info, update_hook);
diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 12e49b9..18d9ca3 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -106,7 +106,8 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(freq, smp_processor_id());
}

-static void sugov_update_single(struct freq_update_hook *hook, u64 time,
+static void sugov_update_single(struct freq_update_hook *hook,
+ enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
@@ -166,7 +167,8 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
return util * max_f / max;
}

-static void sugov_update_shared(struct freq_update_hook *hook, u64 time,
+static void sugov_update_shared(struct freq_update_hook *hook,
+ enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
{
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 20e2bb2..86aa368 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1020,8 +1020,9 @@ static inline void intel_pstate_adjust_busy_pstate(struct cpudata *cpu)
sample->freq);
}

-static void intel_pstate_freq_update(struct freq_update_hook *hook, u64 time,
- unsigned long util_not_used,
+static void intel_pstate_freq_update(struct freq_update_hook *hook,
+ enum sched_class_util sc_not_used,
+ u64 time, unsigned long util_not_used,
unsigned long max_not_used)
{
struct cpudata *cpu = container_of(hook, struct cpudata, update_hook);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f18a99b..1c7d7bd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2362,15 +2362,25 @@ extern u64 scheduler_tick_max_deferment(void);
static inline bool sched_can_stop_tick(void) { return false; }
#endif

+enum sched_class_util {
+ cfs_util,
+ rt_util,
+ dl_util,
+ nr_util_types,
+};
+
#ifdef CONFIG_CPU_FREQ
struct freq_update_hook {
- void (*func)(struct freq_update_hook *hook, u64 time,
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class, u64 time,
unsigned long util, unsigned long max);
};

void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
- void (*func)(struct freq_update_hook *hook, u64 time,
- unsigned long util, unsigned long max));
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class,
+ u64 time, unsigned long util,
+ unsigned long max));
void cpufreq_clear_freq_update_hook(int cpu);
unsigned long cpufreq_get_cfs_capacity_margin(void);
void cpufreq_set_cfs_capacity_margin(unsigned long margin);
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index a126b58..87f99a6 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -39,8 +39,10 @@ static void set_freq_update_hook(int cpu, struct freq_update_hook *hook)
* @func: Callback function to use with the new hook.
*/
void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
- void (*func)(struct freq_update_hook *hook, u64 time,
- unsigned long util, unsigned long max))
+ void (*func)(struct freq_update_hook *hook,
+ enum sched_class_util sched_class,
+ u64 time, unsigned long util,
+ unsigned long max))
{
if (WARN_ON(!hook || !func))
return;
@@ -124,7 +126,8 @@ EXPORT_SYMBOL_GPL(cpufreq_reset_cfs_capacity_margin);
*
* It can only be called from RCU-sched read-side critical sections.
*/
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
+void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max)
{
struct freq_update_hook *hook;

@@ -138,5 +141,5 @@ void cpufreq_update_util(u64 time, unsigned long util, unsigned long max)
* may become NULL after the check below.
*/
if (hook)
- hook->func(hook, time, util, max);
+ hook->func(hook, sc, time, util, max);
}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 3fd5bc4..d88ed3f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
+ cpufreq_update_util(dl_util, rq_clock(rq), ULONG_MAX, 0);

/*
* Consumed budget is computed considering the time as
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29e8bae..6b454bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2867,7 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* thread is a different class (!fair), nor will the utilization
* number include things like RT tasks.
*/
- cpufreq_update_util(rq_clock(rq), min(cap, max), max);
+ cpufreq_update_util(cfs_util, rq_clock(rq), min(cap, max), max);
}
}

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 53ad077..9d9dab4 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -947,7 +947,7 @@ static void update_curr_rt(struct rq *rq)

/* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
if (cpu_of(rq) == smp_processor_id())
- cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
+ cpufreq_update_util(rt_util, rq_clock(rq), ULONG_MAX, 0);

delta_exec = rq_clock_task(rq) - curr->se.exec_start;
if (unlikely((s64)delta_exec <= 0))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8c93ed2..469d11d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1744,8 +1744,10 @@ static inline u64 irq_time_read(int cpu)
#endif /* CONFIG_IRQ_TIME_ACCOUNTING */

#ifdef CONFIG_CPU_FREQ
-void cpufreq_update_util(u64 time, unsigned long util, unsigned long max);
+void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max);
#else
-static inline void cpufreq_update_util(u64 time, unsigned long util,
- unsigned long max) {}
+static inline void cpufreq_update_util(enum sched_class_util sc, u64 time,
+ unsigned long util, unsigned long max)
+{}
#endif /* CONFIG_CPU_FREQ */
--
2.1.4

2016-03-14 05:26:41

by Michael Turquette

Subject: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

From: Dietmar Eggemann <[email protected]>

Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking.

The factor is:

current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)

In fact, freq_scale should be a struct cpufreq_policy data member. But
this would require that the scheduler hot path (__update_load_avg()) would
have to grab the cpufreq lock. This can be avoided by using per-cpu data
initialized to SCHED_CAPACITY_SCALE for freq_scale.

Signed-off-by: Dietmar Eggemann <[email protected]>
Signed-off-by: Michael Turquette <[email protected]>
---
I'm not as sure about patches 7 & 8, but I included them since I needed
frequency invariance while testing.

As mentioned by myself in 2014 and Rafael last month, the
arch_scale_freq_capacity hook is awkward, because this behavior may vary
within an architecture.

I re-introduce Dietmar's generic cpufreq implementation of the frequency
invariance hook in this patch, and change the preprocessor magic in
sched.h to favor the cpufreq implementation over arch- or
platform-specific ones in the next patch.

If run-time selection of ops is needed then someone will need to write
that code.

I think that this negates the need for the arm arch hooks[0-2], and
hopefully Morten and Dietmar can weigh in on this.

[0] lkml.kernel.org/r/[email protected]
[1] lkml.kernel.org/r/[email protected]
[2] lkml.kernel.org/r/[email protected]

drivers/cpufreq/cpufreq.c | 29 +++++++++++++++++++++++++++++
include/linux/cpufreq.h | 3 +++
2 files changed, 32 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b1ca9c4..e67584f 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -306,6 +306,31 @@ static void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
#endif
}

+/*********************************************************************
+ * FREQUENCY INVARIANT CPU CAPACITY *
+ *********************************************************************/
+
+static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
+
+static void
+scale_freq_capacity(struct cpufreq_policy *policy, struct cpufreq_freqs *freqs)
+{
+ unsigned long cur = freqs ? freqs->new : policy->cur;
+ unsigned long scale = (cur << SCHED_CAPACITY_SHIFT) / policy->max;
+ int cpu;
+
+ pr_debug("cpus %*pbl cur/cur max freq %lu/%u kHz freq scale %lu\n",
+ cpumask_pr_args(policy->cpus), cur, policy->max, scale);
+
+ for_each_cpu(cpu, policy->cpus)
+ per_cpu(freq_scale, cpu) = scale;
+}
+
+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+ return per_cpu(freq_scale, cpu);
+}
+
static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
struct cpufreq_freqs *freqs, unsigned int state)
{
@@ -409,6 +434,8 @@ wait:

spin_unlock(&policy->transition_lock);

+ scale_freq_capacity(policy, freqs);
+
cpufreq_notify_transition(policy, freqs, CPUFREQ_PRECHANGE);
}
EXPORT_SYMBOL_GPL(cpufreq_freq_transition_begin);
@@ -2125,6 +2152,8 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
CPUFREQ_NOTIFY, new_policy);

+ scale_freq_capacity(new_policy, NULL);
+
policy->min = new_policy->min;
policy->max = new_policy->max;

diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 0e39499..72833be 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -583,4 +583,7 @@ unsigned int cpufreq_generic_get(unsigned int cpu);
int cpufreq_generic_init(struct cpufreq_policy *policy,
struct cpufreq_frequency_table *table,
unsigned int transition_latency);
+
+struct sched_domain;
+unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu);
#endif /* _LINUX_CPUFREQ_H */
--
2.1.4

2016-03-14 05:26:38

by Michael Turquette

Subject: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

arch_scale_freq_capacity is weird. It specifies an arch hook for an
implementation that could easily vary within an architecture or even a
chip family.

This patch helps to mitigate this weirdness by defaulting to the
cpufreq-provided implementation, which should work for all cases where
CONFIG_CPU_FREQ is set.

If CONFIG_CPU_FREQ is not set, then try to use an implementation
provided by the architecture. Failing that, fall back to
SCHED_CAPACITY_SCALE.

It may be desirable for cpufreq drivers to specify their own
implementation of arch_scale_freq_capacity in the future. The same is
true for platform code within an architecture. In both cases an
efficient implementation selector will need to be created and this patch
adds a comment to that effect.

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/sched.h | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 469d11d..37502ea 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
#ifdef CONFIG_SMP
extern void sched_avg_update(struct rq *rq);

-#ifndef arch_scale_freq_capacity
+/*
+ * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
+ * arch code. We select the cpufreq-provided implementation first. If it
+ * doesn't exist then we default to any other implementation provided from
+ * platform/arch code. If those do not exist then we use the default
+ * SCHED_CAPACITY_SCALE value below.
+ *
+ * Note that if cpufreq drivers or platform/arch code have competing
+ * implementations it is up to those subsystems to select one at runtime with
+ * an efficient solution, as we cannot tolerate the overhead of indirect
+ * functions (e.g. function pointers) in the scheduler fast path
+ */
+#ifdef CONFIG_CPU_FREQ
+#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
+#elif !defined(arch_scale_freq_capacity)
static __always_inline
unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
{
--
2.1.4

2016-03-14 05:27:30

by Michael Turquette

Subject: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

The patch "sched/cpufreq: pass sched class into cpufreq_update_util" made
it possible for callers of cpufreq_update_util() to specify the scheduler
class, namely cfs, rt or dl.

Update the schedutil governor to store these individual utilizations per
cpu and sum them to create a total utilization contribution.

Signed-off-by: Michael Turquette <[email protected]>
---
drivers/cpufreq/cpufreq_schedutil.c | 39 +++++++++++++++++++++++++++++++------
1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 18d9ca3..b9234e1 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -46,8 +46,10 @@ struct sugov_cpu {
struct freq_update_hook update_hook;
struct sugov_policy *sg_policy;

+ unsigned long util[nr_util_types];
+ unsigned long total_util;
+
/* The fields below are only needed when sharing a policy. */
- unsigned long util;
unsigned long max;
u64 last_update;
};
@@ -106,6 +108,18 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time,
trace_cpu_frequency(freq, smp_processor_id());
}

+static unsigned long sugov_sum_total_util(struct sugov_cpu *sg_cpu)
+{
+ enum sched_class_util sc;
+
+ /* sum the utilization of all sched classes */
+ sg_cpu->total_util = 0;
+ for (sc = 0; sc < nr_util_types; sc++)
+ sg_cpu->total_util += sg_cpu->util[sc];
+
+ return sg_cpu->total_util;
+}
+
static void sugov_update_single(struct freq_update_hook *hook,
enum sched_class_util sc, u64 time,
unsigned long util, unsigned long max)
@@ -113,12 +127,17 @@ static void sugov_update_single(struct freq_update_hook *hook,
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned int max_f, next_f;
+ unsigned long total_util;

if (!sugov_should_update_freq(sg_policy, time))
return;

+ /* update per-sched_class utilization for this cpu */
+ sg_cpu->util[sc] = util;
+ total_util = sugov_sum_total_util(sg_cpu);
+
max_f = sg_policy->max_freq;
- next_f = util > max ? max_f : util * max_f / max;
+ next_f = total_util > max ? max_f : total_util * max_f / max;
sugov_update_commit(sg_policy, time, next_f);
}

@@ -153,7 +172,7 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
if ((s64)delta_ns > NSEC_PER_SEC / HZ)
continue;

- j_util = j_sg_cpu->util;
+ j_util = j_sg_cpu->total_util;
j_max = j_sg_cpu->max;
if (j_util > j_max)
return max_f;
@@ -174,15 +193,19 @@ static void sugov_update_shared(struct freq_update_hook *hook,
struct sugov_cpu *sg_cpu = container_of(hook, struct sugov_cpu, update_hook);
struct sugov_policy *sg_policy = sg_cpu->sg_policy;
unsigned int next_f;
+ unsigned long total_util;

raw_spin_lock(&sg_policy->update_lock);

- sg_cpu->util = util;
+ sg_cpu->util[sc] = util;
sg_cpu->max = max;
sg_cpu->last_update = time;

+ /* update per-sched_class utilization for this cpu */
+ total_util = sugov_sum_total_util(sg_cpu);
+
if (sugov_should_update_freq(sg_policy, time)) {
- next_f = sugov_next_freq(sg_policy, util, max);
+ next_f = sugov_next_freq(sg_policy, total_util, max);
sugov_update_commit(sg_policy, time, next_f);
}

@@ -423,6 +446,7 @@ static int sugov_start(struct cpufreq_policy *policy)
{
struct sugov_policy *sg_policy = policy->governor_data;
unsigned int cpu;
+ enum sched_class_util sc;

sg_policy->freq_update_delay_ns = sg_policy->tunables->rate_limit_us * NSEC_PER_USEC;
sg_policy->last_freq_update_time = 0;
@@ -434,8 +458,11 @@ static int sugov_start(struct cpufreq_policy *policy)
struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

sg_cpu->sg_policy = sg_policy;
+ for (sc = 0; sc < nr_util_types; sc++)
+ sg_cpu->util[sc] = ULONG_MAX;
+ sg_cpu->total_util = ULONG_MAX;
+
if (policy_is_shared(policy)) {
- sg_cpu->util = ULONG_MAX;
sg_cpu->max = 0;
sg_cpu->last_update = 0;
cpufreq_set_freq_update_hook(cpu, &sg_cpu->update_hook,
--
2.1.4

2016-03-14 05:27:52

by Michael Turquette

Subject: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

With the addition of the global cfs capacity margin helpers in the patch
"sched/cpufreq: new cfs capacity margin helpers", we can now export
sysfs tunables from the schedutil governor. This allows privileged users
to tune the value more easily.

The margin value is global to cfs, not per-policy. As such schedutil
does not store any state about the margin. Schedutil restores the margin
value to its default value when exiting.

Signed-off-by: Michael Turquette <[email protected]>
---
drivers/cpufreq/cpufreq_schedutil.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/drivers/cpufreq/cpufreq_schedutil.c b/drivers/cpufreq/cpufreq_schedutil.c
index 5aa26bf..12e49b9 100644
--- a/drivers/cpufreq/cpufreq_schedutil.c
+++ b/drivers/cpufreq/cpufreq_schedutil.c
@@ -246,8 +246,32 @@ static ssize_t rate_limit_us_store(struct gov_attr_set *attr_set, const char *bu

static struct governor_attr rate_limit_us = __ATTR_RW(rate_limit_us);

+static ssize_t capacity_margin_show(struct gov_attr_set *not_used,
+ char *buf)
+{
+ return sprintf(buf, "%lu\n", cpufreq_get_cfs_capacity_margin());
+}
+
+static ssize_t capacity_margin_store(struct gov_attr_set *attr_set,
+ const char *buf, size_t count)
+{
+ unsigned long margin;
+ int ret;
+
+ ret = sscanf(buf, "%lu", &margin);
+ if (ret != 1)
+ return -EINVAL;
+
+ cpufreq_set_cfs_capacity_margin(margin);
+
+ return count;
+}
+
+static struct governor_attr capacity_margin = __ATTR_RW(capacity_margin);
+
static struct attribute *sugov_attributes[] = {
&rate_limit_us.attr,
+ &capacity_margin.attr,
NULL
};

@@ -381,6 +405,7 @@ static int sugov_exit(struct cpufreq_policy *policy)

mutex_lock(&global_tunables_lock);

+ cpufreq_reset_cfs_capacity_margin();
count = gov_attr_set_put(&tunables->attr_set, &sg_policy->tunables_hook);
policy->governor_data = NULL;
if (!count)
--
2.1.4

2016-03-14 05:28:14

by Michael Turquette

[permalink] [raw]
Subject: [PATCH 2/8] sched/fair: add margin to utilization update

Utilization contributions to cfs_rq->avg.util_avg are scaled for both
microarchitecture-invariance as well as frequency-invariance. This means
that any given utilization contribution will be scaled against the
current cpu capacity (cpu frequency). Contributions from long running
tasks, whose utilization grows larger over time, will asymptotically
approach the current capacity.

This causes a problem when using this utilization signal to select a
target cpu capacity (cpu frequency), as our signal will never exceed the
current capacity, which would otherwise be our signal to increase
frequency.

Solve this by introducing a default capacity margin that is added to the
utilization signal when requesting a change to capacity (cpu frequency).
The margin is 1280, or 1.25 x SCHED_CAPACITY_SCALE (1024). This is
equivalent to similar margins such as the default 125 value assigned to
struct sched_domain.imbalance_pct for load balancing, and to the 80%
up_threshold used by the legacy cpufreq ondemand governor.

Signed-off-by: Michael Turquette <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++--
kernel/sched/sched.h | 3 +++
2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a32f281..29e8bae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -100,6 +100,19 @@ const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
*/
unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;

+/*
+ * Add a 25% margin globally to all capacity requests from cfs. This is
+ * equivalent to an 80% up_threshold in legacy governors like ondemand.
+ *
+ * This is required because task utilization only grows: the frequency-invariant
+ * utilization asymptotically approaches the current cpu capacity, and the
+ * additional margin pushes the request across into the next capacity state.
+ *
+ * XXX someday expand to separate, per-call site margins? e.g. enqueue, fork,
+ * task_tick, load_balance, etc
+ */
+unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
+
#ifdef CONFIG_CFS_BANDWIDTH
/*
* Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
@@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)

if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
unsigned long max = rq->cpu_capacity_orig;
+ unsigned long cap = cfs_rq->avg.util_avg *
+ cfs_capacity_margin / max;

/*
* There are a few boundary cases this might miss but it should
@@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
* thread is a different class (!fair), nor will the utilization
* number include things like RT tasks.
*/
- cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+ cpufreq_update_util(rq_clock(rq), min(cap, max), max);
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f06dfca..8c93ed2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -27,6 +27,9 @@ extern __read_mostly int scheduler_running;
extern unsigned long calc_load_update;
extern atomic_long_t calc_load_tasks;

+#define CAPACITY_MARGIN_DEFAULT 1280
+extern unsigned long cfs_capacity_margin;
+
extern void calc_global_load_tick(struct rq *this_rq);
extern long calc_load_fold_active(struct rq *this_rq);

--
2.1.4

2016-03-15 19:13:52

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Hi Mike,

On 14/03/16 05:22, Michael Turquette wrote:
> From: Dietmar Eggemann <[email protected]>
>
> Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
> frequency scaling correction factor for more accurate load-tracking.
>
> The factor is:
>
> current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
>
> In fact, freq_scale should be a struct cpufreq_policy data member. But
> this would require that the scheduler hot path (__update_load_avg()) would
> have to grab the cpufreq lock. This can be avoided by using per-cpu data
> initialized to SCHED_CAPACITY_SCALE for freq_scale.
>
> Signed-off-by: Dietmar Eggemann <[email protected]>
> Signed-off-by: Michael Turquette <[email protected]>
> ---
> I'm not as sure about patches 7 & 8, but I included them since I needed
> frequency invariance while testing.
>
> As mentioned by myself in 2014 and Rafael last month, the
> arch_scale_freq_capacity hook is awkward, because this behavior may vary
> within an architecture.
>
> I re-introduce Dietmar's generic cpufreq implementation of the frequency
> invariance hook in this patch, and change the preprocessor magic in
> sched.h to favor the cpufreq implementation over arch- or
> platform-specific ones in the next patch.

Maybe it is worth mentioning that this patch is from EAS RFC5.2
(linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
(FEI) based on the cpufreq notifier calls (cpufreq_callback,
cpufreq_policy_callback) in the ARM arch code.

> If run-time selection of ops is needed then someone will need to write
> that code.

Right now I see 3 different implementations of the FEI. 1) The X86
aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
ARM64) code.

I guess with schedutil we do need a solution for all platforms
(different archs, x86 w/ and w/o X86_FEATURE_APERFMPERF, ...).

> I think that this negates the need for the arm arch hooks[0-2], and
> hopefully Morten and Dietmar can weigh in on this.

It's true that we tried to get rid of the usage of the cpufreq callbacks
(cpufreq_callback, cpufreq_policy_callback) with this patch. Plus we
didn't want to implement it twice (for ARM and ARM64).

But 2) would have to work for other ARCHs as well. Maybe as a fall-back
for X86 w/o X86_FEATURE_APERFMPERF feature?

[...]

2016-03-15 19:14:06

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On 14/03/16 05:22, Michael Turquette wrote:
> arch_scale_freq_capacity is weird. It specifies an arch hook for an
> implementation that could easily vary within an architecture or even a
> chip family.
>
> This patch helps to mitigate this weirdness by defaulting to the
> cpufreq-provided implementation, which should work for all cases where
> CONFIG_CPU_FREQ is set.
>
> If CONFIG_CPU_FREQ is not set, then try to use an implementation
> provided by the architecture. Failing that, fall back to
> SCHED_CAPACITY_SCALE.
>
> It may be desirable for cpufreq drivers to specify their own
> implementation of arch_scale_freq_capacity in the future. The same is
> true for platform code within an architecture. In both cases an
> efficient implementation selector will need to be created and this patch
> adds a comment to that effect.

For me this independence of the scheduler code towards the actual
implementation of the Frequency Invariant Engine (FEI) was actually a
feature.

In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
which hasn't been posted to LKML) we establish the link in the ARCH code
(arch/arm64/include/asm/topology.h).

#ifdef CONFIG_CPU_FREQ
#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
...
#endif

>
> Signed-off-by: Michael Turquette <[email protected]>
> ---
> kernel/sched/sched.h | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 469d11d..37502ea 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> #ifdef CONFIG_SMP
> extern void sched_avg_update(struct rq *rq);
>
> -#ifndef arch_scale_freq_capacity
> +/*
> + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> + * arch code. We select the cpufreq-provided implementation first. If it
> + * doesn't exist then we default to any other implementation provided from
> + * platform/arch code. If those do not exist then we use the default
> + * SCHED_CAPACITY_SCALE value below.
> + *
> + * Note that if cpufreq drivers or platform/arch code have competing
> + * implementations it is up to those subsystems to select one at runtime with
> + * an efficient solution, as we cannot tolerate the overhead of indirect
> + * functions (e.g. function pointers) in the scheduler fast path
> + */
> +#ifdef CONFIG_CPU_FREQ
> +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> +#elif !defined(arch_scale_freq_capacity)
> static __always_inline
> unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> {
>

2016-03-15 20:19:24

by Michael Turquette

[permalink] [raw]
Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

Quoting Dietmar Eggemann (2016-03-15 12:13:46)
> Hi Mike,
>
> On 14/03/16 05:22, Michael Turquette wrote:
> > From: Dietmar Eggemann <[email protected]>
> >
> > Implements cpufreq_scale_freq_capacity() to provide the scheduler with a
> > frequency scaling correction factor for more accurate load-tracking.
> >
> > The factor is:
> >
> > current_freq(cpu) << SCHED_CAPACITY_SHIFT / max_freq(cpu)
> >
> > In fact, freq_scale should be a struct cpufreq_policy data member. But
> > this would require that the scheduler hot path (__update_load_avg()) would
> > have to grab the cpufreq lock. This can be avoided by using per-cpu data
> > initialized to SCHED_CAPACITY_SCALE for freq_scale.
> >
> > Signed-off-by: Dietmar Eggemann <[email protected]>
> > Signed-off-by: Michael Turquette <[email protected]>
> > ---
> > I'm not as sure about patches 7 & 8, but I included them since I needed
> > frequency invariance while testing.
> >
> > As mentioned by myself in 2014 and Rafael last month, the
> > arch_scale_freq_capacity hook is awkward, because this behavior may vary
> > within an architecture.
> >
> > I re-introduce Dietmar's generic cpufreq implementation of the frequency
> > invariance hook in this patch, and change the preprocessor magic in
> > sched.h to favor the cpufreq implementation over arch- or
> > platform-specific ones in the next patch.
>
> Maybe it is worth mentioning that this patch is from EAS RFC5.2
> (linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
> posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
> (FEI) based on the cpufreq notifier calls (cpufreq_callback,
> cpufreq_policy_callback) in the ARM arch code.

Oops, my apologies. I got a little mixed up while developing these
patches and I should have at least asked you about this one before
posting.

I'm really quite happy to drop #7 and #8 if they are too contentious or
if patch #7 is deemed as not-ready by you.

>
> > If run-time selection of ops is needed then someone will need to write
> > that code.
>
> Right now I see 3 different implementations of the FEI. 1) The X86
> aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
> in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
> ARM64) code.
>
> I guess with schedutil we do need a solution for all platforms
> (different archs, x86 w/ and w/o X86_FEATURE_APERFMPERF, ...).
>
> > I think that this negates the need for the arm arch hooks[0-2], and
> > hopefully Morten and Dietmar can weigh in on this.
>
> It's true that we tried to get rid of the usage of the cpufreq callbacks
> (cpufreq_callback, cpufreq_policy_callback) with this patch. Plus we
> didn't want to implement it twice (for ARM and ARM64).
>
> But 2) would have to work for other ARCHs as well. Maybe as a fall-back
> for X86 w/o X86_FEATURE_APERFMPERF feature?

That's what I had in mind. I guess that some day there will be a need to
select implementations at run-time for both cpufreq (e.g. different
cpufreq drivers might implement arch_scale_freq_capacity) and for the
!CONFIG_CPU_FREQ case (e.g. different platforms might implement
arch_scale_freq_capacity within the same arch).

The cpufreq approach seems the most generic, hence patch #8 to make it
the default.

Regards,
Mike

>
> [...]

2016-03-15 20:46:36

by Michael Turquette

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

Quoting Dietmar Eggemann (2016-03-15 12:13:58)
> On 14/03/16 05:22, Michael Turquette wrote:
> > arch_scale_freq_capacity is weird. It specifies an arch hook for an
> > implementation that could easily vary within an architecture or even a
> > chip family.
> >
> > This patch helps to mitigate this weirdness by defaulting to the
> > cpufreq-provided implementation, which should work for all cases where
> > CONFIG_CPU_FREQ is set.
> >
> > If CONFIG_CPU_FREQ is not set, then try to use an implementation
> > provided by the architecture. Failing that, fall back to
> > SCHED_CAPACITY_SCALE.
> >
> > It may be desirable for cpufreq drivers to specify their own
> > implementation of arch_scale_freq_capacity in the future. The same is
> > true for platform code within an architecture. In both cases an
> > efficient implementation selector will need to be created and this patch
> > adds a comment to that effect.
>
> For me this independence of the scheduler code towards the actual
> implementation of the Frequency Invariant Engine (FEI) was actually a
> feature.

I do not agree that it is a strength; I think it is confusing. My
opinion is that cpufreq drivers should implement
arch_scale_freq_capacity. Having a sane fallback
(cpufreq_scale_freq_capacity) simply means that you can remove the
boilerplate from the arm32 and arm64 code, which is a win.

Furthermore, if we have multiple competing implementations of
arch_scale_freq_capacity, wouldn't it be better for all of them to
live in cpufreq drivers? This means we would only need to implement a
single run-time "selector".

On the other hand, if the implementation lives in arch code and we have
various implementations of arch_scale_freq_capacity within an
architecture, then each arch would need to implement this selector
function. Even worse then if we have a split where some implementations
live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and
others in arch/arm64 ... now we have three selectors.

Note that this has nothing to do with cpu microarch invariance. I'm
happy for that to stay in arch code because we can have heterogeneous
cpus that do not scale frequency, and thus would not enable cpufreq.
But if your platform scales cpu frequency, then really cpufreq should be
in the loop.

>
> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
> which hasn't been posted to LKML) we establish the link in the ARCH code
> (arch/arm64/include/asm/topology.h).

Right, sorry again about preemptively posting the patch. Total brainfart
on my part.

>
> #ifdef CONFIG_CPU_FREQ
> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> ...
> #endif

The above is no longer necessary with this patch. Same question as
above: why insist on the arch boilerplate?

Regards,
Mike

>
> >
> > Signed-off-by: Michael Turquette <[email protected]>
> > ---
> > kernel/sched/sched.h | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 469d11d..37502ea 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> > #ifdef CONFIG_SMP
> > extern void sched_avg_update(struct rq *rq);
> >
> > -#ifndef arch_scale_freq_capacity
> > +/*
> > + * arch_scale_freq_capacity can be implemented by cpufreq, platform code or
> > + * arch code. We select the cpufreq-provided implementation first. If it
> > + * doesn't exist then we default to any other implementation provided from
> > + * platform/arch code. If those do not exist then we use the default
> > + * SCHED_CAPACITY_SCALE value below.
> > + *
> > + * Note that if cpufreq drivers or platform/arch code have competing
> > + * implementations it is up to those subsystems to select one at runtime with
> > + * an efficient solution, as we cannot tolerate the overhead of indirect
> > + * functions (e.g. function pointers) in the scheduler fast path
> > + */
> > +#ifdef CONFIG_CPU_FREQ
> > +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> > +#elif !defined(arch_scale_freq_capacity)
> > static __always_inline
> > unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> > {
> >

2016-03-15 21:15:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

On Sun, Mar 13, 2016 at 10:22:05PM -0700, Michael Turquette wrote:
> cpufreq_trigger_update() was introduced in "cpufreq: Rework the
> scheduler hooks for triggering updates"[0]. Consensus is that this
> helper is not needed and removing it will aid in experimenting with
> deadline and rt capacity requests.
>
> Instead of reverting the above patch, which includes useful renaming of
> data structures and related functions, simply remove the function,
> update affected kerneldoc and change rt.c and deadline.c to use
> cpufreq_update_util().

This fails to explain how the need for these hooks is dealt with.

2016-03-15 21:17:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/8] sched/cpufreq: new cfs capacity margin helpers

On Sun, Mar 13, 2016 at 10:22:07PM -0700, Michael Turquette wrote:
> +/**
> + * cpufreq_set_cfs_capacity_margin - Set global cfs enqueue capacity margin
> + * @margin: new capacity margin
> + *
> + * margin is a percentage of capacity that is applied to the current
> + * utilization when selecting a new capacity state or cpu frequency. The value
> + * should be normalized to the range of [0..SCHED_CAPACITY_SCALE], where
> + * SCHED_CAPACITY_SCALE is 100% of the normalized capacity, or equivalent to
> + * multiplying the utilization by one.
> + *
> + * For instance, to add a 25% margin to a utilization, margin should be 1280,
> + * which is 1.25x 1024, the default for SCHED_CAPACITY_SCALE.
> + */
> +void cpufreq_set_cfs_capacity_margin(unsigned long margin)
> +{
> + cfs_capacity_margin = margin;
> +}
> +EXPORT_SYMBOL_GPL(cpufreq_set_cfs_capacity_margin);

I don't like this as an interface; what's wrong with using percentiles
as per the discussion I had with Rafael last week?

Also, why is this exported?

2016-03-15 21:19:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

On Sun, Mar 13, 2016 at 10:22:06PM -0700, Michael Turquette wrote:
> @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>
> if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> unsigned long max = rq->cpu_capacity_orig;
> + unsigned long cap = cfs_rq->avg.util_avg *
> + cfs_capacity_margin / max;
>
> /*
> * There are a few boundary cases this might miss but it should
> @@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> * thread is a different class (!fair), nor will the utilization
> * number include things like RT tasks.
> */
> - cpufreq_update_util(rq_clock(rq),
> - min(cfs_rq->avg.util_avg, max), max);
> + cpufreq_update_util(rq_clock(rq), min(cap, max), max);
> }
> }

I really don't see why that is here, and not inside whatever uses
cpufreq_update_util().

2016-03-15 21:20:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On Sun, Mar 13, 2016 at 10:22:08PM -0700, Michael Turquette wrote:
> With the addition of the global cfs capacity margin helpers in patch,
> "sched/cpufreq: new cfs capacity margin helpers", we can now export
> sysfs tunables from the schedutil governor. This allows privileged users
> to tune the value more easily.
>
> The margin value is global to cfs, not per-policy. As such schedutil
> does not store any state about the margin. Schedutil restores the margin
> value to its default value when exiting.

Yuck sysfs.. I would really rather we did not expose this per default.
And certainly not in this weird form.

2016-03-15 21:25:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On Sun, Mar 13, 2016 at 10:22:09PM -0700, Michael Turquette wrote:
> +++ b/include/linux/sched.h
> @@ -2362,15 +2362,25 @@ extern u64 scheduler_tick_max_deferment(void);
> static inline bool sched_can_stop_tick(void) { return false; }
> #endif
>
> +enum sched_class_util {
> + cfs_util,
> + rt_util,
> + dl_util,
> + nr_util_types,
> +};
> +
> #ifdef CONFIG_CPU_FREQ
> struct freq_update_hook {
> + void (*func)(struct freq_update_hook *hook,
> + enum sched_class_util sched_class, u64 time,
> unsigned long util, unsigned long max);
> };
>
> void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
> + void (*func)(struct freq_update_hook *hook,
> + enum sched_class_util sched_class,
> + u64 time, unsigned long util,
> + unsigned long max));

Have you looked at the asm that generated? At some point you'll start
spilling on the stack and it'll be a god awful mess.

2016-03-15 21:32:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

On Sun, Mar 13, 2016 at 10:22:10PM -0700, Michael Turquette wrote:

> +static unsigned long sugov_sum_total_util(struct sugov_cpu *sg_cpu)
> +{
> + enum sched_class_util sc;
> +
> + /* sum the utilization of all sched classes */
> + sg_cpu->total_util = 0;
> + for (sc = 0; sc < nr_util_types; sc++)
> + sg_cpu->total_util += sg_cpu->util[sc];
> +
> + return sg_cpu->total_util;
> +}

> @@ -153,7 +172,7 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
> if ((s64)delta_ns > NSEC_PER_SEC / HZ)
> continue;
>
> - j_util = j_sg_cpu->util;
> + j_util = j_sg_cpu->total_util;
> j_max = j_sg_cpu->max;
> if (j_util > j_max)
> return max_f;

So while not strictly wrong, I think we can do so much better.

Changelog doesn't mention anything useful, like that this is indeed very
rough and what we really should be doing etc..

2016-03-15 21:33:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

On Tue, Mar 15, 2016 at 01:19:17PM -0700, Michael Turquette wrote:
> That's what I had in mind. I guess that some day there will be a need to
> select implementations at run-time for both cpufreq (e.g. different
> cpufreq drivers might implement arch_scale_freq_capacity) and for the
> !CONFIG_CPU_FREQ case (e.g. different platforms might implement
> arch_scale_freq_capacity within the same arch).

No, no runtime selection. That gets us function pointers and other
indirect mess.

We should be trying very hard to get rid of that cpufreq_update_util()
pointer, not add more of that gunk.

Use self modifying code if you have to do something.

2016-03-15 21:34:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

On Sun, Mar 13, 2016 at 10:22:11PM -0700, Michael Turquette wrote:

> +static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
> +
> +unsigned long cpufreq_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> + return per_cpu(freq_scale, cpu);
> +}

These should be in a header so that we can avoid having to do a function
call.

2016-03-15 21:40:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On Sun, Mar 13, 2016 at 10:22:12PM -0700, Michael Turquette wrote:
> +++ b/kernel/sched/sched.h
> @@ -1368,7 +1368,21 @@ static inline int hrtick_enabled(struct rq *rq)
> #ifdef CONFIG_SMP
> extern void sched_avg_update(struct rq *rq);
>
> -#ifndef arch_scale_freq_capacity

> +#ifdef CONFIG_CPU_FREQ
> +#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
> +#elif !defined(arch_scale_freq_capacity)
> static __always_inline
> unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> {

This would not allow x86 to use the APERF/MPERF thing, so no, can't be
right.

Maybe something like

#ifndef arch_scale_freq_capacity
#ifdef CONFIG_CPU_FREQ
#define arch_scale_freq_capacity cpufreq_scale_freq_capacity
#else
static __always_inline
unsigned long arch_scale_freq_capacity(..)
{
return SCHED_CAPACITY_SCALE;
}
#endif
#endif

Which will let the arch override and only falls back to cpufreq if
the arch doesn't do anything.


2016-03-15 21:44:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

On Tue, Mar 15, 2016 at 02:28:48PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:16:14)
> > On Sun, Mar 13, 2016 at 10:22:06PM -0700, Michael Turquette wrote:
> > > @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> > >
> > > if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> > > unsigned long max = rq->cpu_capacity_orig;
> > > + unsigned long cap = cfs_rq->avg.util_avg *
> > > + cfs_capacity_margin / max;
> > >
> > > /*
> > > * There are a few boundary cases this might miss but it should
> > > @@ -2852,8 +2867,7 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> > > * thread is a different class (!fair), nor will the utilization
> > > * number include things like RT tasks.
> > > */
> > > - cpufreq_update_util(rq_clock(rq),
> > > - min(cfs_rq->avg.util_avg, max), max);
> > > + cpufreq_update_util(rq_clock(rq), min(cap, max), max);
> > > }
> > > }
> >
> > I really don't see why that is here, and not inside whatever uses
> > cpufreq_update_util().
>
> Because I want schedutil to be dumb and not implement any policy of its
> own. The idea is for the scheduler to select frequency after all.
>
> I want to avoid a weird hybrid solution where we try to be smart about
> selecting the right capacity/frequency in fair.c (e.g. Steve and Juri's
> patches to fair.c from the sched-freq-v7 series), but then have an
> additional layer of "smarts" massaging those values further in the
> cpufreq governor.

So the problem here is that you add an unconditional division, even if
cpufreq_update_util() then decides to not do anything with it.

Please, these are scheduler paths, do not add (fancy) instructions if
you don't absolutely have to.

2016-03-15 21:49:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On Tue, Mar 15, 2016 at 02:40:43PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:20:47)
> > On Sun, Mar 13, 2016 at 10:22:08PM -0700, Michael Turquette wrote:
> > > With the addition of the global cfs capacity margin helpers in patch,
> > > "sched/cpufreq: new cfs capacity margin helpers", we can now export
> > > sysfs tunables from the schedutil governor. This allows privileged users
> > > to tune the value more easily.
> > >
> > > The margin value is global to cfs, not per-policy. As such schedutil
> > > does not store any state about the margin. Schedutil restores the margin
> > > value to its default value when exiting.
> >
> > Yuck sysfs.. I would really rather we did not expose this per default.
> > And certainly not in this weird form.
>
> I'm happy to change capacity_margin to up_threshold and use a
> percentage.
>
> The sysfs approach has two benefits. First, it is aligned with cpufreq
> user expectations. Second, there has been rough consensus that this
> value should be tunable and sysfs gets us there quickly and painlessly.
> We're already exporting rate_limit_us for schedutil via sysfs. Is there
> a better interface you can recommend?

It really depends on how tunable you want this to be. Do we always want
this to be a tunable, or just now while we're playing about with the
whole thing?

The problem with exposing it in sysfs is that you cannot take it out
again, it becomes ABI.

What we do for all the scheduler tunables (pretty much every time we
have to take a value out of thin air), is we make them const for
!SCHED_DEBUG builds, but have them as sysctl for SCHED_DEBUG builds
(although we should probably move them to /debug/sched/ or somesuch).

That way you get better code generation (compile time constants rule)
for !debug builds, while having the 'joy' of poking at your number on
debug builds.


2016-03-15 21:50:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

On Tue, Mar 15, 2016 at 02:45:45PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:14:48)
> > On Sun, Mar 13, 2016 at 10:22:05PM -0700, Michael Turquette wrote:
> > > cpufreq_trigger_update() was introduced in "cpufreq: Rework the
> > > scheduler hooks for triggering updates"[0]. Consensus is that this
> > > helper is not needed and removing it will aid in experimenting with
> > > deadline and rt capacity requests.
> > >
> > > Instead of reverting the above patch, which includes useful renaming of
> > > data structures and related functions, simply remove the function,
> > > update affected kerneldoc and change rt.c and deadline.c to use
> > > cpufreq_update_util().
> >
> > This fails to explain how the need for these hooks is dealt with.
>
> Sorry, I don't understand your point. The removed hook,
> "cpufreq_trigger_update()" was only used in deadline.c and rt.c, and
> this patch effectively reverts Rafael's patch that introduces that
> function.
>
> It simply does not revert the other changes in Rafael's patch, such as
> some renaming.
>
> deadline.c and rt.c are made to use cpufreq_update_util() and pass in
> ULONG_MAX for capacity and 0 for time. This is exactly what they did
> before patch "cpufreq: Rework the scheduler hooks for triggering
> updates".

Clearly I need to learn to read again.. You're right.

2016-03-16 00:06:05

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 0/8] schedutil enhancements

Hi Mike,

On Sunday, March 13, 2016 10:22:04 PM Michael Turquette wrote:
> I'm happy that scheduler-driven cpu frequency selection is getting some
> attention. Rafael's recent schedutil governor is a step in the right direction.

Thanks!

> This series builds on top of Rafael's schedutil governor, bringing it to parity
> with some of the features in the schedfreq series posted by Steve[0], as well
> as adding a couple of new things.
>
> Patch 1 removes cpufreq_trigger_update()
>
> Patches 2-4 move the cfs capacity margin out of the governor and into
> cfs. This value is made tunable by a sysfs control in schedutil.
>
> Patches 5-6 make cpufreq_update_util() aware of multiple scheduler
> classes (cfs, rt & dl), and add storage & summation of these per-class
> utilization values into schedutil.
>
> Patches 7-8 introduces Dietmar's generic cpufreq implementation[1] of the
> frequency invariance hook and changes the preprocessor magic in sched.h to
> favor the cpufreq implementation over arch- or platform-specific ones.

After the discussion mentioned by Peter in one of his responses (the relevant
e-mail thread is here: http://marc.info/?t=145688568600003&r=1&w=4) I have
changed the schedutil series to address some points made during it.

In particular, I've modified it to use the formula from
http://marc.info/?l=linux-acpi&m=145756618321500&w=4 with the twist that
if the utilization is frequency-invariant, max_freq will be used instead
of current_freq.

The way it recognizes whether or not that is the case is based on the
Peter's suggestion from http://marc.info/?l=linux-kernel&m=145760739700716&w=4.

For this reason, the governor goes into kernel/sched/ as I don't want
arch_scale_freq_invariant() to be exposed to cpufreq at large, because
schedutil will be the only governor using it in foreseeable future.

The fast switching support patch is slightly different too, but that's
not relevant here.

The new series is in the pm-cpufreq-experimental branch of my tree:

git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
pm-cpufreq-experimental

I haven't had the time to post it over the last few days (sorry about
that), but I'll do that tomorrow.

Thanks,
Rafael

2016-03-16 02:53:06

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

On 03/13/2016 10:22 PM, Michael Turquette wrote:
> +unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
> +
> #ifdef CONFIG_CFS_BANDWIDTH
> /*
> * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
> @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
>
> if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> unsigned long max = rq->cpu_capacity_orig;
> + unsigned long cap = cfs_rq->avg.util_avg *
> + cfs_capacity_margin / max;

Doesn't rq->cpu_capacity_orig get scaled per the microarch invariance?
This would mean that the margin we're applying here would differ based
on that.

I'd expect the margin to be applied as util * (cfs_capacity_margin /
SCHED_CAPACITY_SCALE), which would then reduce the division to a shift.

2016-03-16 03:37:08

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On 03/15/2016 03:37 PM, Michael Turquette wrote:
>>>> Yuck sysfs.. I would really rather we did not expose this per default.
>>>> > > > And certainly not in this weird form.
>>> > >
>>> > > I'm happy to change capacity_margin to up_threshold and use a
>>> > > percentage.
>>> > >
>>> > > The sysfs approach has two benefits. First, it is aligned with cpufreq
>>> > > user expectations. Second, there has been rough consensus that this
>>> > > value should be tunable and sysfs gets us there quickly and painlessly.
>>> > > We're already exporting rate_limit_us for schedutil via sysfs. Is there
>>> > > a better interface you can recommend?
>> >
>> > It really depends on how tunable you want this to be. Do we always want
>> > this to be a tunable, or just now while we're playing about with the
>> > whole thing?
>
> I had considered this myself, and I really think that Steve and Juri
> should chime in as they have spent more time tuning and running the
> numbers.
>
> I'm inclined to think that a debug version would be good enough, as I
> don't imagine this value being changed at run-time by some userspace
> daemon or something.
>
> Then again, maybe this knob will be part of the mythical
> power-vs-performance slider?

Patrick Bellasi's schedtune series [0] (which I think is the referenced
mythical slider) aims to provide a more sophisticated interface for
tuning scheduler-driven frequency selection. In addition to a global
boost value it includes a cgroup controller as well for per-task tuning.

I would definitely expect the margin/boost value to be modified at
runtime, for example if the battery is running low, or the user wants
100% performance for a while, or the userspace framework wants to
temporarily tailor the performance level for a particular set of tasks, etc.

[0] http://article.gmane.org/gmane.linux.kernel/2022959

2016-03-16 03:55:41

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

Hi Mike,

On 03/15/2016 03:06 PM, Michael Turquette wrote:
>>> > > void cpufreq_set_freq_update_hook(int cpu, struct freq_update_hook *hook,
>>> > > + void (*func)(struct freq_update_hook *hook,
>>> > > + enum sched_class_util sched_class,
>>> > > + u64 time, unsigned long util,
>>> > > + unsigned long max));
>> >
>> > Have you looked at the asm that generated? At some point you'll start
>> > spilling on the stack and it'll be a god awful mess.
>> >
> Is your complaint about the enum that I added to the existing function
> signature, or do you not like the original function signature[0] from
> Rafael's patch, sans enum?

The ARM procedure call standard has the first four arguments in
registers so the addition of the enum above will start using the stack.

2016-03-16 07:38:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

On Tue, Mar 15, 2016 at 03:09:51PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:29:26)
> > On Sun, Mar 13, 2016 at 10:22:10PM -0700, Michael Turquette wrote:
> >
> > > +static unsigned long sugov_sum_total_util(struct sugov_cpu *sg_cpu)
> > > +{
> > > + enum sched_class_util sc;
> > > +
> > > + /* sum the utilization of all sched classes */
> > > + sg_cpu->total_util = 0;
> > > + for (sc = 0; sc < nr_util_types; sc++)
> > > + sg_cpu->total_util += sg_cpu->util[sc];
> > > +
> > > + return sg_cpu->total_util;
> > > +}
> >
> > > @@ -153,7 +172,7 @@ static unsigned int sugov_next_freq(struct sugov_policy *sg_policy,
> > > if ((s64)delta_ns > NSEC_PER_SEC / HZ)
> > > continue;
> > >
> > > - j_util = j_sg_cpu->util;
> > > + j_util = j_sg_cpu->total_util;
> > > j_max = j_sg_cpu->max;
> > > if (j_util > j_max)
> > > return max_f;
> >
> > So while not strictly wrong, I think we can do so much better.
> >
> > Changelog doesn't mention anything useful, like that this is indeed very
> > rough and what we really should be doing etc..
>
> What should we really be doing? Summing the scheduler class
> contributions seems correct to me.
>
> Are you referring to the fact that dl and rt are passing bogus values
> into cpufreq_update_util()? If so I'm happy to add a note about that in
> the changelog.

Somewhere in the giant discussions I mentioned that we should be looking
at a CPPC like interface and pass {min,max} tuples to the cpufreq
selection thingy.

In that same discussion I also mentioned that we must compute min as the
hard dl reservation, but that for max we can actually use the avg dl +
avg rt + avg cfs.

That way there is far more room for selecting a sensible frequency.
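Peter's {min,max} tuple idea might be sketched like this. The `freq_request` struct and `build_request()` helper are hypothetical names invented for illustration, assuming min is the hard deadline reservation and max is the summed average utilization of all classes, clamped to CPU capacity:

```c
#include <assert.h>

struct freq_request {
	unsigned long min;	/* hard floor: the DL bandwidth reservation */
	unsigned long max;	/* best estimate: avg dl + avg rt + avg cfs */
};

/*
 * Hypothetical helper: the governor may pick any frequency that serves
 * at least 'min' capacity, with 'max' as the sensible upper target.
 */
static struct freq_request
build_request(unsigned long dl_reserved, unsigned long dl_avg,
	      unsigned long rt_avg, unsigned long cfs_avg,
	      unsigned long capacity)
{
	struct freq_request req;
	unsigned long sum = dl_avg + rt_avg + cfs_avg;

	req.min = dl_reserved;
	req.max = sum < capacity ? sum : capacity;
	if (req.max < req.min)		/* never undercut the hard floor */
		req.max = req.min;
	return req;
}
```

The {min,max} window is what gives the governor "far more room": any frequency in between is legal, so it can trade latency against power within the window.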

2016-03-16 07:41:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On Tue, Mar 15, 2016 at 03:06:09PM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2016-03-15 14:25:20)
> > On Sun, Mar 13, 2016 at 10:22:09PM -0700, Michael Turquette wrote:
> > > +++ b/include/linux/sched.h
> > > @@ -2362,15 +2362,25 @@ extern u64 scheduler_tick_max_deferment(void);
> > > static inline bool sched_can_stop_tick(void) { return false; }
> > > #endif
> > >
> > > +enum sched_class_util {
> > > + cfs_util,
> > > + rt_util,
> > > + dl_util,
> > > + nr_util_types,
> > > +};
> > > +
> > > #ifdef CONFIG_CPU_FREQ
> > > struct freq_update_hook {
> > > + void (*func)(struct freq_update_hook *hook,
> > > + enum sched_class_util sched_class, u64 time,
> > > unsigned long util, unsigned long max);
> > > };
> > >
> > Have you looked at the asm that generated? At some point you'll start
> > spilling on the stack and it'll be a god awful mess.
> >
>
> Is your complaint about the enum that I added to the existing function
> signature, or do you not like the original function signature[0] from
> Rafael's patch, sans enum?

No, my complaint is more about the call ABI for all our platforms, at
some point we start passing arguments on the stack instead of through
registers.

I'm not sure where that starts hurting, but it's always a concern when
adding arguments to functions.

2016-03-16 07:47:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On Tue, Mar 15, 2016 at 03:27:21PM -0700, Michael Turquette wrote:

> That solution scales for the case where architectures have different
> methods. It doesn't scale for the case where cpufreq drivers or platform
> code within the same arch have competing implementations.

Sure it does; no matter what interface we use on x86 to set the DVFS
hints (ACPI, intel_p_state, whatever), using APERF/MPERF is the only
actual way of telling WTH the actual frequency was.

> I'm happy with it as a stop-gap, because it will initially work for
> arm{64} and x86, but we'll still need run-time selection of
> arch_scale_freq_capacity some day. Once we have that, I think that we
> should favor a run-time provided implementation over the arch-provided
> one.

Also, I'm thinking we don't need any of this. Your
cpufreq_scale_freq_capacity() is completely and utterly pointless. Since
its implementation simply provides whatever frequency we selected, it's
identical to not using frequency-invariant load metrics and having
cpufreq use the !inv formula.

See:

lkml.kernel.org/r/[email protected]

Now, something else (power aware scheduling etc..) might need the freq
invariant stuff, but cpufreq (which we're concerned with here) does not
unless arch_scale_freq_capacity() does something else than simply return
the value we've set earlier.

2016-03-16 08:01:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/8] sched/cpufreq: remove cpufreq_trigger_update()

On Sun, Mar 13, 2016 at 10:22:05PM -0700, Michael Turquette wrote:
> cpufreq_trigger_update() was introduced in "cpufreq: Rework the
> scheduler hooks for triggering updates"[0]. Consensus is that this
> helper is not needed and removing it will aid in experimenting with
> deadline and rt capacity requests.
>
> Instead of reverting the above patch, which includes useful renaming of
> data structures and related functions, simply remove the function,
> update affected kerneldoc and change rt.c and deadline.c to use
> cpufreq_update_util().

> -/**
> - * cpufreq_trigger_update - Trigger CPU performance state evaluation if needed.
> - * @time: Current time.
> - *
> - * The way cpufreq is currently arranged requires it to evaluate the CPU
> - * performance state (frequency/voltage) on a regular basis. To facilitate
> - * that, cpufreq_update_util() is called by update_load_avg() in CFS when
> - * executed for the current CPU's runqueue.
> - *
> - * However, this isn't sufficient to prevent the CPU from being stuck in a
> - * completely inadequate performance level for too long, because the calls
> - * from CFS will not be made if RT or deadline tasks are active all the time
> - * (or there are RT and DL tasks only).
> - *
> - * As a workaround for that issue, this function is called by the RT and DL
> - * sched classes to trigger extra cpufreq updates to prevent it from stalling,
> - * but that really is a band-aid. Going forward it should be replaced with
> - * solutions targeted more specifically at RT and DL tasks.
> - */
> -void cpufreq_trigger_update(u64 time)
> -{
> - cpufreq_update_util(time, ULONG_MAX, 0);
> -}

> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 1a035fa..3fd5bc4 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -728,7 +728,7 @@ static void update_curr_dl(struct rq *rq)
>
> /* Kick cpufreq (see the comment in drivers/cpufreq/cpufreq.c). */
> if (cpu_of(rq) == smp_processor_id())
> - cpufreq_trigger_update(rq_clock(rq));
> + cpufreq_update_util(rq_clock(rq), ULONG_MAX, 0);
>
> /*
> * Consumed budget is computed considering the time as

OK, so take two on this, now hopefully more coherent (yay for sleep!).

So my problem is that this (update_curr_dl) is not the right location to
set DL utilization (although it might be for avg dl, see the other
email).

The only reason it lives here, is that some cpufreq governors require
'timely' calls into this hook. The comment you destroyed tries to convey
this.

We should still remove this requirement from the governors. And for
simple DL guarantees this hook is placed wrong.


2016-03-16 08:05:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> > Then again, maybe this knob will be part of the mythical
> > power-vs-performance slider?
>
> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> mythical slider) aims to provide a more sophisticated interface for
> tuning scheduler-driven frequency selection. In addition to a global
> boost value it includes a cgroup controller as well for per-task tuning.
>
> I would definitely expect the margin/boost value to be modified at
> runtime, for example if the battery is running low, or the user wants
> 100% performance for a while, or the userspace framework wants to
> temporarily tailor the performance level for a particular set of tasks, etc.

OK, so how about we start with it as a debug knob, and once we have
experience and feel like it is indeed a useful runtime knob, we upgrade
it to ABI.

The problem with starting out as ABI is that its hard to take away
again.

2016-03-16 08:30:26

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On 16 March 2016 at 08:41, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 15, 2016 at 03:06:09PM -0700, Michael Turquette wrote:
>> Quoting Peter Zijlstra (2016-03-15 14:25:20)
>> > On Sun, Mar 13, 2016 at 10:22:09PM -0700, Michael Turquette wrote:
>> > > +++ b/include/linux/sched.h
>> > > @@ -2362,15 +2362,25 @@ extern u64 scheduler_tick_max_deferment(void);
>> > > static inline bool sched_can_stop_tick(void) { return false; }
>> > > #endif
>> > >
>> > > +enum sched_class_util {
>> > > + cfs_util,
>> > > + rt_util,
>> > > + dl_util,
>> > > + nr_util_types,
>> > > +};
>> > > +
>> > > #ifdef CONFIG_CPU_FREQ
>> > > struct freq_update_hook {
>> > > + void (*func)(struct freq_update_hook *hook,
>> > > + enum sched_class_util sched_class, u64 time,
>> > > unsigned long util, unsigned long max);
>> > > };
>> > >
>> > Have you looked at the asm that generated? At some point you'll start
>> > spilling on the stack and it'll be a god awful mess.
>> >
>>
>> Is your complaint about the enum that I added to the existing function
>> signature, or do you not like the original function signature[0] from
>> Rafael's patch, sans enum?
>
> No, my complaint is more about the call ABI for all our platforms, at
> some point we start passing arguments on the stack instead of through
> registers.
>
> I'm not sure where that starts hurting, but it's always a concern when
> adding arguments to functions.

I wonder if it's really worth passing per sched_class request to
sched_util ? sched_util is about selecting a frequency based on the
utilization of the CPU, it only needs a value that reflect the whole
utilization. Can't we sum (or whatever the formula we want to apply)
utilizations before calling cpufreq_update_util

2016-03-16 08:54:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On Wed, Mar 16, 2016 at 09:29:59AM +0100, Vincent Guittot wrote:
> I wonder if it's really worth passing per sched_class request to
> sched_util ? sched_util is about selecting a frequency based on the
> utilization of the CPU, it only needs a value that reflect the whole
> utilization. Can't we sum (or whatever the formula we want to apply)
> utilizations before calling cpufreq_update_util

So I've thought the same; but I'm conflicted, its a shame to compute
anything if the call then doesn't do anything with it.

And keeping a structure of all the various numbers to pass in also has
cost of yet another cacheline to touch.

2016-03-16 09:16:47

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On 16 March 2016 at 09:53, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 16, 2016 at 09:29:59AM +0100, Vincent Guittot wrote:
>> I wonder if it's really worth passing per sched_class request to
>> sched_util ? sched_util is about selecting a frequency based on the
>> utilization of the CPU, it only needs a value that reflect the whole
>> utilization. Can't we sum (or whatever the formula we want to apply)
>> utilizations before calling cpufreq_update_util
>
> So I've thought the same; but I'm conflicted, its a shame to compute
> anything if the call then doesn't do anything with it.

yes, at least we should skip all that stuff (including adding a margin)
if no hook has been set in cpufreq_update_util.
I also see a potential optimization: update the value only when the
utilization has actually decayed.

>
> And keeping a structure of all the various numbers to pass in also has
> cost of yet another cacheline to touch.
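Vincent's two points (pre-sum before the call, and skip everything when no governor is listening) could be sketched like this. This is a hypothetical userspace sketch: `maybe_update_util()` and `demo_func()` are invented names, and the demo hook merely records what it was handed:

```c
#include <assert.h>
#include <stddef.h>

struct freq_update_hook {
	void (*func)(struct freq_update_hook *hook, unsigned long long time,
		     unsigned long util, unsigned long max);
};

/* Demo hook that just records the utilization it was handed. */
static unsigned long seen_util;
static void demo_func(struct freq_update_hook *hook, unsigned long long time,
		      unsigned long util, unsigned long max)
{
	(void)hook; (void)time; (void)max;
	seen_util = util;
}

/*
 * Hypothetical call site: bail out before any summing or margin math
 * when no hook is registered, and pass a single pre-summed value so
 * the hook keeps a small (register-friendly) signature.
 */
static void maybe_update_util(struct freq_update_hook *hook,
			      unsigned long long time, unsigned long cfs,
			      unsigned long rt, unsigned long dl,
			      unsigned long max)
{
	if (!hook)	/* no governor listening: skip all the work */
		return;

	hook->func(hook, time, cfs + rt + dl, max);
}
```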

2016-03-16 10:01:45

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Hi,

On 16/03/16 09:05, Peter Zijlstra wrote:
> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> > > Then again, maybe this knob will be part of the mythical
> > > power-vs-performance slider?
> >
> > Patrick Bellasi's schedtune series [0] (which I think is the referenced
> > mythical slider) aims to provide a more sophisticated interface for
> > tuning scheduler-driven frequency selection. In addition to a global
> > boost value it includes a cgroup controller as well for per-task tuning.
> >
> > I would definitely expect the margin/boost value to be modified at
> > runtime, for example if the battery is running low, or the user wants
> > 100% performance for a while, or the userspace framework wants to
> > temporarily tailor the performance level for a particular set of tasks, etc.
>
> OK, so how about we start with it as a debug knob, and once we have
> experience and feel like it is indeed a useful runtime knob, we upgrade
> it to ABI.
>

I tend to agree here. To me the margin is something that we need to make
this thing work and to get acceptable performance out of the box. So we
can play with it while debugging, but I consider the schedtune slider as
the way to tune the system at runtime.

Best,

- Juri

2016-03-16 12:39:14

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On Wed, Mar 16, 2016 at 9:53 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 16, 2016 at 09:29:59AM +0100, Vincent Guittot wrote:
>> I wonder if it's really worth passing per sched_class request to
>> sched_util ? sched_util is about selecting a frequency based on the
>> utilization of the CPU, it only needs a value that reflect the whole
>> utilization. Can't we sum (or whatever the formula we want to apply)
>> utilizations before calling cpufreq_update_util
>
> So I've thought the same; but I'm conflicted, its a shame to compute
> anything if the call then doesn't do anything with it.
>
> And keeping a structure of all the various numbers to pass in also has
> cost of yet another cacheline to touch.

In principle we can use high-order bits of util and max to encode the
information on where they come from.

Of course, that translates to additional ifs in the governor, but I
guess they are unavoidable anyway.
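Rafael's high-order-bit idea could look roughly like this. The encoding scheme, macro names, and helpers below are all hypothetical, assuming real utilization values are bounded by the capacity scale and so never touch the top bits:

```c
#include <assert.h>

enum sched_class_util { cfs_util, rt_util, dl_util, nr_util_types };

/*
 * Hypothetical encoding: tag the two high-order bits of 'util' with
 * the sched class it came from; real utilization values fit well
 * below these bits.
 */
#define UTIL_CLASS_SHIFT	(sizeof(unsigned long) * 8 - 2)
#define UTIL_CLASS_MASK		(3UL << UTIL_CLASS_SHIFT)

static inline unsigned long util_encode(unsigned long util,
					enum sched_class_util sc)
{
	return (util & ~UTIL_CLASS_MASK) |
	       ((unsigned long)sc << UTIL_CLASS_SHIFT);
}

static inline enum sched_class_util util_class(unsigned long v)
{
	return (enum sched_class_util)(v >> UTIL_CLASS_SHIFT);
}

static inline unsigned long util_value(unsigned long v)
{
	return v & ~UTIL_CLASS_MASK;
}
```

The cost is the extra decode (the "additional ifs") on the governor side, but the function signature stays within the register-passed argument limit.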

2016-03-16 12:42:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On Wed, Mar 16, 2016 at 08:47:52AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 15, 2016 at 03:27:21PM -0700, Michael Turquette wrote:

> > I'm happy with it as a stop-gap, because it will initially work for
> > arm{64} and x86, but we'll still need run-time selection of
> > arch_scale_freq_capacity some day. Once we have that, I think that we
> > should favor a run-time provided implementation over the arch-provided
> > one.
>
> Also, I'm thinking we don't need any of this. Your
> cpufreq_scale_freq_capacity() is completely and utterly pointless.

Scrap that; while true for instantaneous utilization, this isn't true
for our case, since our utilization numbers carry history, and any
frequency change in that window is relevant.


2016-03-16 12:45:23

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On Wed, Mar 16, 2016 at 9:05 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
>> > Then again, maybe this knob will be part of the mythical
>> > power-vs-performance slider?
>>
>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
>> mythical slider) aims to provide a more sophisticated interface for
>> tuning scheduler-driven frequency selection. In addition to a global
>> boost value it includes a cgroup controller as well for per-task tuning.
>>
>> I would definitely expect the margin/boost value to be modified at
>> runtime, for example if the battery is running low, or the user wants
>> 100% performance for a while, or the userspace framework wants to
>> temporarily tailor the performance level for a particular set of tasks, etc.
>
> OK, so how about we start with it as a debug knob, and once we have
> experience and feel like it is indeed a useful runtime knob, we upgrade
> it to ABI.
>
> The problem with starting out as ABI is that its hard to take away
> again.

Agreed, plus it is quite hard to get ABI right from the outset. Even
if we decide on a sysfs knob, it still is unclear what exactly should
be represented by it in what units etc.

2016-03-16 13:10:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On Wed, Mar 16, 2016 at 01:39:10PM +0100, Rafael J. Wysocki wrote:
> On Wed, Mar 16, 2016 at 9:53 AM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Mar 16, 2016 at 09:29:59AM +0100, Vincent Guittot wrote:
> >> I wonder if it's really worth passing per sched_class request to
> >> sched_util ? sched_util is about selecting a frequency based on the
> >> utilization of the CPU, it only needs a value that reflect the whole
> >> utilization. Can't we sum (or whatever the formula we want to apply)
> >> utilizations before calling cpufreq_update_util
> >
> > So I've thought the same; but I'm conflicted, its a shame to compute
> > anything if the call then doesn't do anything with it.
> >
> > And keeping a structure of all the various numbers to pass in also has
> > cost of yet another cacheline to touch.
>
> In principle we can use high-order bits of util and max to encode the
> information on where they come from.
>
> Of course, that translates to additional ifs in the governor, but I
> guess they are unavoidable anyway.

Another thing we can do, for as long as we have the indirect function
call anyway, is stuff extra pointers in that same cacheline we pull the
function from.

Something like the below; there's room for 8 pointers (including the
function pointer) in a cacheline.

That would allow the callback to fetch whatever data it feels is
required (could be all of it).

We could also put a u64 *now = &rq->clock in, which would leave another
4 pointers for DL/RT support.

And since we're then back to 1-2 arguments on the function, we can add a
flags/mask field to indicate what changed (and if the function
throttles, it can keep a mask of all that changed since last time it
actually did something, or allow punching through the throttle if our
minimum guarantee changes or whatnot).

(this would of course require we allocate struct update_util_data with
the proper alignment thingies etc..)

Then again, maybe this is somewhat overboard :-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba49c9efd0b2..d34d75c5cc93 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3236,8 +3236,10 @@ static inline unsigned long rlimit_max(unsigned int limit)

#ifdef CONFIG_CPU_FREQ
struct update_util_data {
- void (*func)(struct update_util_data *data,
- u64 time, unsigned long util, unsigned long max);
+ unsigned long *cfs_util_avg;
+ unsigned long *cfs_util_max;
+
+ void (*func)(struct update_util_data *data, u64 time);
};

void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
index 928c4ba32f68..de5b20b11de3 100644
--- a/kernel/sched/cpufreq.c
+++ b/kernel/sched/cpufreq.c
@@ -32,6 +32,9 @@ void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
if (WARN_ON(data && !data->func))
return;

+ data->cfs_util_avg = &cpu_rq(cpu)->cfs.avg.util_avg;
+ data->cfs_util_max = &cpu_rq(cpu)->cpu_capacity_orig;
+
rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
}
EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);

2016-03-16 13:23:25

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On Wed, Mar 16, 2016 at 2:10 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Mar 16, 2016 at 01:39:10PM +0100, Rafael J. Wysocki wrote:
>> On Wed, Mar 16, 2016 at 9:53 AM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Mar 16, 2016 at 09:29:59AM +0100, Vincent Guittot wrote:
>> >> I wonder if it's really worth passing per sched_class request to
>> >> sched_util ? sched_util is about selecting a frequency based on the
>> >> utilization of the CPU, it only needs a value that reflect the whole
>> >> utilization. Can't we sum (or whatever the formula we want to apply)
>> >> utilizations before calling cpufreq_update_util
>> >
>> > So I've thought the same; but I'm conflicted, its a shame to compute
>> > anything if the call then doesn't do anything with it.
>> >
>> > And keeping a structure of all the various numbers to pass in also has
>> > cost of yet another cacheline to touch.
>>
>> In principle we can use high-order bits of util and max to encode the
>> information on where they come from.
>>
>> Of course, that translates to additional ifs in the governor, but I
>> guess they are unavoidable anyway.
>
> Another thing we can do, for as long as we have the indirect function
> call anyway, is stuff extra pointers in that same cacheline we pull the
> function from.
>
> Something like the below; there's room for 8 pointers (including the
> function pointer) in a cacheline.
>
> That would allow the callback to fetch whatever data it feels is
> required (could be all of it).
>
> We could also put a u64 *now = &rq->clock in, which would leave another
> 4 pointers for DL/RT support.
>
> And since we're then back to 1-2 arguments on the function, we can add a
> flags/mask field to indicate what changed (and if the function
> throttles, it can keep a mask of all that changed since last time it
> actually did something, or allow punching through the throttle if our
> minimum guarantee changes or whatnot).
>
> (this would of course require we allocate struct update_util_data with
> the proper alignment thingies etc..)
>
> Then again, maybe this is somewhat overboard :-)

I was thinking about something along these lines, but then I thought
that passing in registers would be more efficient.

One advantage I can see here is that we don't pass arguments that may
not be used by the callee.

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index ba49c9efd0b2..d34d75c5cc93 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -3236,8 +3236,10 @@ static inline unsigned long rlimit_max(unsigned int limit)
>
> #ifdef CONFIG_CPU_FREQ
> struct update_util_data {
> - void (*func)(struct update_util_data *data,
> - u64 time, unsigned long util, unsigned long max);
> + unsigned long *cfs_util_avg;
> + unsigned long *cfs_util_max;
> +
> + void (*func)(struct update_util_data *data, u64 time);
> };

How do we ensure proper alignment?

> void cpufreq_set_update_util_data(int cpu, struct update_util_data *data);
> diff --git a/kernel/sched/cpufreq.c b/kernel/sched/cpufreq.c
> index 928c4ba32f68..de5b20b11de3 100644
> --- a/kernel/sched/cpufreq.c
> +++ b/kernel/sched/cpufreq.c
> @@ -32,6 +32,9 @@ void cpufreq_set_update_util_data(int cpu, struct update_util_data *data)
> if (WARN_ON(data && !data->func))
> return;
>
> + data->cfs_util_avg = &cpu_rq(cpu)->cfs.avg.util_avg;
> + data->cfs_util_max = &cpu_rq(cpu)->cpu_capacity_orig;
> +
> rcu_assign_pointer(per_cpu(cpufreq_update_util_data, cpu), data);
> }
> EXPORT_SYMBOL_GPL(cpufreq_set_update_util_data);

2016-03-16 13:44:56

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 5/8] sched/cpufreq: pass sched class into cpufreq_update_util

On Wed, Mar 16, 2016 at 02:23:21PM +0100, Rafael J. Wysocki wrote:
> On Wed, Mar 16, 2016 at 2:10 PM, Peter Zijlstra <[email protected]> wrote:
> > (this would of course require we allocate struct update_util_data with
> > the proper alignment thingies etc..)

> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index ba49c9efd0b2..d34d75c5cc93 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -3236,8 +3236,10 @@ static inline unsigned long rlimit_max(unsigned int limit)
> >
> > #ifdef CONFIG_CPU_FREQ
> > struct update_util_data {
> > - void (*func)(struct update_util_data *data,
> > - u64 time, unsigned long util, unsigned long max);
> > + unsigned long *cfs_util_avg;
> > + unsigned long *cfs_util_max;
> > +
> > + void (*func)(struct update_util_data *data, u64 time);
> > };

we should add: ____cacheline_aligned here

> How do we ensure proper alignment?

Depends on the allocator; not all of them respect the struct alignment
attribute.

kernel/sched/cpufreq.c:DEFINE_PER_CPU(struct update_util_data *, cpufreq_update_util_data);

That one could use:

DEFINE_PER_CPU_ALIGNED() instead

as would this one:

drivers/cpufreq/cpufreq_governor.c:static DEFINE_PER_CPU(struct cpu_dbs_info, cpu_dbs);

Because when you cacheline align dbs_info, its update_util_data member
will also get the correct alignment because of the structure attribute.


2016-03-16 17:55:52

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On 03/16/2016 03:02 AM, Juri Lelli wrote:
> Hi,
>
> On 16/03/16 09:05, Peter Zijlstra wrote:
>> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
>>>> Then again, maybe this knob will be part of the mythical
>>>> power-vs-performance slider?
>>>
>>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
>>> mythical slider) aims to provide a more sophisticated interface for
>>> tuning scheduler-driven frequency selection. In addition to a global
>>> boost value it includes a cgroup controller as well for per-task tuning.
>>>
>>> I would definitely expect the margin/boost value to be modified at
>>> runtime, for example if the battery is running low, or the user wants
>>> 100% performance for a while, or the userspace framework wants to
>>> temporarily tailor the performance level for a particular set of tasks, etc.
>>
>> OK, so how about we start with it as a debug knob, and once we have
>> experience and feel like it is indeed a useful runtime knob, we upgrade
>> it to ABI.
>>
>
> I tend to agree here. To me the margin is something that we need to make
> this thing work and to get acceptable performance out of the box. So we
> can play with it while debugging, but I consider the schedtune slider as
> the way to tune the system at runtime.

Could the default schedtune value not serve as the out-of-the-box margin?

Regardless I agree that a debug interface is the way to go for now while
we figure things out.

2016-03-16 18:20:48

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

On 03/16/2016 12:38 AM, Peter Zijlstra wrote:
> Somewhere in the giant discussions I mentioned that we should be looking
> at a CPPC like interface and pass {min,max} tuples to the cpufreq
> selection thingy.
>
> In that same discussion I also mentioned that we must compute min as the
> hard dl reservation, but that for max we can actually use the avg dl +
> avg rt + avg cfs.
>
> That way there is far more room for selecting a sensible frequency.

Doesn't the above min/max policy mean that the platform will likely
underserve the task load? If avg dl+rt+cfs represents our best estimate
of the work to be done, I would think that should be the min.

2016-03-16 18:34:01

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 7/8] cpufreq: Frequency invariant scheduler load-tracking support

On 15/03/16 20:19, Michael Turquette wrote:
> Quoting Dietmar Eggemann (2016-03-15 12:13:46)
>> Hi Mike,
>>
>> On 14/03/16 05:22, Michael Turquette wrote:
>>> From: Dietmar Eggemann <[email protected]>
>>>

[...]

>> Maybe it is worth mentioning that this patch is from EAS RFC5.2
>> (linux-arm.org/linux-power.git energy_model_rfc_v5.2) which hasn't been
>> posted to LKML. The last EAS RFCv5 has the Frequency Invariant Engine
>> (FEI) based on the cpufreq notifier calls (cpufreq_callback,
>> cpufreq_policy_callback) in the ARM arch code.
>
> Oops, my apologies. I got a little mixed up while developing these
> patches and I should have at least asked you about this one before
> posting.

No need to apologize, just wanted to set the context right for people
following the EAS stuff.

> I'm really quite happy to drop #7 and #8 if they are too contentious or
> if patch #7 is deemed as not-ready by you.

If people think that driving frequency invariance from within cpufreq.c
is the right thing to do, then this patch can stay.

That said, I'm not sure if we can do a better job of integrating
freq_scale into existing cpufreq data structures.


>>> If run-time selection of ops is needed them someone will need to write
>>> that code.
>>
>> Right now I see 3 different implementations of the FEI. 1) The X86
>> aperf/mperf based one (https://lkml.org/lkml/2016/3/3/589), 2) This one
>> in cpufreq.c and 3) the one based on cpufreq notifiers in ARCH (ARM,
>> ARM64) code.

I rechecked the functionality of this implementation (2) versus (1) and
(3) by running them all together on an X86 system w/
X86_FEATURE_APERFMPERF and acpi-cpufreq (i5 CPU M520) w/o actually doing
the wiring towards arch_scale_freq_capacity, and they all behave
similarly or identically. The update of (1) obviously happens much more
often since it's driven from the scheduler tick.

...
arch_scale_freq_tick: APERF/MPERF: aperf=5864236223 mperf=6009442434
acnt=200073 mcnt=362360 arch_cpu_freq=565
arch_scale_freq_tick: APERF/MPERF: aperf=5864261886 mperf=6009474724
acnt=25663 mcnt=32290 arch_cpu_freq=813
scale_freq_capacity: CPUFREQ.C: cpu=1 freq_scale=910
cpufreq_callback: NOTIFIER: cpu=1 curr_freq=2133000 max_freq=2400000
freq_scale=910
arch_scale_freq_tick: APERF/MPERF: aperf=5864524610 mperf=6009802492
acnt=262724 mcnt=327768 arch_cpu_freq=820
arch_scale_freq_tick: APERF/MPERF: aperf=5864581063 mperf=6009862268
acnt=56453 mcnt=59776 arch_cpu_freq=967
...

[...]
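For reference, the trace values above are consistent with both hooks computing a capacity ratio scaled by SCHED_CAPACITY_SCALE (1024). A minimal user-space sketch of the two computations follows; the helper names are made up for illustration, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

#define SCHED_CAPACITY_SHIFT 10

/*
 * APERF/MPERF-based scale, as in implementation (1): the ratio of
 * delta-APERF to delta-MPERF over the tick period, in capacity units.
 */
static unsigned long scale_from_amperf(uint64_t acnt, uint64_t mcnt)
{
	return (unsigned long)((acnt << SCHED_CAPACITY_SHIFT) / mcnt);
}

/* cpufreq-based scale, as in implementation (2): curr_freq / max_freq. */
static unsigned long scale_from_cpufreq(uint64_t curr_freq, uint64_t max_freq)
{
	return (unsigned long)((curr_freq << SCHED_CAPACITY_SHIFT) / max_freq);
}
```

Plugging in the traced values reproduces the reported scales: acnt=200073, mcnt=362360 gives 565, and curr_freq=2133000, max_freq=2400000 gives 910.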

2016-03-16 18:36:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

On Wed, Mar 16, 2016 at 11:20:45AM -0700, Steve Muckle wrote:
> On 03/16/2016 12:38 AM, Peter Zijlstra wrote:
> > Somewhere in the giant discussions I mentioned that we should be looking
> > at a CPPC like interface and pass {min,max} tuples to the cpufreq
> > selection thingy.
> >
> > In that same discussion I also mentioned that we must compute min as the
> > hard dl reservation, but that for max we can actually use the avg dl +
> > avg rt + avg cfs.
> >
> > That way there is far more room for selecting a sensible frequency.
>
> Doesn't the above min/max policy mean that the platform will likely
> underserve the task load? If avg dl+rt+cfs represents our best estimate
> of the work to be done, I would think that should be the min.

Can't be the min; avg_dl might be (and typically will be) much lower than
the worst case utilization estimates.

However if we use that as our min, peaks in DL utilization will not
complete, because we run at too low a frequency.

Therefore, the min must be given by our worst case utilization
reservation, not by the actual avg consumed.
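The {min,max} scheme above can be sketched as follows; the struct and field names are illustrative, not from any posted patch. min is the hard DL reservation so deadline peaks always complete, while max sums the per-class average utilization:

```c
#include <assert.h>

/* Illustrative CPPC-style capacity request tuple. */
struct capacity_request {
	unsigned long min;	/* floor: hard DL bandwidth reservation */
	unsigned long max;	/* ceiling hint: avg dl + avg rt + avg cfs */
};

static struct capacity_request
compute_request(unsigned long dl_hard, unsigned long dl_avg,
		unsigned long rt_avg, unsigned long cfs_avg)
{
	struct capacity_request req = {
		.min = dl_hard,
		.max = dl_avg + rt_avg + cfs_avg,
	};
	return req;
}
```

Steve's follow-up in this thread amounts to changing .min to dl_hard + rt_avg + cfs_avg so that the CFS and RT requests cannot be underserved either.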

2016-03-16 19:12:26

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH 6/8] cpufreq/schedutil: sum per-sched class utilization

On 03/16/2016 11:36 AM, Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 11:20:45AM -0700, Steve Muckle wrote:
>> On 03/16/2016 12:38 AM, Peter Zijlstra wrote:
>>> Somewhere in the giant discussions I mentioned that we should be looking
>>> at a CPPC like interface and pass {min,max} tuples to the cpufreq
>>> selection thingy.
>>>
>>> In that same discussion I also mentioned that we must compute min as the
>>> hard dl reservation, but that for max we can actually use the avg dl +
>>> avg rt + avg cfs.
>>>
>>> That way there is far more room for selecting a sensible frequency.
>>
>> Doesn't the above min/max policy mean that the platform will likely
>> underserve the task load? If avg dl+rt+cfs represents our best estimate
>> of the work to be done, I would think that should be the min.
>
> Can't be the min; avg_dl might be (and typically will be) much lower than
> the worst case utilization estimates.
>
> However if we use that as our min, peaks in DL utilization will not
> complete, because we run at too low a frequency.

Doesn't that mean the max (if one is specified) should also use hard dl?
I.e. hard dl + rt + cfs.

> Therefore, the min must be given by our worst case utilization
> reservation, not by the actual avg consumed.

Ok sure. My concern was more about the platform potentially ignoring the
CFS and RT capacity requests. So given the point about DL bw I'd think
the min would then be hard dl + rt + cfs.

2016-03-16 19:44:38

by Dietmar Eggemann

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On 15/03/16 20:46, Michael Turquette wrote:
> Quoting Dietmar Eggemann (2016-03-15 12:13:58)
>> On 14/03/16 05:22, Michael Turquette wrote:

[...]

>> For me this independence of the scheduler code towards the actual
>> implementation of the Frequency Invariant Engine (FEI) was actually a
>> feature.
>
> I do not agree that it is a strength; I think it is confusing. My
> opinion is that cpufreq drivers should implement
> arch_scale_freq_capacity. Having a sane fallback
> (cpufreq_scale_freq_capacity) simply means that you can remove the
> boilerplate from the arm32 and arm64 code, which is a win.
>
> Furthermore, if we have multiple competing implementations of
> arch_scale_freq_invariance, wouldn't it be better for all of them to
> live in cpufreq drivers? This means we would only need to implement a
> single run-time "selector".
>
> On the other hand, if the implementation lives in arch code and we have
> various implementations of arch_scale_freq_capacity within an
> architecture, then each arch would need to implement this selector
> function. Even worse then if we have a split where some implementations
> live in drivers/cpufreq (e.g. intel_pstate) and others in arch/arm and
> others in arch/arm64 ... now we have three selectors.

OK, now I see your point. What I don't understand is why you would want
different foo_scale_freq_capacity() implementations per cpufreq
driver. IMHO we want the cpufreq.c based implementation precisely to
abstract from that (at least for target_index() cpufreq drivers).

intel_pstate (setpolicy()) is an exception but my humble guess is that
systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.

> Note that this has nothing to do with cpu microarch invariance. I'm
> happy for that to stay in arch code because we can have heterogeneous
> cpus that do not scale frequency, and thus would not enable cpufreq.
> But if your platform scales cpu frequency, then really cpufreq should be
> in the loop.

Agreed.

>
>>
>> In EAS RFC5.2 (linux-arm.org/linux-power.git energy_model_rfc_v5.2 ,
>> which hasn't been posted to LKML) we establish the link in the ARCH code
>> (arch/arm64/include/asm/topology.h).
>
> Right, sorry again about preemptively posting the patch. Total brainfart
> on my part.
>
>>
>> #ifdef CONFIG_CPU_FREQ
>> #define arch_scale_freq_capacity cpufreq_scale_freq_capacity
>> ...
>> +#endif
>
> The above is no longer necessary with this patch. Same question as
> above: why insist on the arch boilerplate?

OK.

[...]

2016-03-16 20:08:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote:
> intel_pstate (setpolicy()) is an exception but my humble guess is that
> systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.

A quick browse of the Intel SDM says you're right. It looks like
everything after Pentium-M; so Core-Solo/Core-Duo and onwards have
APERF/MPERF.

And it looks like P6 class systems didn't have DVFS support at all,
which basically leaves P4 and Pentium-M as the only chips to have DVFS
support lacking APERF/MPERF.

And while I haven't got any P4 based space heaters left, I might still
have a Pentium-M class laptop somewhere (if it still boots).

2016-03-16 21:30:45

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 8/8] sched: prefer cpufreq_scale_freq_capacity

On Wednesday, March 16, 2016 09:07:52 PM Peter Zijlstra wrote:
> On Wed, Mar 16, 2016 at 07:44:33PM +0000, Dietmar Eggemann wrote:
> > intel_pstate (setpolicy()) is an exception but my humble guess is that
> > systems with intel_pstate driver have X86_FEATURE_APERFMPERF support.
>
> A quick browse of the Intel SDM says you're right. It looks like
> everything after Pentium-M; so Core-Solo/Core-Duo and onwards have
> APERF/MPERF.
>
> And it looks like P6 class systems didn't have DVFS support at all,
> which basically leaves P4 and Pentium-M as the only chips to have DVFS
> support lacking APERF/MPERF.
>
> And while I haven't got any P4 based space heaters left, I might still
> have a Pentium-M class laptop somewhere (if it still boots).

intel_pstate depends on APERF/MPERF, so it won't work with anything
without them.

2016-03-16 22:03:59

by Michael Turquette

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Quoting Steve Muckle (2016-03-15 20:36:57)
> On 03/15/2016 03:37 PM, Michael Turquette wrote:
> >>>> Yuck sysfs.. I would really rather we did not expose this per default.
> >>>> > > > And certainly not in this weird form.
> >>> > >
> >>> > > I'm happy to change capacity_margin to up_threshold and use a
> >>> > > percentage.
> >>> > >
> >>> > > The sysfs approach has two benefits. First, it is aligned with cpufreq
> >>> > > user expectations. Second, there has been rough consensus that this
> >>> > > value should be tunable and sysfs gets us there quickly and painlessly.
> >>> > > We're already exporting rate_limit_us for schedutil via sysfs. Is there
> >>> > > a better interface you can recommend?
> >> >
> >> > It really depends on how tunable you want this to be. Do we always want
> >> > this to be a tunable, or just now while we're playing about with the
> >> > whole thing?
> >
> > I had considered this myself, and I really think that Steve and Juri
> > should chime in as they have spent more time tuning and running the
> > numbers.
> >
> > I'm inclined to think that a debug version would be good enough, as I
> > don't imagine this value being changed at run-time by some userspace
> > daemon or something.
> >
> > Then again, maybe this knob will be part of the mythical
> > power-vs-performance slider?
>
> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> mythical slider) aims to provide a more sophisticated interface for
> tuning scheduler-driven frequency selection. In addition to a global
> boost value it includes a cgroup controller as well for per-task tuning.

/me spends 15 seconds looking at schedtune

>
> I would definitely expect the margin/boost value to be modified at
> runtime, for example if the battery is running low, or the user wants
> 100% performance for a while, or the userspace framework wants to
> temporarily tailor the performance level for a particular set of tasks, etc.

Right, and it looks like schedtune is a kernel solution, not a
userspace solution. The following three interfaces from patch #2 could
be used by schedtune:

unsigned long cpufreq_get_cfs_capacity_margin(void);
void cpufreq_set_cfs_capacity_margin(unsigned long margin);
void cpufreq_reset_cfs_capacity_margin(void);

Then we can let schedtune worry about the userspace abi.
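A sketch of how a boost framework might drive those helpers; the stub bodies below merely stand in for the real cpufreq implementations from patch #2, and the default margin value (1280, i.e. +25% over SCHED_CAPACITY_SCALE) is an assumption for illustration:

```c
#include <assert.h>

/* Stand-ins for the patch #2 helpers, so the usage pattern can be
 * shown self-contained; the real ones live in drivers/cpufreq. */
static unsigned long cfs_capacity_margin = 1280;	/* illustrative */
static const unsigned long cfs_capacity_margin_default = 1280;

unsigned long cpufreq_get_cfs_capacity_margin(void)
{
	return cfs_capacity_margin;
}

void cpufreq_set_cfs_capacity_margin(unsigned long margin)
{
	cfs_capacity_margin = margin;
}

void cpufreq_reset_cfs_capacity_margin(void)
{
	cfs_capacity_margin = cfs_capacity_margin_default;
}

/* A schedtune-like consumer: raise the margin while a boost is
 * requested, then restore the default. Hypothetical functions. */
void boost_begin(unsigned long boosted_margin)
{
	cpufreq_set_cfs_capacity_margin(boosted_margin);
}

void boost_end(void)
{
	cpufreq_reset_cfs_capacity_margin();
}
```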

So I'll keep the basic idea of this patch, but explore making it
debuggy, instead of sysfsy.

Regards,
Mike

>
> [0] http://article.gmane.org/gmane.linux.kernel/2022959

2016-03-16 22:05:16

by Michael Turquette

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Quoting Steve Muckle (2016-03-16 10:55:49)
> On 03/16/2016 03:02 AM, Juri Lelli wrote:
> > Hi,
> >
> > On 16/03/16 09:05, Peter Zijlstra wrote:
> >> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> >>>> Then again, maybe this knob will be part of the mythical
> >>>> power-vs-performance slider?
> >>>
> >>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> >>> mythical slider) aims to provide a more sophisticated interface for
> >>> tuning scheduler-driven frequency selection. In addition to a global
> >>> boost value it includes a cgroup controller as well for per-task tuning.
> >>>
> >>> I would definitely expect the margin/boost value to be modified at
> >>> runtime, for example if the battery is running low, or the user wants
> >>> 100% performance for a while, or the userspace framework wants to
> >>> temporarily tailor the performance level for a particular set of tasks, etc.
> >>
> >> OK, so how about we start with it as a debug knob, and once we have
> >> experience and feel like it is indeed a useful runtime knob, we upgrade
> >> it to ABI.
> >>
> >
> > I tend to agree here. To me the margin is something that we need to make
> > this thing work and to get acceptable performance out of the box. So we
> > can play with it while debugging, but I consider the schedtune slider as
> > the way to tune the system at runtime.
>
> Could the default schedtune value not serve as the out of the box margin?

It can. Let's keep the kernel interfaces in patch #2 for changing the
margin/threshold, and schedtune can call these interfaces.

Regards,
Mike

>
> Regardless I agree that a debug interface is the way to go for now while
> we figure things out.
>

2016-03-16 22:12:44

by Michael Turquette

[permalink] [raw]
Subject: Re: [PATCH 2/8] sched/fair: add margin to utilization update

Quoting Steve Muckle (2016-03-15 19:52:59)
> On 03/13/2016 10:22 PM, Michael Turquette wrote:
> > +unsigned long cfs_capacity_margin = CAPACITY_MARGIN_DEFAULT;
> > +
> > #ifdef CONFIG_CFS_BANDWIDTH
> > /*
> > * Amount of runtime to allocate from global (tg) to local (per-cfs_rq) pool
> > @@ -2840,6 +2853,8 @@ static inline void update_load_avg(struct sched_entity *se, int update_tg)
> >
> > if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
> > unsigned long max = rq->cpu_capacity_orig;
> > + unsigned long cap = cfs_rq->avg.util_avg *
> > + cfs_capacity_margin / max;
>
> Doesn't rq->cpu_capacity_orig get scaled per the microarch invariance?
> This would mean that the margin we're applying here would differ based
> on that.
>
> I'd expect that the margin would be * (cfs_capacity_margin /
> SCHED_CAPACITY_SCALE) which would then reduce the division into a shift.

Will fix.

Thanks,
Mike
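For clarity, Steve's suggested form sketched stand-alone: scaling the margin by SCHED_CAPACITY_SCALE rather than by cpu_capacity_orig keeps the boost independent of the microarch invariance and turns the division into a shift. The value of CAPACITY_MARGIN_DEFAULT here is an assumed +25% (1280/1024), and the helper name is illustrative:

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define CAPACITY_MARGIN_DEFAULT	1280UL	/* illustrative: +25% headroom */

/* Boost util_avg by the margin; the divisor is the fixed capacity
 * scale, so the operation reduces to a shift. */
static unsigned long apply_margin(unsigned long util_avg)
{
	return (util_avg * CAPACITY_MARGIN_DEFAULT) >> SCHED_CAPACITY_SHIFT;
}
```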

2016-03-17 09:39:34

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On 16/03/16 10:55, Steve Muckle wrote:
> On 03/16/2016 03:02 AM, Juri Lelli wrote:
> > Hi,
> >
> > On 16/03/16 09:05, Peter Zijlstra wrote:
> >> On Tue, Mar 15, 2016 at 08:36:57PM -0700, Steve Muckle wrote:
> >>>> Then again, maybe this knob will be part of the mythical
> >>>> power-vs-performance slider?
> >>>
> >>> Patrick Bellasi's schedtune series [0] (which I think is the referenced
> >>> mythical slider) aims to provide a more sophisticated interface for
> >>> tuning scheduler-driven frequency selection. In addition to a global
> >>> boost value it includes a cgroup controller as well for per-task tuning.
> >>>
> >>> I would definitely expect the margin/boost value to be modified at
> >>> runtime, for example if the battery is running low, or the user wants
> >>> 100% performance for a while, or the userspace framework wants to
> >>> temporarily tailor the performance level for a particular set of tasks, etc.
> >>
> >> OK, so how about we start with it as a debug knob, and once we have
> >> experience and feel like it is indeed a useful runtime knob, we upgrade
> >> it to ABI.
> >>
> >
> > I tend to agree here. To me the margin is something that we need to make
> > this thing work and to get acceptable performance out of the box. So we
> > can play with it while debugging, but I consider the schedtune slider as
> > the way to tune the system at runtime.
>
> Could the default schedtune value not serve as the out of the box margin?
>

I'm not sure I understand you here. For me schedtune should be disabled
by default, so I'd say that it doesn't introduce any additional margin
by default. But we still need a margin to make the governor work without
schedtune in the mix.

> Regardless I agree that a debug interface is the way to go for now while
> we figure things out.
>

Looks good to me.

Best,

- Juri

2016-03-17 13:55:40

by Steve Muckle

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On 03/17/2016 02:40 AM, Juri Lelli wrote:
>> Could the default schedtune value not serve as the out of the box margin?
>>
> I'm not sure I understand you here. For me schedtune should be disabled
> by default, so I'd say that it doesn't introduce any additional margin
> by default. But we still need a margin to make the governor work without
> schedtune in the mix.

Why not have schedtune be enabled always, and use it to add the margin?
It seems like it'd simplify things.

I haven't looked at the schedtune code at all so I don't know whether
this makes sense given its current implementation. But conceptually I
don't know why we'd need or want one margin in schedutil which will be
tunable, and then another mechanism for tuning as well.

2016-03-17 15:54:11

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On 17-Mar 06:55, Steve Muckle wrote:
> On 03/17/2016 02:40 AM, Juri Lelli wrote:
> >> Could the default schedtune value not serve as the out of the box margin?
> >>
> > I'm not sure I understand you here. For me schedtune should be disabled
> > by default, so I'd say that it doesn't introduce any additional margin
> > by default. But we still need a margin to make the governor work without
> > schedtune in the mix.
>
> Why not have schedtune be enabled always, and use it to add the margin?
> It seems like it'd simplify things.

Actually one of the effects we noticed when SchedTune and SchedFreq
are both in use is that we have a sort of "double boosting" effect.

SchedTune boosts the CPU utilization signal, thus already providing a
sort of margin for the selection of the OPP. This margin overlaps with
the SchedFreq margin, which in turn could result in the selection of
an OPP even higher than required (with the boost already accounted for).

> I haven't looked at the schedtune code at all so I don't know whether
> this makes sense given its current implementation.

The current implementation requires review, of course ;-)
Last (and only) posting is based on top of SchedFreq code, as it was
at that time.

> But conceptually I don't know why we'd need or want one margin in
> schedutil which will be tunable, and then another mechanism for
> tuning as well.

I agree with Steve on the conceptual standpoint. The main goal of
SchedTune is actually to provide a "single tunable" to bias many
different subsystems in a "consistent" way. Thus, from a conceptual
standpoint, IMO it makes sense to investigate how the boost value
can be linked with SchedFreq.

A possible option can be to:
1. use a hardcoded margin (M) defined by SchedFreq
this margin is used to trigger OPP jumps
when SchedTune _is not_ in use
2. "compose" the M margin with a boost-value-defined margin (B)
when SchedTune _is_ in use

This means, e.g.
schedfreq_margin = max(M, B)
Thus:
a) non-boosted tasks (and in general when SchedTune is not in use)
get OPP jumps based on the hardcoded M margin
b) boosted tasks can get more aggressive OPP jumps based on the B
margin

While the M margin is hardcoded, the B one is defined via CGroups
depending on how much tasks need to be boosted.
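The composition rule above is small enough to sketch directly; the constant and names are illustrative:

```c
#include <assert.h>

/* Hardcoded governor margin M, always present. Illustrative value. */
#define SCHEDFREQ_MARGIN_M 1280UL

/* Compose M with a SchedTune-supplied boost margin B: the larger of
 * the two wins, so boosting can only make OPP jumps more aggressive. */
static unsigned long compose_margin(unsigned long boost_margin_b)
{
	return boost_margin_b > SCHEDFREQ_MARGIN_M ?
		boost_margin_b : SCHEDFREQ_MARGIN_M;
}
```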

--
#include <best/regards.h>

Patrick Bellasi

2016-03-17 17:52:58

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Hi,

On 17/03/16 15:53, Patrick Bellasi wrote:
> On 17-Mar 06:55, Steve Muckle wrote:
> > On 03/17/2016 02:40 AM, Juri Lelli wrote:
> > >> Could the default schedtune value not serve as the out of the box margin?
> > >>
> > > I'm not sure I understand you here. For me schedtune should be disabled
> > > by default, so I'd say that it doesn't introduce any additional margin
> > > by default. But we still need a margin to make the governor work without
> > > schedtune in the mix.
> >
> > Why not have schedtune be enabled always, and use it to add the margin?
> > It seems like it'd simplify things.
>
> Actually one of the effects we noticed when SchedTune and SchedFreq
> are both in use is that we have a sort of "double boosting" effect.
>
> SchedTune boosts the CPU utilization signal, thus already providing a
> sort of margin for the selection of the OPP. This margin overlaps with
> the SchedFreq margin, which in turn could result in the selection of
> an OPP even higher than required (with the boost already accounted for).
>
> > I haven't looked at the schedtune code at all so I don't know whether
> > this makes sense given its current implementation.
>
> The current implementation requires review, of course ;-)
> Last (and only) posting is based on top of SchedFreq code, as it was
> at that time.
>
> > But conceptually I don't know why we'd need or want one margin in
> > schedutil which will be tunable, and then another mechanism for
> > tuning as well.
>
> I agree with Steve on the conceptual standpoint. The main goal of
> SchedTune is actually to provide a "single tunable" to bias many
> different subsystems in a "consistent" way. Thus, from a conceptual
> standpoint, IMO it makes sense to investigate how the boost value
> can be linked with SchedFreq.
>
> A possible option can be to:
> 1. use a hardcoded margin (M) defined by SchedFreq
> this margin is used to trigger OPP jumps
> when SchedTune _is not_ in use
> 2. "compose" the M margin with a boost value defined margin (B)
> when SchedTune _is_ in use
>
> This means, e.g.
> schedfreq_margin = max(M, B)
> Thus:
> a) non boosted tasks (and in general when SchedTune is not in use)
> get OPP jumps based on the hardcoded M margin
> b) boosted tasks can get more aggressive OPP jumps based on the B
> margin
>
> While the M margin is hardcoded, the B one is defined via CGroups
> depending on how much tasks need to be boosted.
>

Makes sense to me. And I think the M margin is the one we don't want to
make part of the ABI; we should only play with it under DEBUG.

Best,

- Juri

2016-03-17 18:56:14

by Michael Turquette

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

Quoting Juri Lelli (2016-03-17 10:54:07)
> Hi,
>
> On 17/03/16 15:53, Patrick Bellasi wrote:
> > On 17-Mar 06:55, Steve Muckle wrote:
> > > On 03/17/2016 02:40 AM, Juri Lelli wrote:
> > > >> Could the default schedtune value not serve as the out of the box margin?
> > > >>
> > > > I'm not sure I understand you here. For me schedtune should be disabled
> > > > by default, so I'd say that it doesn't introduce any additional margin
> > > > by default. But we still need a margin to make the governor work without
> > > > schedtune in the mix.
> > >
> > > Why not have schedtune be enabled always, and use it to add the margin?
> > > It seems like it'd simplify things.
> >
> > Actually one of the effects we noticed when SchedTune and SchedFreq
> > are both in use is that we have a sort of "double boosting" effect.
> >
> > SchedTune boosts the CPU utilization signal, thus already providing a
> > sort of margin for the selection of the OPP. This margin overlaps with
> > the SchedFreq margin, which in turn could result in the selection of
> > an OPP even higher than required (with the boost already accounted for).
> >
> > > I haven't looked at the schedtune code at all so I don't know whether
> > > this makes sense given its current implementation.
> >
> > The current implementation requires review, of course ;-)
> > Last (and only) posting is based on top of SchedFreq code, as it was
> > at that time.
> >
> > > But conceptually I don't know why we'd need or want one margin in
> > > schedutil which will be tunable, and then another mechanism for
> > > tuning as well.
> >
> > I agree with Steve on the conceptual standpoint. The main goal of
> > SchedTune is actually to provide a "single tunable" to bias many
> > different subsystems in a "consistent" way. Thus, from a conceptual
> > standpoint, IMO it makes sense to investigate how the boost value
> > can be linked with SchedFreq.
> >
> > A possible option can be to:
> > 1. use a hardcoded margin (M) defined by SchedFreq
> > this margin is used to trigger OPP jumps
> > when SchedTune _is not_ in use
> > 2. "compose" the M margin with a boost value defined margin (B)
> > when SchedTune _is_ in use
> >
> > This means, e.g.
> > schedfreq_margin = max(M, B)
> > Thus:
> > a) non boosted tasks (and in general when SchedTune is not in use)
> > get OPP jumps based on the hardcoded M margin
> > b) boosted tasks can get more aggressive OPP jumps based on the B
> > margin
> >
> > While the M margin is hardcoded, the B one is defined via CGroups
> > depending on how much tasks need to be boosted.
> >
>
> Makes sense to me. And I think M margin is the one we don't want to make
> part of the ABI and only play with it under DEBUG.

Correct.

Regarding "composing" the margin, schedtune could even overwrite the
margin entirely via cpufreq_set_cfs_capacity_margin (see patch #2 in
this series). This avoids complications around a "double boosting"
effect.

Either way, it sounds like the schedtune angle is something that we can
figure out in due time and change the code as needed later on. For
schedutil to make sense for frequency-invariant platforms we do need a
margin today, and there is desire to tune it easily, so I will move this
sysfs knob to a debug knob in v2.

Regards,
Mike

>
> Best,
>
> - Juri

2016-03-17 22:34:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH 4/8] cpufreq/schedutil: sysfs capacity margin tunable

On Thu, Mar 17, 2016 at 7:56 PM, Michael Turquette
<[email protected]> wrote:
> Quoting Juri Lelli (2016-03-17 10:54:07)
>> Hi,
>>
>> On 17/03/16 15:53, Patrick Bellasi wrote:
>> > On 17-Mar 06:55, Steve Muckle wrote:
>> > > On 03/17/2016 02:40 AM, Juri Lelli wrote:
>> > > >> Could the default schedtune value not serve as the out of the box margin?
>> > > >>
>> > > > I'm not sure I understand you here. For me schedtune should be disabled
>> > > > by default, so I'd say that it doesn't introduce any additional margin
>> > > > by default. But we still need a margin to make the governor work without
>> > > > schedtune in the mix.
>> > >
>> > > Why not have schedtune be enabled always, and use it to add the margin?
>> > > It seems like it'd simplify things.
>> >
>> > Actually one of the effects we noticed when SchedTune and SchedFreq
>> > are both in use is that we have a sort of "double boosting" effect.
>> >
>> > SchedTune boosts the CPU utilization signal, thus already providing a
>> > sort of margin for the selection of the OPP. This margin overlaps with
>> > the SchedFreq margin, which in turn could result in the selection of
>> > an OPP even higher than required (with the boost already accounted for).
>> >
>> > > I haven't looked at the schedtune code at all so I don't know whether
>> > > this makes sense given its current implementation.
>> >
>> > The current implementation requires review, of course ;-)
>> > Last (and only) posting is based on top of SchedFreq code, as it was
>> > at that time.
>> >
>> > > But conceptually I don't know why we'd need or want one margin in
>> > > schedutil which will be tunable, and then another mechanism for
>> > > tuning as well.
>> >
>> > I agree with Steve on the conceptual standpoint. The main goal of
>> > SchedTune is actually to provide a "single tunable" to bias many
>> > different subsystems in a "consistent" way. Thus, from a conceptual
>> > standpoint, IMO it makes sense to investigate how the boost value
>> > can be linked with SchedFreq.
>> >
>> > A possible option can be to:
>> > 1. use a hardcoded margin (M) defined by SchedFreq
>> > this margin is used to trigger OPP jumps
>> > when SchedTune _is not_ in use
>> > 2. "compose" the M margin with a boost value defined margin (B)
>> > when SchedTune _is_ in use
>> >
>> > This means, e.g.
>> > schedfreq_margin = max(M, B)
>> > Thus:
>> > a) non boosted tasks (and in general when SchedTune is not in use)
>> > get OPP jumps based on the hardcoded M margin
>> > b) boosted tasks can get more aggressive OPP jumps based on the B
>> > margin
>> >
>> > While the M margin is hardcoded, the B one is defined via CGroups
>> > depending on how much tasks need to be boosted.
>> >
>>
>> Makes sense to me. And I think M margin is the one we don't want to make
>> part of the ABI and only play with it under DEBUG.
>
> Correct.
>
> Regarding "composing" the margin, schedtune could even overwrite the
> margin entirely via cpufreq_set_cfs_capacity_margin (see patch #2 in
> this series). This avoids complications around a "double boosting"
> effect.
>
> Either way, it sounds like the schedtune angle is something that we can
> figure out in due time and change the code as needed later on. For
> schedutil to make sense for frequency-invariant platforms we do need a
> margin today, and there is desire to tune it easily, so I will move this
> sysfs knob to a debug knob in v2.

Sounds good!

Also, if you look at the latest iteration of the schedutil patch
(https://patchwork.kernel.org/patch/8612561/), it maps the choice of
the margin to the choice of the frequency tipping point. That is, the
value of (util / max) for which the frequency will stay the same as it
was before. [For (util / max) below the tipping point the new
frequency will be less than the old one (unless it already is minimum)
and for (util / max) above it, the new frequency will be greater than
the old one.]

The tipping point seems to be a good candidate for a tunable to me,
because its meaning is well defined and the range of values that make
sense is quite easy to figure out too.
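A toy integer sketch of that tipping-point behaviour, assuming a relation of the form next_freq = C * max_freq * util / max with C = 5/4 (both the form and the constant are illustrative, not the patch's exact code): at util/max = 4/5 the frequency is unchanged, below it the result drops, above it it rises.

```c
#include <assert.h>

/* Illustrative next-frequency rule with margin C = 5/4, giving a
 * tipping point at util/max = 4/5 = 0.8. */
static unsigned long long next_freq(unsigned long long max_freq,
				    unsigned long long util,
				    unsigned long long max)
{
	return max_freq * util * 5 / (max * 4);
}
```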