Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy and achieve lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion of this integration see [0].
This patch series implements a cpufreq governor which collects CPU
capacity requests from the fair, realtime, and deadline scheduling
classes. The fair and realtime scheduling classes are modified to make
these requests. The deadline class is not yet modified to make CPU
capacity requests.
The last RFC posting of this was RFCv5 [1] as part of a larger posting
including energy-aware scheduling. Scheduler-driven CPU frequency
scaling is contained in patches 37-46 of [1]. Changes in this series
since RFCv5:
- the API to request CPU capacity changes is extended beyond the fair
scheduling class to the realtime and deadline classes
- the realtime class is modified to make CPU capacity requests
- recalculated capacity is converted to a supported target frequency
to test if a frequency change is actually required
- allow any CPU to change the frequency domain capacity, not just a
CPU that is driving the maximum capacity in the frequency domain
- cpufreq_driver_might_sleep has been changed to cpufreq_driver_is_slow,
since it is possible a driver may not sleep but still be too slow to
be called in scheduler hot paths
- capacity requests which occur while throttled are no longer lost
- cleanups based on RFCv5 lkml feedback
- initialization, static key management fixes
Profiling results:
Performance profiling has been done by using rt-app [2] to generate
various periodic workloads with a particular duty cycle. The time to
complete the busy portion of the duty cycle is measured and overhead
is calculated as
overhead = (busy_duration_test_gov - busy_duration_perf_gov)/
(busy_duration_pwrsave_gov - busy_duration_perf_gov)
This shows as a percentage how close the governor is to running the
workload at fmin (100%) or fmax (0%). The number of times the busy
duration exceeds the period of the periodic workload (an "overrun") is
also recorded. In the table below the performance of the ondemand
(sampling_rate = 20ms), interactive (default tunables), and
scheduler-driven governors are evaluated using these metrics. The test
platform is a Samsung Chromebook 2 ("Peach Pi"). The workload is
affined to CPU0, an A15 with an fmin of 200MHz and an fmax of
2GHz. The interactive governor was incorporated/adapted from [3]. A
branch with the interactive governor and a few required dependency
patches for ARM is available at [4].
More detailed explanation of the columns below:
run: duration at fmax of the busy portion of the periodic workload in msec
period: duration of the entire period of the periodic workload in msec
loops: number of iterations of the periodic workload tested
OR: number of instances of overrun as described above
OH: overhead as calculated above
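For reference, the overhead metric can be computed as in the following
userspace sketch (the helper name is illustrative; rt-app itself does
not provide this function):

```c
#include <assert.h>

/*
 * Overhead of a governor as a percentage: 0% means its busy durations
 * matched the performance governor (fmax), 100% means they matched the
 * powersave governor (fmin).
 */
static double governor_overhead(double busy_test, double busy_perf,
				double busy_pwrsave)
{
	return 100.0 * (busy_test - busy_perf) /
	       (busy_pwrsave - busy_perf);
}
```

A governor whose busy durations sit midway between those of the
performance and powersave governors scores 50%.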
SCHED_OTHER workload:
wload parameters       ondemand      interactive       sched
  run period  loops    OR      OH    OR      OH    OR      OH
    1    100    100     0  51.83%     0  99.74%     0  99.76%
   10   1000     10     0  24.73%     0  19.41%     0  50.09%
    1     10   1000     0  19.34%     0  62.81%     7  62.85%
   10    100    100     0  11.20%     0  15.84%     0  33.48%
  100   1000     10     0   1.62%     0   1.82%     0   6.64%
    6     33    300     0  13.73%     0   7.98%     1  33.32%
   66    333     30     0   1.87%     0   3.11%     0  12.39%
    4     10   1000     1   6.08%     1  10.92%     3   6.63%
   40    100    100     0   0.98%     0   0.06%     1   2.92%
  400   1000     10     0   0.40%     0   0.50%     0   1.26%
    5      9   1000     1   3.38%     2   5.87%     6   3.76%
   50     90    100     0   1.78%     0   0.03%     1   1.56%
  500    900     10     0   0.32%     0   0.37%     0   1.64%
    9     12   1000     2   1.57%     1   0.16%     3   0.47%
   90    120    100     0   1.25%     0   0.02%     1   0.45%
  900   1200     10     0   0.19%     0   0.24%     0   0.87%
SCHED_FIFO workload:
wload parameters       ondemand      interactive       sched
  run period  loops    OR      OH    OR      OH    OR      OH
    1    100    100     0  65.10%     0  99.84%     0 100.00%
   10   1000     10     0  96.01%     0  21.08%     0  87.88%
    1     10   1000     0  14.11%     0  61.98%     0  62.53%
   10    100    100    34  49.89%     0  14.28%     0  68.58%
  100   1000     10     1  46.29%     0   1.89%     0  23.78%
    6     33    300    50  25.36%     0   8.20%     2  33.42%
   66    333     30    10  34.97%     0   3.02%     0  27.07%
    4     10   1000     0   5.62%     0  11.00%     9  10.94%
   40    100    100     8  10.02%     0   0.11%     1  10.65%
  400   1000     10     3   8.17%     0   0.50%     0   6.27%
    5      9   1000     1   3.21%     1   5.79%    11   4.79%
   50     90    100    12   8.44%     0   0.03%     1   4.74%
  500    900     10     4   8.72%     0   0.41%     0   4.05%
    9     12   1000    48   1.94%     0   0.01%    10   0.79%
   90    120    100    27   6.19%     0   0.01%     1   1.44%
  900   1200     10     5   4.95%     0   0.22%     0   1.83%
Note that at this point RT CPU capacity is measured via rt_avg. For
the above results sched_time_avg_ms has been set to 50ms.
Known issues:
- The sched governor suffers more overruns with SCHED_OTHER than ondemand
or interactive. This is likely due to PELT's slow responsiveness but
more analysis is required.
- More testing with real world type workloads, such as UI workloads and
benchmarks, is required.
- The power side of the characterization is yet to be done.
- The locking in cpufreq will be improved in a separate patchset. Once
that is complete this series will be updated so the hot path relies
only on RCU read locking.
- Deadline scheduling class does not yet make CPU capacity requests.
- Throttling is not yet supported on platforms with fast cpufreq
drivers.
Dependencies:
Frequency invariant load tracking is required. For heterogeneous
systems such as big.LITTLE, CPU invariant load tracking is required as
well. The required support for ARM platforms along with a patch
creating tracepoints for cpufreq_sched is located in [5].
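To illustrate what frequency invariance provides, here is a minimal
userspace model of the capacity available at the current OPP (mirroring
capacity_curr_of() introduced later in this series; it assumes
arch_scale_freq_capacity() returns freq_cur/freq_max as a fixed-point
ratio out of 1024):

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10

/*
 * Capacity available at the current OPP: max capacity scaled by the
 * current/max frequency ratio, both in fixed point out of 1024.
 */
static unsigned long capacity_curr(unsigned long capacity_orig,
				   unsigned long freq_scale)
{
	return capacity_orig * freq_scale >> SCHED_CAPACITY_SHIFT;
}
```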
References:
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[1] https://lkml.org/lkml/2015/7/7/754
[2] https://git.linaro.org/power/rt-app.git
[3] https://lkml.org/lkml/2015/10/28/782
[4] https://git.linaro.org/people/steve.muckle/kernel.git/shortlog/refs/heads/interactive
[5] https://git.linaro.org/people/steve.muckle/kernel.git/shortlog/refs/heads/sched-freq-rfcv6
Juri Lelli (3):
sched/fair: add triggers for OPP change requests
sched/{core,fair}: trigger OPP change request on fork()
sched/fair: cpufreq_sched triggers for load balancing
Michael Turquette (2):
cpufreq: introduce cpufreq_driver_is_slow
sched: scheduler-driven cpu frequency selection
Morten Rasmussen (1):
sched: Compute cpu capacity available at current frequency
Steve Muckle (1):
sched/fair: jump to max OPP when crossing UP threshold
Vincent Guittot (3):
sched: remove call of sched_avg_update from sched_rt_avg_update
sched: deadline: use deadline bandwidth in scale_rt_capacity
sched: rt scheduler sets capacity requirement
drivers/cpufreq/Kconfig | 20 +++
drivers/cpufreq/cpufreq.c | 6 +
include/linux/cpufreq.h | 12 ++
include/linux/sched.h | 8 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 43 ++++-
kernel/sched/cpufreq_sched.c | 364 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/deadline.c | 33 +++-
kernel/sched/fair.c | 115 ++++++++------
kernel/sched/rt.c | 49 +++++-
kernel/sched/sched.h | 114 +++++++++++++-
11 files changed, 714 insertions(+), 51 deletions(-)
create mode 100644 kernel/sched/cpufreq_sched.c
--
2.4.10
From: Morten Rasmussen <[email protected]>
capacity_orig_of() returns the max available compute capacity of a cpu.
For scale-invariant utilization tracking and energy-aware scheduling
decisions it is useful to know the compute capacity available at the
current OPP of a cpu.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Morten Rasmussen <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/fair.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1093873..95b83c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4737,6 +4737,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static unsigned long capacity_curr_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig *
+ arch_scale_freq_capacity(NULL, cpu)
+ >> SCHED_CAPACITY_SHIFT;
+}
+
+/*
* Detect M:N waker/wakee relationships via a switching-frequency heuristic.
* A waker of many should wake a different task than the one last awakened
* at a frequency roughly N times higher than one of its wakees. In order
--
2.4.10
From: Michael Turquette <[email protected]>
Some architectures and platforms perform CPU frequency transitions
through a non-blocking method, while some might block or sleep. Even
when frequency transitions do not block or sleep they may be very slow.
This distinction is important when trying to change frequency from
a non-interruptible context in a scheduler hot path.
Describe this distinction with a cpufreq driver flag,
CPUFREQ_DRIVER_FAST. The default is to not have this flag set,
thus erring on the side of caution.
cpufreq_driver_is_slow() is also introduced in this patch. Setting
the above flag will allow this function to return false.
[[email protected]: change flag/API to include drivers that are too
slow for scheduler hot paths, in addition to those that block/sleep]
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Signed-off-by: Michael Turquette <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
drivers/cpufreq/cpufreq.c | 6 ++++++
include/linux/cpufreq.h | 9 +++++++++
2 files changed, 15 insertions(+)
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 7c48e73..8482820 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -154,6 +154,12 @@ bool have_governor_per_policy(void)
}
EXPORT_SYMBOL_GPL(have_governor_per_policy);
+bool cpufreq_driver_is_slow(void)
+{
+ return !(cpufreq_driver->flags & CPUFREQ_DRIVER_FAST);
+}
+EXPORT_SYMBOL_GPL(cpufreq_driver_is_slow);
+
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy)
{
if (have_governor_per_policy())
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index ef4c5b1..7f8c63d 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -159,6 +159,7 @@ u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy);
int cpufreq_get_policy(struct cpufreq_policy *policy, unsigned int cpu);
int cpufreq_update_policy(unsigned int cpu);
bool have_governor_per_policy(void);
+bool cpufreq_driver_is_slow(void);
struct kobject *get_governor_parent_kobj(struct cpufreq_policy *policy);
#else
static inline unsigned int cpufreq_get(unsigned int cpu)
@@ -316,6 +317,14 @@ struct cpufreq_driver {
*/
#define CPUFREQ_NEED_INITIAL_FREQ_CHECK (1 << 5)
+/*
+ * Indicates that it is safe to call cpufreq_driver_target from
+ * non-interruptable context in scheduler hot paths. Drivers must
+ * opt-in to this flag, as the safe default is that they might sleep
+ * or be too slow for hot path use.
+ */
+#define CPUFREQ_DRIVER_FAST (1 << 6)
+
int cpufreq_register_driver(struct cpufreq_driver *driver_data);
int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
--
2.4.10
From: Michael Turquette <[email protected]>
Scheduler-driven CPU frequency selection hopes to exploit both
per-task and global information in the scheduler to improve frequency
selection policy, achieving lower power consumption, improved
responsiveness/performance, and less reliance on heuristics and
tunables. For further discussion of the motivation for this integration
see [0].
This patch implements a shim layer between the Linux scheduler and the
cpufreq subsystem. The interface accepts capacity requests from the
CFS, RT and deadline sched classes. The requests from each sched class
are summed on each CPU with a margin applied to the CFS and RT
capacity requests to provide some headroom. Deadline requests are
expected to be precise enough given their nature to not require
headroom. The maximum total capacity request for a CPU in a frequency
domain drives the requested frequency for that domain.
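The aggregation described above can be modeled in userspace as follows
(a sketch of the arithmetic in update_cpu_capacity_request() and
update_fdomain_capacity_request(); the helper names here are
illustrative, and the real code additionally clamps the result to a
supported frequency table entry):

```c
#include <assert.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

/* 1280/1024 adds ~25% headroom to the CFS and RT requests. */
static const unsigned long capacity_margin = 1280;

/* Per-CPU total: margin applied to cfs+rt, deadline added unscaled. */
static unsigned long cpu_total_request(unsigned long cfs, unsigned long rt,
				       unsigned long dl)
{
	return (cfs + rt) * capacity_margin / SCHED_CAPACITY_SCALE + dl;
}

/* Frequency-domain request: max per-CPU total, scaled to a frequency. */
static unsigned long fdomain_freq(const unsigned long *totals, int ncpus,
				  unsigned long policy_max)
{
	unsigned long cap = 0;
	int i;

	for (i = 0; i < ncpus; i++)
		if (totals[i] > cap)
			cap = totals[i];
	return cap * policy_max >> SCHED_CAPACITY_SHIFT;
}
```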
Policy is determined by both the sched classes and this shim layer.
Note that this algorithm is event-driven: there is no polling loop to
check cpu idle time, nor any other mechanism that is unsynchronized with
the scheduler, aside from a throttling mechanism to ensure frequency
changes are not attempted faster than the hardware can accommodate them.
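The throttling mechanism mentioned above can be sketched as a simple
rate limiter (a userspace model of the gd->throttle/throttle_nsec
handling in this patch, not the exact kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A frequency transition is allowed only once the previous throttle
 * window (e.g. the driver's transition latency, 50ms by default) has
 * expired; allowing one restarts the window.
 */
struct throttle {
	uint64_t expires_ns;	/* end of the current throttle window */
	uint64_t window_ns;	/* throttle period length */
};

static bool transition_allowed(struct throttle *t, uint64_t now_ns)
{
	if (now_ns < t->expires_ns)
		return false;
	t->expires_ns = now_ns + t->window_ns;
	return true;
}
```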
Thanks to Juri Lelli <[email protected]> for contributing design ideas,
code and test results, and to Ricky Liang <[email protected]>
for initialization and static key inc/dec fixes.
[0] http://article.gmane.org/gmane.linux.kernel/1499836
[[email protected]: various additions and fixes, revised commit text]
CC: Ricky Liang <[email protected]>
Signed-off-by: Michael Turquette <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
drivers/cpufreq/Kconfig | 20 +++
include/linux/cpufreq.h | 3 +
include/linux/sched.h | 8 +
kernel/sched/Makefile | 1 +
kernel/sched/cpufreq_sched.c | 364 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 51 ++++++
6 files changed, 447 insertions(+)
create mode 100644 kernel/sched/cpufreq_sched.c
diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
index 659879a..6f2e96c 100644
--- a/drivers/cpufreq/Kconfig
+++ b/drivers/cpufreq/Kconfig
@@ -102,6 +102,14 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
Be aware that not all cpufreq drivers support the conservative
governor. If unsure have a look at the help section of the
driver. Fallback governor will be the performance governor.
+
+config CPU_FREQ_DEFAULT_GOV_SCHED
+ bool "sched"
+ select CPU_FREQ_GOV_SCHED
+ help
+ Use the CPUfreq governor 'sched' as default. This scales
+ cpu frequency using CPU utilization estimates from the
+ scheduler.
endchoice
config CPU_FREQ_GOV_PERFORMANCE
@@ -183,6 +191,18 @@ config CPU_FREQ_GOV_CONSERVATIVE
If in doubt, say N.
+config CPU_FREQ_GOV_SCHED
+ bool "'sched' cpufreq governor"
+ depends on CPU_FREQ
+ select CPU_FREQ_GOV_COMMON
+ help
+ 'sched' - this governor scales cpu frequency from the
+ scheduler as a function of cpu capacity utilization. It does
+ not evaluate utilization on a periodic basis (as ondemand
+ does) but instead is event-driven by the scheduler.
+
+ If in doubt, say N.
+
comment "CPU frequency scaling drivers"
config CPUFREQ_DT
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 7f8c63d..7e4bde1 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -495,6 +495,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
extern struct cpufreq_governor cpufreq_gov_conservative;
#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
+#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED)
+extern struct cpufreq_governor cpufreq_gov_sched;
+#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_sched)
#endif
/*********************************************************************
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3b0de68..d910a31 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -927,6 +927,14 @@ enum cpu_idle_type {
#define SCHED_CAPACITY_SHIFT 10
#define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
+struct sched_capacity_reqs {
+ unsigned long cfs;
+ unsigned long rt;
+ unsigned long dl;
+
+ unsigned long total;
+};
+
/*
* Wake-queues are lists of tasks with a pending wakeup, whose
* callers have already marked the task as woken internally,
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 6768797..90ed832 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
new file mode 100644
index 0000000..af8b5bc
--- /dev/null
+++ b/kernel/sched/cpufreq_sched.c
@@ -0,0 +1,364 @@
+/*
+ * Copyright (C) 2015 Michael Turquette <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include <linux/cpufreq.h>
+#include <linux/module.h>
+#include <linux/kthread.h>
+#include <linux/percpu.h>
+#include <linux/irq_work.h>
+#include <linux/delay.h>
+#include <linux/string.h>
+
+#include "sched.h"
+
+#define THROTTLE_NSEC 50000000 /* 50ms default */
+
+struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;
+static bool __read_mostly cpufreq_driver_slow;
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
+static struct cpufreq_governor cpufreq_gov_sched;
+#endif
+
+/*
+ * Capacity margin added to CFS and RT capacity requests to provide
+ * some head room if task utilization further increases.
+ */
+unsigned int capacity_margin = 1280;
+
+static DEFINE_PER_CPU(unsigned long, enabled);
+DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
+
+/**
+ * gov_data - per-policy data internal to the governor
+ * @throttle: next throttling period expiry. Derived from throttle_nsec
+ * @throttle_nsec: throttle period length in nanoseconds
+ * @task: worker thread for dvfs transition that may block/sleep
+ * @irq_work: callback used to wake up worker thread
+ * @requested_freq: last frequency requested by the sched governor
+ *
+ * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
+ * per-policy instance of it is created when the cpufreq_sched governor receives
+ * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
+ * member of struct cpufreq_policy.
+ *
+ * Readers of this data must call down_read(policy->rwsem). Writers must
+ * call down_write(policy->rwsem).
+ */
+struct gov_data {
+ ktime_t throttle;
+ unsigned int throttle_nsec;
+ struct task_struct *task;
+ struct irq_work irq_work;
+ unsigned int requested_freq;
+};
+
+static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy,
+ unsigned int freq)
+{
+ struct gov_data *gd = policy->governor_data;
+
+ /* avoid race with cpufreq_sched_stop */
+ if (!down_write_trylock(&policy->rwsem))
+ return;
+
+ __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
+
+ gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
+ up_write(&policy->rwsem);
+}
+
+static bool finish_last_request(struct gov_data *gd)
+{
+ ktime_t now = ktime_get();
+
+ if (ktime_after(now, gd->throttle))
+ return false;
+
+ while (1) {
+ int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
+
+ usec_left /= NSEC_PER_USEC;
+ usleep_range(usec_left, usec_left + 100);
+ now = ktime_get();
+ if (ktime_after(now, gd->throttle))
+ return true;
+ }
+}
+
+/*
+ * we pass in struct cpufreq_policy. This is safe because changing out the
+ * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
+ * which tears down all of the data structures and __cpufreq_governor(policy,
+ * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
+ * new policy pointer
+ */
+static int cpufreq_sched_thread(void *data)
+{
+ struct sched_param param;
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+ unsigned int new_request = 0;
+ unsigned int last_request = 0;
+ int ret;
+
+ policy = (struct cpufreq_policy *) data;
+ gd = policy->governor_data;
+
+ param.sched_priority = 50;
+ ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
+ if (ret) {
+ pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
+ do_exit(-EINVAL);
+ } else {
+ pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
+ __func__, gd->task->pid);
+ }
+
+ do {
+ set_current_state(TASK_INTERRUPTIBLE);
+ new_request = gd->requested_freq;
+ if (new_request == last_request) {
+ schedule();
+ } else {
+ /*
+ * if the frequency thread sleeps while waiting to be
+ * unthrottled, start over to check for a newer request
+ */
+ if (finish_last_request(gd))
+ continue;
+ last_request = new_request;
+ cpufreq_sched_try_driver_target(policy, new_request);
+ }
+ } while (!kthread_should_stop());
+
+ return 0;
+}
+
+static void cpufreq_sched_irq_work(struct irq_work *irq_work)
+{
+ struct gov_data *gd;
+
+ gd = container_of(irq_work, struct gov_data, irq_work);
+ if (!gd)
+ return;
+
+ wake_up_process(gd->task);
+}
+
+static void update_fdomain_capacity_request(int cpu)
+{
+ unsigned int freq_new, index_new, cpu_tmp;
+ struct cpufreq_policy *policy;
+ struct gov_data *gd;
+ unsigned long capacity = 0;
+
+ /*
+ * Avoid grabbing the policy if possible. A test is still
+ * required after locking the CPU's policy to avoid racing
+ * with the governor changing.
+ */
+ if (!per_cpu(enabled, cpu))
+ return;
+
+ policy = cpufreq_cpu_get(cpu);
+ if (IS_ERR_OR_NULL(policy))
+ return;
+
+ if (policy->governor != &cpufreq_gov_sched ||
+ !policy->governor_data)
+ goto out;
+
+ gd = policy->governor_data;
+
+ /* find max capacity requested by cpus in this policy */
+ for_each_cpu(cpu_tmp, policy->cpus) {
+ struct sched_capacity_reqs *scr;
+
+ scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
+ capacity = max(capacity, scr->total);
+ }
+
+ /* Convert the new maximum capacity request into a cpu frequency */
+ freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
+ if (cpufreq_frequency_table_target(policy, policy->freq_table,
+ freq_new, CPUFREQ_RELATION_L,
+ &index_new))
+ goto out;
+ freq_new = policy->freq_table[index_new].frequency;
+
+ if (freq_new == gd->requested_freq)
+ goto out;
+
+ gd->requested_freq = freq_new;
+
+ /*
+ * Throttling is not yet supported on platforms with fast cpufreq
+ * drivers.
+ */
+ if (cpufreq_driver_slow)
+ irq_work_queue_on(&gd->irq_work, cpu);
+ else
+ cpufreq_sched_try_driver_target(policy, freq_new);
+
+out:
+ cpufreq_cpu_put(policy);
+}
+
+void update_cpu_capacity_request(int cpu, bool request)
+{
+ unsigned long new_capacity;
+ struct sched_capacity_reqs *scr;
+
+ /* The rq lock serializes access to the CPU's sched_capacity_reqs. */
+ lockdep_assert_held(&cpu_rq(cpu)->lock);
+
+ scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
+
+ new_capacity = scr->cfs + scr->rt;
+ new_capacity = new_capacity * capacity_margin
+ / SCHED_CAPACITY_SCALE;
+ new_capacity += scr->dl;
+
+ if (new_capacity == scr->total)
+ return;
+
+ scr->total = new_capacity;
+ if (request)
+ update_fdomain_capacity_request(cpu);
+}
+
+static inline void set_sched_freq(void)
+{
+ static_key_slow_inc(&__sched_freq);
+}
+
+static inline void clear_sched_freq(void)
+{
+ static_key_slow_dec(&__sched_freq);
+}
+
+static int cpufreq_sched_policy_init(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd;
+ int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ memset(&per_cpu(cpu_sched_capacity_reqs, cpu), 0,
+ sizeof(struct sched_capacity_reqs));
+
+ gd = kzalloc(sizeof(*gd), GFP_KERNEL);
+ if (!gd)
+ return -ENOMEM;
+
+ gd->throttle_nsec = policy->cpuinfo.transition_latency ?
+ policy->cpuinfo.transition_latency :
+ THROTTLE_NSEC;
+ pr_debug("%s: throttle threshold = %u [ns]\n",
+ __func__, gd->throttle_nsec);
+
+ if (cpufreq_driver_is_slow()) {
+ cpufreq_driver_slow = true;
+ gd->task = kthread_create(cpufreq_sched_thread, policy,
+ "kschedfreq:%d",
+ cpumask_first(policy->related_cpus));
+ if (IS_ERR_OR_NULL(gd->task)) {
+ pr_err("%s: failed to create kschedfreq thread\n",
+ __func__);
+ goto err;
+ }
+ get_task_struct(gd->task);
+ kthread_bind_mask(gd->task, policy->related_cpus);
+ wake_up_process(gd->task);
+ init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
+ }
+
+ policy->governor_data = gd;
+ set_sched_freq();
+
+ return 0;
+
+err:
+ kfree(gd);
+ return -ENOMEM;
+}
+
+static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy)
+{
+ struct gov_data *gd = policy->governor_data;
+
+ clear_sched_freq();
+ if (cpufreq_driver_slow) {
+ kthread_stop(gd->task);
+ put_task_struct(gd->task);
+ }
+
+ policy->governor_data = NULL;
+
+ kfree(gd);
+ return 0;
+}
+
+static int cpufreq_sched_start(struct cpufreq_policy *policy)
+{
+ int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ per_cpu(enabled, cpu) = 1;
+
+ return 0;
+}
+
+static int cpufreq_sched_stop(struct cpufreq_policy *policy)
+{
+ int cpu;
+
+ for_each_cpu(cpu, policy->cpus)
+ per_cpu(enabled, cpu) = 0;
+
+ return 0;
+}
+
+static int cpufreq_sched_setup(struct cpufreq_policy *policy,
+ unsigned int event)
+{
+ switch (event) {
+ case CPUFREQ_GOV_POLICY_INIT:
+ return cpufreq_sched_policy_init(policy);
+ case CPUFREQ_GOV_POLICY_EXIT:
+ return cpufreq_sched_policy_exit(policy);
+ case CPUFREQ_GOV_START:
+ return cpufreq_sched_start(policy);
+ case CPUFREQ_GOV_STOP:
+ return cpufreq_sched_stop(policy);
+ case CPUFREQ_GOV_LIMITS:
+ break;
+ }
+ return 0;
+}
+
+#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
+static
+#endif
+struct cpufreq_governor cpufreq_gov_sched = {
+ .name = "sched",
+ .governor = cpufreq_sched_setup,
+ .owner = THIS_MODULE,
+};
+
+static int __init cpufreq_sched_init(void)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpu_possible_mask)
+ per_cpu(enabled, cpu) = 0;
+ return cpufreq_register_governor(&cpufreq_gov_sched);
+}
+
+/* Try to make this the default governor */
+fs_initcall(cpufreq_sched_init);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a5a6b3e..a88dbec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1383,6 +1383,57 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
}
#endif
+#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+extern unsigned int capacity_margin;
+extern struct static_key __sched_freq;
+
+static inline bool sched_freq(void)
+{
+ return static_key_false(&__sched_freq);
+}
+
+DECLARE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
+void update_cpu_capacity_request(int cpu, bool request);
+
+static inline void set_cfs_cpu_capacity(int cpu, bool request,
+ unsigned long capacity)
+{
+ if (per_cpu(cpu_sched_capacity_reqs, cpu).cfs != capacity) {
+ per_cpu(cpu_sched_capacity_reqs, cpu).cfs = capacity;
+ update_cpu_capacity_request(cpu, request);
+ }
+}
+
+static inline void set_rt_cpu_capacity(int cpu, bool request,
+ unsigned long capacity)
+{
+ if (per_cpu(cpu_sched_capacity_reqs, cpu).rt != capacity) {
+ per_cpu(cpu_sched_capacity_reqs, cpu).rt = capacity;
+ update_cpu_capacity_request(cpu, request);
+ }
+}
+
+static inline void set_dl_cpu_capacity(int cpu, bool request,
+ unsigned long capacity)
+{
+ if (per_cpu(cpu_sched_capacity_reqs, cpu).dl != capacity) {
+ per_cpu(cpu_sched_capacity_reqs, cpu).dl = capacity;
+ update_cpu_capacity_request(cpu, request);
+ }
+}
+#else
+static inline bool sched_freq(void) { return false; }
+static inline void set_cfs_cpu_capacity(int cpu, bool request,
+ unsigned long capacity)
+{ }
+static inline void set_rt_cpu_capacity(int cpu, bool request,
+ unsigned long capacity)
+{ }
+static inline void set_dl_cpu_capacity(int cpu, bool request,
+ unsigned long capacity)
+{ }
+#endif
+
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
--
2.4.10
From: Juri Lelli <[email protected]>
Each time a task is {en,de}queued we might need to adapt the current
frequency to the new usage. Add triggers on {en,de}queue_task_fair() for
this purpose. Only trigger a freq request if we are effectively waking up
or going to sleep. Filter out load balancing related calls to reduce the
number of triggers.
[[email protected]: resolve merge conflicts, define task_new,
use renamed static key sched_freq]
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 47 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 95b83c4..904188a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4199,6 +4199,21 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
+static unsigned long capacity_orig_of(int cpu);
+static int cpu_util(int cpu);
+
+static void update_capacity_of(int cpu)
+{
+ unsigned long req_cap;
+
+ if (!sched_freq())
+ return;
+
+ /* Convert scale-invariant utilization into a cpu capacity request. */
+ req_cap = cpu_util(cpu) * SCHED_CAPACITY_SCALE / capacity_orig_of(cpu);
+ set_cfs_cpu_capacity(cpu, true, req_cap);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -4209,6 +4224,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ int task_new = !(flags & ENQUEUE_WAKEUP);
for_each_sched_entity(se) {
if (se->on_rq)
@@ -4240,9 +4256,23 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}
- if (!se)
+ if (!se) {
add_nr_running(rq, 1);
+ /*
+ * We want to potentially trigger a freq switch
+ * request only for tasks that are waking up; this is
+ * because we get here also during load balancing, but
+ * in these cases it seems wise to trigger as single
+ * request after load balancing is done.
+ *
+ * XXX: how about fork()? Do we need a special
+ * flag/something to tell if we are here after a
+ * fork() (wakeup_task_new)?
+ */
+ if (!task_new)
+ update_capacity_of(cpu_of(rq));
+ }
hrtick_update(rq);
}
@@ -4300,9 +4330,24 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_shares(cfs_rq);
}
- if (!se)
+ if (!se) {
sub_nr_running(rq, 1);
+ /*
+ * We want to potentially trigger a freq switch
+ * request only for tasks that are going to sleep;
+ * this is because we get here also during load
+ * balancing, but in these cases it seems wise to
+ * trigger as single request after load balancing is
+ * done.
+ */
+ if (task_sleep) {
+ if (rq->cfs.nr_running)
+ update_capacity_of(cpu_of(rq));
+ else if (sched_freq())
+ set_cfs_cpu_capacity(cpu_of(rq), false, 0);
+ }
+ }
hrtick_update(rq);
}
--
2.4.10
From: Juri Lelli <[email protected]>
Patch "sched/fair: add triggers for OPP change requests" introduced OPP
change triggers for enqueue_task_fair(), but the trigger was operating only
for wakeups. Fact is that it makes sense to consider wakeup_new also (i.e.,
fork()), as we don't know anything about a newly created task and thus we
most certainly want to jump to max OPP to not harm performance too much.
However, it is not currently possible (or at least it wasn't evident to me
how to do so :/) to tell new wakeups from other (non wakeup) operations.
This patch introduces an additional flag in sched.h that is only set at
fork() time and it is then consumed in enqueue_task_fair() for our purpose.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 9 +++------
kernel/sched/sched.h | 1 +
3 files changed, 5 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index aa3f978..4c8c353e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2402,7 +2402,7 @@ void wake_up_new_task(struct task_struct *p)
#endif
rq = __task_rq_lock(p);
- activate_task(rq, p, 0);
+ activate_task(rq, p, ENQUEUE_WAKEUP_NEW);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p);
check_preempt_curr(rq, p, WF_FORK);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904188a..1bfbbb7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4224,7 +4224,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
- int task_new = !(flags & ENQUEUE_WAKEUP);
+ int task_new = flags & ENQUEUE_WAKEUP_NEW;
+ int task_wakeup = flags & ENQUEUE_WAKEUP;
for_each_sched_entity(se) {
if (se->on_rq)
@@ -4265,12 +4266,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
* because we get here also during load balancing, but
* in these cases it seems wise to trigger as single
* request after load balancing is done.
- *
- * XXX: how about fork()? Do we need a special
- * flag/something to tell if we are here after a
- * fork() (wakeup_task_new)?
*/
- if (!task_new)
+ if (task_new || task_wakeup)
update_capacity_of(cpu_of(rq));
}
hrtick_update(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a88dbec..ad82274 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1139,6 +1139,7 @@ extern const u32 sched_prio_to_wmult[40];
#endif
#define ENQUEUE_REPLENISH 0x08
#define ENQUEUE_RESTORE 0x10
+#define ENQUEUE_WAKEUP_NEW 0x20
#define DEQUEUE_SLEEP 0x01
#define DEQUEUE_SAVE 0x02
--
2.4.10
From: Juri Lelli <[email protected]>
As we don't trigger frequency changes from {en,de}queue_task_fair() during
load balancing, we need to do so explicitly on load-balance paths.
[[email protected]: move update_capacity_of calls so rq lock is held]
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/fair.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1bfbbb7..880ceee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6023,6 +6023,10 @@ static void attach_one_task(struct rq *rq, struct task_struct *p)
{
raw_spin_lock(&rq->lock);
attach_task(rq, p);
+ /*
+ * We want to potentially raise target_cpu's OPP.
+ */
+ update_capacity_of(cpu_of(rq));
raw_spin_unlock(&rq->lock);
}
@@ -6044,6 +6048,11 @@ static void attach_tasks(struct lb_env *env)
attach_task(env->dst_rq, p);
}
+ /*
+ * We want to potentially raise env.dst_cpu's OPP.
+ */
+ update_capacity_of(env->dst_cpu);
+
raw_spin_unlock(&env->dst_rq->lock);
}
@@ -7183,6 +7192,11 @@ more_balance:
* ld_moved - cumulative load moved across iterations
*/
cur_ld_moved = detach_tasks(&env);
+ /*
+ * We want to potentially lower env.src_cpu's OPP.
+ */
+ if (cur_ld_moved)
+ update_capacity_of(env.src_cpu);
/*
* We've detached some tasks from busiest_rq. Every
@@ -7547,8 +7561,13 @@ static int active_load_balance_cpu_stop(void *data)
schedstat_inc(sd, alb_count);
p = detach_one_task(&env);
- if (p)
+ if (p) {
schedstat_inc(sd, alb_pushed);
+ /*
+ * We want to potentially lower env.src_cpu's OPP.
+ */
+ update_capacity_of(env.src_cpu);
+ }
else
schedstat_inc(sd, alb_failed);
}
--
2.4.10
Since the true utilization of a long-running task is not detectable
while it is running and might be bigger than the current CPU capacity,
create maximum CPU capacity headroom by requesting the maximum
CPU capacity once the CPU usage plus the capacity margin exceeds the
current capacity. This also minimizes the performance impact on such
a task.
Original fair-class only version authored by Juri Lelli
<[email protected]>.
cc: Ingo Molnar <[email protected]>
cc: Peter Zijlstra <[email protected]>
Signed-off-by: Juri Lelli <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/core.c | 41 ++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 57 --------------------------------------------------
kernel/sched/sched.h | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 100 insertions(+), 57 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4c8c353e..3f4d907 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2869,6 +2869,45 @@ unsigned long long task_sched_runtime(struct task_struct *p)
return ns;
}
+#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+static unsigned long sum_capacity_reqs(unsigned long cfs_cap,
+ struct sched_capacity_reqs *scr)
+{
+ unsigned long total = cfs_cap + scr->rt;
+
+ total = total * capacity_margin;
+ total /= SCHED_CAPACITY_SCALE;
+ total += scr->dl;
+ return total;
+}
+
+static void sched_freq_tick(int cpu)
+{
+ struct sched_capacity_reqs *scr;
+ unsigned long capacity_orig, capacity_curr;
+
+ if (!sched_freq())
+ return;
+
+ capacity_orig = capacity_orig_of(cpu);
+ capacity_curr = capacity_curr_of(cpu);
+ if (capacity_curr == capacity_orig)
+ return;
+
+ /*
+ * To make free room for a task that is building up its "real"
+ * utilization and to harm its performance the least, request
+ * a jump to max OPP as soon as the margin of free capacity is
+ * impacted (specified by capacity_margin).
+ */
+ scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
+ if (capacity_curr < sum_capacity_reqs(cpu_util(cpu), scr))
+ set_cfs_cpu_capacity(cpu, true, capacity_max);
+}
+#else
+static inline void sched_freq_tick(int cpu) { }
+#endif
+
/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
@@ -2895,6 +2934,8 @@ void scheduler_tick(void)
trigger_load_balance(rq);
#endif
rq_last_tick_reset(rq);
+
+ sched_freq_tick(cpu);
}
#ifdef CONFIG_NO_HZ_FULL
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 880ceee..4c49f76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4199,9 +4199,6 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
-static unsigned long capacity_orig_of(int cpu);
-static int cpu_util(int cpu);
-
static void update_capacity_of(int cpu)
{
unsigned long req_cap;
@@ -4601,15 +4598,6 @@ static unsigned long target_load(int cpu, int type)
return max(rq->cpu_load[type-1], total);
}
-static unsigned long capacity_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity;
-}
-
-static unsigned long capacity_orig_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity_orig;
-}
static unsigned long cpu_avg_load_per_task(int cpu)
{
@@ -4779,17 +4767,6 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
#endif
/*
- * Returns the current capacity of cpu after applying both
- * cpu and freq scaling.
- */
-static unsigned long capacity_curr_of(int cpu)
-{
- return cpu_rq(cpu)->cpu_capacity_orig *
- arch_scale_freq_capacity(NULL, cpu)
- >> SCHED_CAPACITY_SHIFT;
-}
-
-/*
* Detect M:N waker/wakee relationships via a switching-frequency heuristic.
* A waker of many should wake a different task than the one last awakened
* at a frequency roughly N times higher than one of its wakees. In order
@@ -5033,40 +5010,6 @@ done:
}
/*
- * cpu_util returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must be the one of capacity so we can
- * compare the utilization with the capacity of the CPU that is available for
- * CFS task (ie cpu_capacity).
- *
- * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
- * recent utilization of currently non-runnable tasks on a CPU. It represents
- * the amount of utilization of a CPU in the range [0..capacity_orig] where
- * capacity_orig is the cpu_capacity available at the highest frequency
- * (arch_scale_freq_capacity()).
- * The utilization of a CPU converges towards a sum equal to or less than the
- * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
- * the running time on this CPU scaled by capacity_curr.
- *
- * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
- * higher than capacity_orig because of unfortunate rounding in
- * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
- * the average stabilizes with the new running time. We need to check that the
- * utilization stays within the range of [0..capacity_orig] and cap it if
- * necessary. Without utilization capping, a group could be seen as overloaded
- * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
- * available capacity. We allow utilization to overshoot capacity_curr (but not
- * capacity_orig) as it useful for predicting the capacity required after task
- * migrations (scheduler-driven DVFS).
- */
-static int cpu_util(int cpu)
-{
- unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
- unsigned long capacity = capacity_orig_of(cpu);
-
- return (util >= capacity) ? capacity : util;
-}
-
-/*
* select_task_rq_fair: Select target runqueue for the waking task in domains
* that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
* SD_BALANCE_FORK, or SD_BALANCE_EXEC.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ad82274..90d5df6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1384,7 +1384,66 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
}
#endif
+#ifdef CONFIG_SMP
+static inline unsigned long capacity_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity;
+}
+
+static inline unsigned long capacity_orig_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
+/*
+ * cpu_util returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must be the one of capacity so we can
+ * compare the utilization with the capacity of the CPU that is available for
+ * CFS task (ie cpu_capacity).
+ *
+ * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * recent utilization of currently non-runnable tasks on a CPU. It represents
+ * the amount of utilization of a CPU in the range [0..capacity_orig] where
+ * capacity_orig is the cpu_capacity available at the highest frequency
+ * (arch_scale_freq_capacity()).
+ * The utilization of a CPU converges towards a sum equal to or less than the
+ * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
+ * the running time on this CPU scaled by capacity_curr.
+ *
+ * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * higher than capacity_orig because of unfortunate rounding in
+ * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * the average stabilizes with the new running time. We need to check that the
+ * utilization stays within the range of [0..capacity_orig] and cap it if
+ * necessary. Without utilization capping, a group could be seen as overloaded
+ * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity. We allow utilization to overshoot capacity_curr (but not
+ * capacity_orig) as it useful for predicting the capacity required after task
+ * migrations (scheduler-driven DVFS).
+ */
+static inline int cpu_util(int cpu)
+{
+ unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
+ unsigned long capacity = capacity_orig_of(cpu);
+
+ return (util >= capacity) ? capacity : util;
+}
+
+/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static inline unsigned long capacity_curr_of(int cpu)
+{
+ return cpu_rq(cpu)->cpu_capacity_orig *
+ arch_scale_freq_capacity(NULL, cpu)
+ >> SCHED_CAPACITY_SHIFT;
+}
+
+#endif
+
#ifdef CONFIG_CPU_FREQ_GOV_SCHED
+#define capacity_max SCHED_CAPACITY_SCALE
extern unsigned int capacity_margin;
extern struct static_key __sched_freq;
--
2.4.10
From: Vincent Guittot <[email protected]>
rt_avg is only used to scale the CPU capacity available to CFS
tasks. As the update of this scaling is done during periodic load
balance, we only have to ensure that sched_avg_update has been called
before any periodic load balancing. This requirement is already
fulfilled by __update_cpu_load, so the call in sched_rt_avg_update,
which is part of the hotpath, is useless.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/sched.h | 1 -
1 file changed, 1 deletion(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 90d5df6..08858d1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1497,7 +1497,6 @@ static inline void set_dl_cpu_capacity(int cpu, bool request,
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
{
rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
- sched_avg_update(rq);
}
#else
static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta) { }
--
2.4.10
From: Vincent Guittot <[email protected]>
Instead of monitoring the exec time of deadline tasks to evaluate the
CPU capacity consumed by the deadline scheduler class, we can
calculate it directly from the sum of the utilizations of the deadline
tasks on the CPU. We can then remove deadline tasks from the rt_avg
metric and use the average bandwidth of the deadline scheduler
directly in scale_rt_capacity.
Based in part on a similar patch from Luca Abeni <[email protected]>.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++--
kernel/sched/fair.c | 8 ++++++++
kernel/sched/sched.h | 2 ++
3 files changed, 41 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 8b0a15e..9d9eb50 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -43,6 +43,24 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
return !RB_EMPTY_NODE(&dl_se->rb_node);
}
+static void add_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ u64 se_bw = dl_se->dl_bw;
+
+ dl_rq->avg_bw += se_bw;
+}
+
+static void clear_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
+{
+ u64 se_bw = dl_se->dl_bw;
+
+ dl_rq->avg_bw -= se_bw;
+ if (dl_rq->avg_bw < 0) {
+ WARN_ON(1);
+ dl_rq->avg_bw = 0;
+ }
+}
+
static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
{
struct sched_dl_entity *dl_se = &p->dl;
@@ -494,6 +512,9 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
struct rq *rq = rq_of_dl_rq(dl_rq);
+ if (dl_se->dl_new)
+ add_average_bw(dl_se, dl_rq);
+
/*
* The arrival of a new instance needs special treatment, i.e.,
* the actual scheduling parameters have to be "renewed".
@@ -741,8 +762,6 @@ static void update_curr_dl(struct rq *rq)
curr->se.exec_start = rq_clock_task(rq);
cpuacct_charge(curr, delta_exec);
- sched_rt_avg_update(rq, delta_exec);
-
dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
if (dl_runtime_exceeded(dl_se)) {
dl_se->dl_throttled = 1;
@@ -1241,6 +1260,8 @@ static void task_fork_dl(struct task_struct *p)
static void task_dead_dl(struct task_struct *p)
{
struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
+ struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
+ struct rq *rq = rq_of_dl_rq(dl_rq);
/*
* Since we are TASK_DEAD we won't slip out of the domain!
@@ -1249,6 +1270,8 @@ static void task_dead_dl(struct task_struct *p)
/* XXX we should retain the bw until 0-lag */
dl_b->total_bw -= p->dl.dl_bw;
raw_spin_unlock_irq(&dl_b->lock);
+
+ clear_average_bw(&p->dl, &rq->dl);
}
static void set_curr_task_dl(struct rq *rq)
@@ -1556,7 +1579,9 @@ retry:
}
deactivate_task(rq, next_task, 0);
+ clear_average_bw(&next_task->dl, &rq->dl);
set_task_cpu(next_task, later_rq->cpu);
+ add_average_bw(&next_task->dl, &later_rq->dl);
activate_task(later_rq, next_task, 0);
ret = 1;
@@ -1644,7 +1669,9 @@ static void pull_dl_task(struct rq *this_rq)
resched = true;
deactivate_task(src_rq, p, 0);
+ clear_average_bw(&p->dl, &src_rq->dl);
set_task_cpu(p, this_cpu);
+ add_average_bw(&p->dl, &this_rq->dl);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
@@ -1750,6 +1777,8 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
if (!start_dl_timer(p))
__dl_clear_params(p);
+ clear_average_bw(&p->dl, &rq->dl);
+
/*
* Since this might be the only -deadline task on the rq,
* this is the right place to try to pull some other one
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c49f76..ce05f61 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6203,6 +6203,14 @@ static unsigned long scale_rt_capacity(int cpu)
used = div_u64(avg, total);
+ /*
+ * deadline bandwidth is defined at system level so we must
+ * weight this bandwidth with the max capacity of the system.
+ * As a reminder, avg_bw is 20 bits wide and
+ * scale_cpu_capacity is 10 bits wide
+ */
+ used += div_u64(rq->dl.avg_bw, arch_scale_cpu_capacity(NULL, cpu));
+
if (likely(used < SCHED_CAPACITY_SCALE))
return SCHED_CAPACITY_SCALE - used;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 08858d1..e44c6be 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -519,6 +519,8 @@ struct dl_rq {
#else
struct dl_bw dl_bw;
#endif
+ /* This is the "average utilization" for this runqueue */
+ s64 avg_bw;
};
#ifdef CONFIG_SMP
--
2.4.10
From: Vincent Guittot <[email protected]>
RT tasks don't provide any running constraints like deadline tasks do,
apart from their priority. The only currently usable input for
estimating the capacity needed by RT tasks is the rt_avg metric. We
use it to estimate the CPU capacity needed by the RT scheduler class.
In order to monitor the evolution of RT task load, we must
periodically check it during the tick.
We then use the capacity estimated for the last activity as an
estimate for the next one; this cannot be very accurate, but it is a
good starting point and has no impact on the wakeup path of RT tasks.
Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Steve Muckle <[email protected]>
---
kernel/sched/rt.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 48 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8ec86ab..9694204 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1426,6 +1426,41 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
#endif
}
+#ifdef CONFIG_SMP
+static void sched_rt_update_capacity_req(struct rq *rq)
+{
+ u64 total, used, age_stamp, avg;
+ s64 delta;
+
+ if (!sched_freq())
+ return;
+
+ sched_avg_update(rq);
+ /*
+ * Since we're reading these variables without serialization make sure
+ * we read them once before doing sanity checks on them.
+ */
+ age_stamp = READ_ONCE(rq->age_stamp);
+ avg = READ_ONCE(rq->rt_avg);
+ delta = rq_clock(rq) - age_stamp;
+
+ if (unlikely(delta < 0))
+ delta = 0;
+
+ total = sched_avg_period() + delta;
+
+ used = div_u64(avg, total);
+ if (unlikely(used > SCHED_CAPACITY_SCALE))
+ used = SCHED_CAPACITY_SCALE;
+
+ set_rt_cpu_capacity(rq->cpu, 1, (unsigned long)(used));
+}
+#else
+static inline void sched_rt_update_capacity_req(struct rq *rq)
+{ }
+
+#endif
+
static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
struct rt_rq *rt_rq)
{
@@ -1494,8 +1529,17 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
if (prev->sched_class == &rt_sched_class)
update_curr_rt(rq);
- if (!rt_rq->rt_queued)
+ if (!rt_rq->rt_queued) {
+ /*
+ * The next task to be picked on this rq will have a lower
+ * priority than rt tasks so we can spend some time to update
+ * the capacity used by rt tasks based on the last activity.
+ * This value will then be used as an estimate of the next
+ * activity.
+ */
+ sched_rt_update_capacity_req(rq);
return NULL;
+ }
put_prev_task(rq, prev);
@@ -2212,6 +2256,9 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
update_curr_rt(rq);
+ if (rq->rt.rt_nr_running)
+ sched_rt_update_capacity_req(rq);
+
watchdog(rq, p);
/*
--
2.4.10
adding Lucas
On 9 December 2015 at 07:19, Steve Muckle <[email protected]> wrote:
> From: Vincent Guittot <[email protected]>
>
> Instead of monitoring the exec time of deadline tasks to evaluate the
> CPU capacity consumed by deadline scheduler class, we can directly
> calculate it thanks to the sum of utilization of deadline tasks on the
> CPU. We can remove deadline tasks from rt_avg metric and directly use
> the average bandwidth of deadline scheduler in scale_rt_capacity.
>
> Based in part on a similar patch from Luca Abeni <[email protected]>.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Signed-off-by: Steve Muckle <[email protected]>
> ---
> kernel/sched/deadline.c | 33 +++++++++++++++++++++++++++++++--
> kernel/sched/fair.c | 8 ++++++++
> kernel/sched/sched.h | 2 ++
> 3 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 8b0a15e..9d9eb50 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -43,6 +43,24 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
> return !RB_EMPTY_NODE(&dl_se->rb_node);
> }
>
> +static void add_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + u64 se_bw = dl_se->dl_bw;
> +
> + dl_rq->avg_bw += se_bw;
> +}
> +
> +static void clear_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + u64 se_bw = dl_se->dl_bw;
> +
> + dl_rq->avg_bw -= se_bw;
> + if (dl_rq->avg_bw < 0) {
> + WARN_ON(1);
> + dl_rq->avg_bw = 0;
> + }
> +}
> +
> static inline int is_leftmost(struct task_struct *p, struct dl_rq *dl_rq)
> {
> struct sched_dl_entity *dl_se = &p->dl;
> @@ -494,6 +512,9 @@ static void update_dl_entity(struct sched_dl_entity *dl_se,
> struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
> struct rq *rq = rq_of_dl_rq(dl_rq);
>
> + if (dl_se->dl_new)
> + add_average_bw(dl_se, dl_rq);
> +
> /*
> * The arrival of a new instance needs special treatment, i.e.,
> * the actual scheduling parameters have to be "renewed".
> @@ -741,8 +762,6 @@ static void update_curr_dl(struct rq *rq)
> curr->se.exec_start = rq_clock_task(rq);
> cpuacct_charge(curr, delta_exec);
>
> - sched_rt_avg_update(rq, delta_exec);
> -
> dl_se->runtime -= dl_se->dl_yielded ? 0 : delta_exec;
> if (dl_runtime_exceeded(dl_se)) {
> dl_se->dl_throttled = 1;
> @@ -1241,6 +1260,8 @@ static void task_fork_dl(struct task_struct *p)
> static void task_dead_dl(struct task_struct *p)
> {
> struct dl_bw *dl_b = dl_bw_of(task_cpu(p));
> + struct dl_rq *dl_rq = dl_rq_of_se(&p->dl);
> + struct rq *rq = rq_of_dl_rq(dl_rq);
>
> /*
> * Since we are TASK_DEAD we won't slip out of the domain!
> @@ -1249,6 +1270,8 @@ static void task_dead_dl(struct task_struct *p)
> /* XXX we should retain the bw until 0-lag */
> dl_b->total_bw -= p->dl.dl_bw;
> raw_spin_unlock_irq(&dl_b->lock);
> +
> + clear_average_bw(&p->dl, &rq->dl);
> }
>
> static void set_curr_task_dl(struct rq *rq)
> @@ -1556,7 +1579,9 @@ retry:
> }
>
> deactivate_task(rq, next_task, 0);
> + clear_average_bw(&next_task->dl, &rq->dl);
> set_task_cpu(next_task, later_rq->cpu);
> + add_average_bw(&next_task->dl, &later_rq->dl);
> activate_task(later_rq, next_task, 0);
> ret = 1;
>
> @@ -1644,7 +1669,9 @@ static void pull_dl_task(struct rq *this_rq)
> resched = true;
>
> deactivate_task(src_rq, p, 0);
> + clear_average_bw(&p->dl, &src_rq->dl);
> set_task_cpu(p, this_cpu);
> + add_average_bw(&p->dl, &this_rq->dl);
> activate_task(this_rq, p, 0);
> dmin = p->dl.deadline;
>
> @@ -1750,6 +1777,8 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
> if (!start_dl_timer(p))
> __dl_clear_params(p);
>
> + clear_average_bw(&p->dl, &rq->dl);
> +
> /*
> * Since this might be the only -deadline task on the rq,
> * this is the right place to try to pull some other one
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4c49f76..ce05f61 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6203,6 +6203,14 @@ static unsigned long scale_rt_capacity(int cpu)
>
> used = div_u64(avg, total);
>
> + /*
> + * deadline bandwidth is defined at system level so we must
> + * weight this bandwidth with the max capacity of the system.
> + * As a reminder, avg_bw is 20bits width and
> + * scale_cpu_capacity is 10 bits width
> + */
> + used += div_u64(rq->dl.avg_bw, arch_scale_cpu_capacity(NULL, cpu));
> +
> if (likely(used < SCHED_CAPACITY_SCALE))
> return SCHED_CAPACITY_SCALE - used;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 08858d1..e44c6be 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -519,6 +519,8 @@ struct dl_rq {
> #else
> struct dl_bw dl_bw;
> #endif
> + /* This is the "average utilization" for this runqueue */
> + s64 avg_bw;
> };
>
> #ifdef CONFIG_SMP
> --
> 2.4.10
>
Hi Vincent,
first of all, thanks for adding me in the discussion.
On 12/09/2015 09:50 AM, Vincent Guittot wrote:
> adding Lucas
>
> On 9 December 2015 at 07:19, Steve Muckle <[email protected]> wrote:
>> From: Vincent Guittot <[email protected]>
>>
>> Instead of monitoring the exec time of deadline tasks to evaluate the
>> CPU capacity consumed by deadline scheduler class, we can directly
>> calculate it thanks to the sum of utilization of deadline tasks on the
>> CPU. We can remove deadline tasks from rt_avg metric and directly use
>> the average bandwidth of deadline scheduler in scale_rt_capacity.
>>
>> Based in part on a similar patch from Luca Abeni <[email protected]>.
Just to check if my understanding of your patch is correct: what you do is
to track the total utilisation of the tasks that are assigned to a CPU/core,
independently from their state (active or inactive). The difference with my
patch is that I try to track the "active utilisation" (eliminating the utilisation
of the tasks that are blocked).
Is this understanding correct?
If yes, I think your approach is safe (and easier to implement - modulo a small
issue when a task terminates or switches to another scheduling policy; I think
there already are some "XXX" comments in the current code). However, it saves
less energy (or reclaims less CPU time). For example, if I create a SCHED_DEADLINE
task with runtime 90ms and period 100ms it will not allow scaling the CPU frequency
even if the task never executes (because it is always blocked).
[...]
>> + /* This is the "average utilization" for this runqueue */
>> + s64 avg_bw;
>> };
Small nit: why "average" utilization? I think a better name would be "runqueue utilization"
or "local utilization", or something similar... If I understand correctly (sorry if I
missed something), this is not an average, but the sum of the utilisations of the tasks
on this runqueue... No?
Luca
On 10 December 2015 at 14:27, Luca Abeni <[email protected]> wrote:
> Hi Vincent,
>
> first of all, thanks for adding me in the discussion.
>
> On 12/09/2015 09:50 AM, Vincent Guittot wrote:
>>
>> adding Lucas
>>
>> On 9 December 2015 at 07:19, Steve Muckle <[email protected]> wrote:
>>>
>>> From: Vincent Guittot <[email protected]>
>>>
>>> Instead of monitoring the exec time of deadline tasks to evaluate the
>>> CPU capacity consumed by deadline scheduler class, we can directly
>>> calculate it thanks to the sum of utilization of deadline tasks on the
>>> CPU. We can remove deadline tasks from rt_avg metric and directly use
>>> the average bandwidth of deadline scheduler in scale_rt_capacity.
>>>
>>> Based in part on a similar patch from Luca Abeni <[email protected]>.
>
> Just to check if my understanding of your patch is correct: what you do is
> to track the total utilisation of the tasks that are assigned to a CPU/core,
> independently from their state (active or inactive). The difference with my
> patch is that I try to track the "active utilisation" (eliminating the
> utilisation
> of the tasks that are blocked).
>
> Is this understanding correct?
Yes, I want to know the average utilization of the CPU/core by the
deadline scheduler.
> If yes, I think your approach is safe (and easier to implement - modulo a
> small
> issue when a task terminates of switches to other scheduling policies; I
> think
> there already are some "XXX" comments in the current code). However, it
> allows to
> save less energy (or reclaim less CPU time). For example, if I create a
> SCHED_DEADLINE
> task with runtime 90ms and period 100ms it will not allow to scale the CPU
> frequency
> even if it never executes (because is always blocked).
Yes, I agree. If the task's behavior is not aligned with its deadline
parameters, we will over-provision CPU capacity to the deadline
scheduler.
This metric is not used to select the OPP but to provision some CPU
capacity to the deadline scheduler and to inform the CFS scheduler of
the remaining capacity.
So the main side effect of an always blocked deadline task will be to
decrease the rq->cpu_capacity that is then used by the CFS scheduler.
>
>
> [...]
>>>
>>> + /* This is the "average utilization" for this runqueue */
>>> + s64 avg_bw;
>>> };
>
> Small nit: why "average" utilization? I think a better name would be
> "runqueue utilization"
> or "local utilization", or something similar... If I understand correctly
> (sorry if I
> missed something), this is not an average, but the sum of the utilisations
> of the tasks
> on this runqueue... No?
I have used "average" because it doesn't reflect the active/actual
utilization of the run-queue but the forecasted average bandwidth of
the CPU that will be used by deadline tasks. I'm open to changing the
name if another one makes more sense.
Regards,
Vincent
>
>
>
> Luca
Hi Vincent,
On 12/10/2015 05:11 PM, Vincent Guittot wrote:
[...]
>> If yes, I think your approach is safe (and easier to implement - modulo a
>> small
>> issue when a task terminates of switches to other scheduling policies; I
>> think
>> there already are some "XXX" comments in the current code). However, it
>> allows to
>> save less energy (or reclaim less CPU time). For example, if I create a
>> SCHED_DEADLINE
>> task with runtime 90ms and period 100ms it will not allow to scale the CPU
>> frequency
>> even if it never executes (because is always blocked).
>
> Yes, i agree. If the task behavior is not aligned with its deadline
> parameters, we will over provisioned CPU capacity to the deadline
> scheduler.
> This metric is not used to select the OPP but to provisioned some CPU
> capacity to the deadline scheduler and to inform the CFS scheduler of
> the remaining capacity.
Ah, sorry; I missed this point.
>>>> + /* This is the "average utilization" for this runqueue */
>>>> + s64 avg_bw;
>>>> };
>>
>> Small nit: why "average" utilization? I think a better name would be
>> "runqueue utilization"
>> or "local utilization", or something similar... If I understand correctly
>> (sorry if I
>> missed something), this is not an average, but the sum of the utilisations
>> of the tasks
>> on this runqueue... No?
>
> I have used "average" because it doesn't reflect the active/actual
> utilization of the run-queue but the forecasted average bandwidth of
> the CPU that will be used by deadline task.
Well, this is just nitpicking, so feel free to ignore (I just mentioned
this point because I was initially confused by the "average" name). But I
think this is "maximum", or "worst-case", not "average", because (as far
as I can understand) this field indicates that SCHED_DEADLINE tasks will
not be able to consume more than this fraction of CPU (if they try to
consume more, the scheduler throttles them).
> I'm open to change the name if another one makes more sense
In the real-time literature this is often called simply "utilization" (or
"worst-case utilization" by some): when a task can have a variable execution
time, its utilization is defined as WCET (maximum execution time) / period.
Luca
Hi Steve,
On 08/12/15 22:19, Steve Muckle wrote:
> From: Michael Turquette <[email protected]>
>
> Scheduler-driven CPU frequency selection hopes to exploit both
> per-task and global information in the scheduler to improve frequency
> selection policy, achieving lower power consumption, improved
> responsiveness/performance, and less reliance on heuristics and
> tunables. For further discussion on the motivation of this integration
> see [0].
>
> This patch implements a shim layer between the Linux scheduler and the
> cpufreq subsystem. The interface accepts capacity requests from the
> CFS, RT and deadline sched classes. The requests from each sched class
> are summed on each CPU with a margin applied to the CFS and RT
> capacity requests to provide some headroom. Deadline requests are
> expected to be precise enough given their nature to not require
> headroom. The maximum total capacity request for a CPU in a frequency
> domain drives the requested frequency for that domain.
>
> Policy is determined by both the sched classes and this shim layer.
>
> Note that this algorithm is event-driven. There is no polling loop to
> check cpu idle time nor any other method which is unsynchronized with
> the scheduler, aside from a throttling mechanism to ensure frequency
> changes are not attempted faster than the hardware can accommodate them.
>
> Thanks to Juri Lelli <[email protected]> for contributing design ideas,
> code and test results, and to Ricky Liang <[email protected]>
> for initialization and static key inc/dec fixes.
>
> [0] http://article.gmane.org/gmane.linux.kernel/1499836
>
> [[email protected]: various additions and fixes, revised commit text]
>
> CC: Ricky Liang <[email protected]>
> Signed-off-by: Michael Turquette <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Steve Muckle <[email protected]>
> ---
> drivers/cpufreq/Kconfig | 20 +++
> include/linux/cpufreq.h | 3 +
> include/linux/sched.h | 8 +
> kernel/sched/Makefile | 1 +
> kernel/sched/cpufreq_sched.c | 364 +++++++++++++++++++++++++++++++++++++++++++
> kernel/sched/sched.h | 51 ++++++
> 6 files changed, 447 insertions(+)
> create mode 100644 kernel/sched/cpufreq_sched.c
>
> diff --git a/drivers/cpufreq/Kconfig b/drivers/cpufreq/Kconfig
> index 659879a..6f2e96c 100644
> --- a/drivers/cpufreq/Kconfig
> +++ b/drivers/cpufreq/Kconfig
> @@ -102,6 +102,14 @@ config CPU_FREQ_DEFAULT_GOV_CONSERVATIVE
> Be aware that not all cpufreq drivers support the conservative
> governor. If unsure have a look at the help section of the
> driver. Fallback governor will be the performance governor.
> +
> +config CPU_FREQ_DEFAULT_GOV_SCHED
> + bool "sched"
> + select CPU_FREQ_GOV_SCHED
> + help
> + Use the CPUfreq governor 'sched' as default. This scales
> + cpu frequency using CPU utilization estimates from the
> + scheduler.
> endchoice
>
> config CPU_FREQ_GOV_PERFORMANCE
> @@ -183,6 +191,18 @@ config CPU_FREQ_GOV_CONSERVATIVE
>
> If in doubt, say N.
>
> +config CPU_FREQ_GOV_SCHED
> + bool "'sched' cpufreq governor"
> + depends on CPU_FREQ
We depend on IRQ_WORK as well, which in turn I think depends on SMP. As
briefly discussed with Peter on IRC, we might want to use
smp_call_function_single_async() instead to break this dependency
chain (and be able to use this governor on UP as well).
> + select CPU_FREQ_GOV_COMMON
> + help
> + 'sched' - this governor scales cpu frequency from the
> + scheduler as a function of cpu capacity utilization. It does
> + not evaluate utilization on a periodic basis (as ondemand
> + does) but instead is event-driven by the scheduler.
> +
> + If in doubt, say N.
> +
> comment "CPU frequency scaling drivers"
>
> config CPUFREQ_DT
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index 7f8c63d..7e4bde1 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -495,6 +495,9 @@ extern struct cpufreq_governor cpufreq_gov_ondemand;
> #elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE)
> extern struct cpufreq_governor cpufreq_gov_conservative;
> #define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_conservative)
> +#elif defined(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED)
> +extern struct cpufreq_governor cpufreq_gov_sched;
> +#define CPUFREQ_DEFAULT_GOVERNOR (&cpufreq_gov_sched)
> #endif
>
> /*********************************************************************
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 3b0de68..d910a31 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -927,6 +927,14 @@ enum cpu_idle_type {
> #define SCHED_CAPACITY_SHIFT 10
> #define SCHED_CAPACITY_SCALE (1L << SCHED_CAPACITY_SHIFT)
>
> +struct sched_capacity_reqs {
> + unsigned long cfs;
> + unsigned long rt;
> + unsigned long dl;
> +
> + unsigned long total;
> +};
> +
> /*
> * Wake-queues are lists of tasks with a pending wakeup, whose
> * callers have already marked the task as woken internally,
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..90ed832 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
> obj-$(CONFIG_SCHEDSTATS) += stats.o
> obj-$(CONFIG_SCHED_DEBUG) += debug.o
> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_FREQ_GOV_SCHED) += cpufreq_sched.o
> diff --git a/kernel/sched/cpufreq_sched.c b/kernel/sched/cpufreq_sched.c
> new file mode 100644
> index 0000000..af8b5bc
> --- /dev/null
> +++ b/kernel/sched/cpufreq_sched.c
> @@ -0,0 +1,364 @@
> +/*
> + * Copyright (C) 2015 Michael Turquette <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/cpufreq.h>
> +#include <linux/module.h>
> +#include <linux/kthread.h>
> +#include <linux/percpu.h>
> +#include <linux/irq_work.h>
> +#include <linux/delay.h>
> +#include <linux/string.h>
> +
> +#include "sched.h"
> +
> +#define THROTTLE_NSEC 50000000 /* 50ms default */
> +
> +struct static_key __read_mostly __sched_freq = STATIC_KEY_INIT_FALSE;
> +static bool __read_mostly cpufreq_driver_slow;
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static struct cpufreq_governor cpufreq_gov_sched;
> +#endif
> +
> +/*
> + * Capacity margin added to CFS and RT capacity requests to provide
> + * some head room if task utilization further increases.
> + */
> +unsigned int capacity_margin = 1280;
> +
> +static DEFINE_PER_CPU(unsigned long, enabled);
> +DEFINE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
> +
> +/**
> + * gov_data - per-policy data internal to the governor
> + * @throttle: next throttling period expiry. Derived from throttle_nsec
> + * @throttle_nsec: throttle period length in nanoseconds
> + * @task: worker thread for dvfs transition that may block/sleep
> + * @irq_work: callback used to wake up worker thread
> + * @requested_freq: last frequency requested by the sched governor
> + *
> + * struct gov_data is the per-policy cpufreq_sched-specific data structure. A
> + * per-policy instance of it is created when the cpufreq_sched governor receives
> + * the CPUFREQ_GOV_START condition and a pointer to it exists in the gov_data
> + * member of struct cpufreq_policy.
> + *
> + * Readers of this data must call down_read(policy->rwsem). Writers must
> + * call down_write(policy->rwsem).
> + */
> +struct gov_data {
> + ktime_t throttle;
> + unsigned int throttle_nsec;
> + struct task_struct *task;
> + struct irq_work irq_work;
> + unsigned int requested_freq;
> +};
> +
> +static void cpufreq_sched_try_driver_target(struct cpufreq_policy *policy,
> + unsigned int freq)
> +{
> + struct gov_data *gd = policy->governor_data;
> +
> + /* avoid race with cpufreq_sched_stop */
> + if (!down_write_trylock(&policy->rwsem))
> + return;
> +
> + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> +
> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
As I think you proposed at Connect, we could use post frequency
transition notifiers to implement throttling. Is this something that you
already tried implementing/planning to experiment with?
> + up_write(&policy->rwsem);
> +}
> +
> +static bool finish_last_request(struct gov_data *gd)
> +{
> + ktime_t now = ktime_get();
> +
> + if (ktime_after(now, gd->throttle))
> + return false;
> +
> + while (1) {
> + int usec_left = ktime_to_ns(ktime_sub(gd->throttle, now));
> +
> + usec_left /= NSEC_PER_USEC;
> + usleep_range(usec_left, usec_left + 100);
> + now = ktime_get();
> + if (ktime_after(now, gd->throttle))
> + return true;
> + }
> +}
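For what it's worth, the throttle window that finish_last_request() waits out can be modeled outside the kernel like this (a sketch, not the actual implementation; plain floats stand in for ktime_t):

```python
THROTTLE_SEC = 0.05  # mirrors the 50ms THROTTLE_NSEC default

class Throttle:
    def __init__(self):
        self.expiry = 0.0

    def record_transition(self, now):
        # After a frequency change, no further change is attempted
        # until the throttle window expires.
        self.expiry = now + THROTTLE_SEC

    def must_wait(self, now):
        # Mirrors finish_last_request(): remaining wait time,
        # or 0 if the window has already expired.
        return max(0.0, self.expiry - now)

t = Throttle()
t.record_transition(10.0)
print(t.must_wait(10.02))  # ~0.03s left in the window
print(t.must_wait(10.06))  # 0.0 -- unthrottled
```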
> +
> +/*
> + * we pass in struct cpufreq_policy. This is safe because changing out the
> + * policy requires a call to __cpufreq_governor(policy, CPUFREQ_GOV_STOP),
> + * which tears down all of the data structures and __cpufreq_governor(policy,
> + * CPUFREQ_GOV_START) will do a full rebuild, including this kthread with the
> + * new policy pointer
> + */
> +static int cpufreq_sched_thread(void *data)
> +{
> + struct sched_param param;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + unsigned int new_request = 0;
> + unsigned int last_request = 0;
> + int ret;
> +
> + policy = (struct cpufreq_policy *) data;
> + gd = policy->governor_data;
> +
> + param.sched_priority = 50;
> +	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> + if (ret) {
> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> + do_exit(-EINVAL);
> + } else {
> + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> + __func__, gd->task->pid);
> + }
> +
> + do {
> + set_current_state(TASK_INTERRUPTIBLE);
> + new_request = gd->requested_freq;
> + if (new_request == last_request) {
> + schedule();
> + } else {
Shouldn't we have to do the following here?
@@ -125,9 +125,9 @@ static int cpufreq_sched_thread(void *data)
}
do {
- set_current_state(TASK_INTERRUPTIBLE);
new_request = gd->requested_freq;
if (new_request == last_request) {
+ set_current_state(TASK_INTERRUPTIBLE);
schedule();
} else {
/*
Otherwise we set task to INTERRUPTIBLE state right after it has been
woken up.
Thanks,
- Juri
> + /*
> + * if the frequency thread sleeps while waiting to be
> + * unthrottled, start over to check for a newer request
> + */
> + if (finish_last_request(gd))
> + continue;
> + last_request = new_request;
> + cpufreq_sched_try_driver_target(policy, new_request);
> + }
> + } while (!kthread_should_stop());
> +
> + return 0;
> +}
> +
> +static void cpufreq_sched_irq_work(struct irq_work *irq_work)
> +{
> + struct gov_data *gd;
> +
> + gd = container_of(irq_work, struct gov_data, irq_work);
> + if (!gd)
> + return;
> +
> + wake_up_process(gd->task);
> +}
> +
> +static void update_fdomain_capacity_request(int cpu)
> +{
> + unsigned int freq_new, index_new, cpu_tmp;
> + struct cpufreq_policy *policy;
> + struct gov_data *gd;
> + unsigned long capacity = 0;
> +
> + /*
> + * Avoid grabbing the policy if possible. A test is still
> + * required after locking the CPU's policy to avoid racing
> + * with the governor changing.
> + */
> + if (!per_cpu(enabled, cpu))
> + return;
> +
> + policy = cpufreq_cpu_get(cpu);
> + if (IS_ERR_OR_NULL(policy))
> + return;
> +
> + if (policy->governor != &cpufreq_gov_sched ||
> + !policy->governor_data)
> + goto out;
> +
> + gd = policy->governor_data;
> +
> + /* find max capacity requested by cpus in this policy */
> + for_each_cpu(cpu_tmp, policy->cpus) {
> + struct sched_capacity_reqs *scr;
> +
> + scr = &per_cpu(cpu_sched_capacity_reqs, cpu_tmp);
> + capacity = max(capacity, scr->total);
> + }
> +
> + /* Convert the new maximum capacity request into a cpu frequency */
> + freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT;
> + if (cpufreq_frequency_table_target(policy, policy->freq_table,
> + freq_new, CPUFREQ_RELATION_L,
> + &index_new))
> + goto out;
> + freq_new = policy->freq_table[index_new].frequency;
> +
> + if (freq_new == gd->requested_freq)
> + goto out;
> +
> + gd->requested_freq = freq_new;
> +
> + /*
> + * Throttling is not yet supported on platforms with fast cpufreq
> + * drivers.
> + */
> + if (cpufreq_driver_slow)
> + irq_work_queue_on(&gd->irq_work, cpu);
> + else
> + cpufreq_sched_try_driver_target(policy, freq_new);
> +
> +out:
> + cpufreq_cpu_put(policy);
> +}
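As a side note, the capacity-to-frequency conversion plus the CPUFREQ_RELATION_L table lookup above amount to "lowest supported frequency at or above the raw target". A minimal model (frequency table values are hypothetical, in kHz):

```python
SCHED_CAPACITY_SHIFT = 10

def target_freq(capacity, policy_max, freq_table):
    # freq_new = capacity * policy->max >> SCHED_CAPACITY_SHIFT
    raw = capacity * policy_max >> SCHED_CAPACITY_SHIFT
    # CPUFREQ_RELATION_L: lowest table frequency >= the raw target,
    # falling back to the highest entry if none qualifies.
    for f in sorted(freq_table):
        if f >= raw:
            return f
    return max(freq_table)

table = [500000, 1000000, 1500000, 2000000]
# capacity 625 of 1024 at a 2GHz max gives a raw target of
# 1220703 kHz, which rounds up to the 1.5GHz OPP.
print(target_freq(625, 2000000, table))  # 1500000
```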
> +
> +void update_cpu_capacity_request(int cpu, bool request)
> +{
> + unsigned long new_capacity;
> + struct sched_capacity_reqs *scr;
> +
> + /* The rq lock serializes access to the CPU's sched_capacity_reqs. */
> + lockdep_assert_held(&cpu_rq(cpu)->lock);
> +
> + scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> +
> + new_capacity = scr->cfs + scr->rt;
> + new_capacity = new_capacity * capacity_margin
> + / SCHED_CAPACITY_SCALE;
> + new_capacity += scr->dl;
> +
> + if (new_capacity == scr->total)
> + return;
> +
> + scr->total = new_capacity;
> + if (request)
> + update_fdomain_capacity_request(cpu);
> +}
> +
> +static inline void set_sched_freq(void)
> +{
> + static_key_slow_inc(&__sched_freq);
> +}
> +
> +static inline void clear_sched_freq(void)
> +{
> + static_key_slow_dec(&__sched_freq);
> +}
> +
> +static int cpufreq_sched_policy_init(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd;
> + int cpu;
> +
> + for_each_cpu(cpu, policy->cpus)
> + memset(&per_cpu(cpu_sched_capacity_reqs, cpu), 0,
> + sizeof(struct sched_capacity_reqs));
> +
> + gd = kzalloc(sizeof(*gd), GFP_KERNEL);
> + if (!gd)
> + return -ENOMEM;
> +
> + gd->throttle_nsec = policy->cpuinfo.transition_latency ?
> + policy->cpuinfo.transition_latency :
> + THROTTLE_NSEC;
> + pr_debug("%s: throttle threshold = %u [ns]\n",
> + __func__, gd->throttle_nsec);
> +
> + if (cpufreq_driver_is_slow()) {
> + cpufreq_driver_slow = true;
> + gd->task = kthread_create(cpufreq_sched_thread, policy,
> + "kschedfreq:%d",
> + cpumask_first(policy->related_cpus));
> + if (IS_ERR_OR_NULL(gd->task)) {
> + pr_err("%s: failed to create kschedfreq thread\n",
> + __func__);
> + goto err;
> + }
> + get_task_struct(gd->task);
> + kthread_bind_mask(gd->task, policy->related_cpus);
> + wake_up_process(gd->task);
> + init_irq_work(&gd->irq_work, cpufreq_sched_irq_work);
> + }
> +
> + policy->governor_data = gd;
> + set_sched_freq();
> +
> + return 0;
> +
> +err:
> + kfree(gd);
> + return -ENOMEM;
> +}
> +
> +static int cpufreq_sched_policy_exit(struct cpufreq_policy *policy)
> +{
> + struct gov_data *gd = policy->governor_data;
> +
> + clear_sched_freq();
> + if (cpufreq_driver_slow) {
> + kthread_stop(gd->task);
> + put_task_struct(gd->task);
> + }
> +
> + policy->governor_data = NULL;
> +
> + kfree(gd);
> + return 0;
> +}
> +
> +static int cpufreq_sched_start(struct cpufreq_policy *policy)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, policy->cpus)
> + per_cpu(enabled, cpu) = 1;
> +
> + return 0;
> +}
> +
> +static int cpufreq_sched_stop(struct cpufreq_policy *policy)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, policy->cpus)
> + per_cpu(enabled, cpu) = 0;
> +
> + return 0;
> +}
> +
> +static int cpufreq_sched_setup(struct cpufreq_policy *policy,
> + unsigned int event)
> +{
> + switch (event) {
> + case CPUFREQ_GOV_POLICY_INIT:
> + return cpufreq_sched_policy_init(policy);
> + case CPUFREQ_GOV_POLICY_EXIT:
> + return cpufreq_sched_policy_exit(policy);
> + case CPUFREQ_GOV_START:
> + return cpufreq_sched_start(policy);
> + case CPUFREQ_GOV_STOP:
> + return cpufreq_sched_stop(policy);
> + case CPUFREQ_GOV_LIMITS:
> + break;
> + }
> + return 0;
> +}
> +
> +#ifndef CONFIG_CPU_FREQ_DEFAULT_GOV_SCHED
> +static
> +#endif
> +struct cpufreq_governor cpufreq_gov_sched = {
> + .name = "sched",
> + .governor = cpufreq_sched_setup,
> + .owner = THIS_MODULE,
> +};
> +
> +static int __init cpufreq_sched_init(void)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, cpu_possible_mask)
> + per_cpu(enabled, cpu) = 0;
> + return cpufreq_register_governor(&cpufreq_gov_sched);
> +}
> +
> +/* Try to make this the default governor */
> +fs_initcall(cpufreq_sched_init);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index a5a6b3e..a88dbec 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1383,6 +1383,57 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> }
> #endif
>
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHED
> +extern unsigned int capacity_margin;
> +extern struct static_key __sched_freq;
> +
> +static inline bool sched_freq(void)
> +{
> + return static_key_false(&__sched_freq);
> +}
> +
> +DECLARE_PER_CPU(struct sched_capacity_reqs, cpu_sched_capacity_reqs);
> +void update_cpu_capacity_request(int cpu, bool request);
> +
> +static inline void set_cfs_cpu_capacity(int cpu, bool request,
> + unsigned long capacity)
> +{
> + if (per_cpu(cpu_sched_capacity_reqs, cpu).cfs != capacity) {
> + per_cpu(cpu_sched_capacity_reqs, cpu).cfs = capacity;
> + update_cpu_capacity_request(cpu, request);
> + }
> +}
> +
> +static inline void set_rt_cpu_capacity(int cpu, bool request,
> + unsigned long capacity)
> +{
> + if (per_cpu(cpu_sched_capacity_reqs, cpu).rt != capacity) {
> + per_cpu(cpu_sched_capacity_reqs, cpu).rt = capacity;
> + update_cpu_capacity_request(cpu, request);
> + }
> +}
> +
> +static inline void set_dl_cpu_capacity(int cpu, bool request,
> + unsigned long capacity)
> +{
> + if (per_cpu(cpu_sched_capacity_reqs, cpu).dl != capacity) {
> + per_cpu(cpu_sched_capacity_reqs, cpu).dl = capacity;
> + update_cpu_capacity_request(cpu, request);
> + }
> +}
> +#else
> +static inline bool sched_freq(void) { return false; }
> +static inline void set_cfs_cpu_capacity(int cpu, bool request,
> + unsigned long capacity)
> +{ }
> +static inline void set_rt_cpu_capacity(int cpu, bool request,
> + unsigned long capacity)
> +{ }
> +static inline void set_dl_cpu_capacity(int cpu, bool request,
> + unsigned long capacity)
> +{ }
> +#endif
> +
> static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
> {
> rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
> --
> 2.4.10
>
Hi Steve,
On 08/12/15 22:19, Steve Muckle wrote:
> Since the true utilization of a long running task is not detectable
> while it is running and might be bigger than the current cpu capacity,
> create the maximum cpu capacity head room by requesting the maximum
> cpu capacity once the cpu usage plus the capacity margin exceeds the
> current capacity. This also minimizes the performance impact on the
> task.
>
> Original fair-class only version authored by Juri Lelli
> <[email protected]>.
>
> cc: Ingo Molnar <[email protected]>
> cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Juri Lelli <[email protected]>
> Signed-off-by: Steve Muckle <[email protected]>
> ---
> kernel/sched/core.c | 41 ++++++++++++++++++++++++++++++++++++
> kernel/sched/fair.c | 57 --------------------------------------------------
> kernel/sched/sched.h | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 100 insertions(+), 57 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 4c8c353e..3f4d907 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2869,6 +2869,45 @@ unsigned long long task_sched_runtime(struct task_struct *p)
> return ns;
> }
>
> +#ifdef CONFIG_CPU_FREQ_GOV_SCHED
> +static unsigned long sum_capacity_reqs(unsigned long cfs_cap,
> + struct sched_capacity_reqs *scr)
> +{
> + unsigned long total = cfs_cap + scr->rt;
> +
> + total = total * capacity_margin;
> + total /= SCHED_CAPACITY_SCALE;
> + total += scr->dl;
> + return total;
> +}
> +
> +static void sched_freq_tick(int cpu)
> +{
> + struct sched_capacity_reqs *scr;
> + unsigned long capacity_orig, capacity_curr;
> +
> + if (!sched_freq())
> + return;
> +
> + capacity_orig = capacity_orig_of(cpu);
> + capacity_curr = capacity_curr_of(cpu);
> + if (capacity_curr == capacity_orig)
> + return;
> +
> + /*
> + * To make free room for a task that is building up its "real"
> + * utilization and to harm its performance the least, request
> + * a jump to max OPP as soon as the margin of free capacity is
> + * impacted (specified by capacity_margin).
> + */
> + scr = &per_cpu(cpu_sched_capacity_reqs, cpu);
> + if (capacity_curr < sum_capacity_reqs(cpu_util(cpu), scr))
> + set_cfs_cpu_capacity(cpu, true, capacity_max);
> +}
> +#else
> +static inline void sched_freq_tick(int cpu) { }
> +#endif
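A numeric sketch of the tick-time check above (hypothetical values; capacity_max is SCHED_CAPACITY_SCALE, and the margin matches the patch's default):

```python
SCHED_CAPACITY_SCALE = 1024
CAPACITY_MARGIN = 1280

def sum_capacity_reqs(cfs_cap, rt, dl):
    # Mirrors sum_capacity_reqs(): margin applied to CFS+RT,
    # deadline request added without headroom.
    return (cfs_cap + rt) * CAPACITY_MARGIN // SCHED_CAPACITY_SCALE + dl

def tick_wants_max(capacity_curr, cpu_util, rt, dl):
    # Request the max OPP as soon as the current capacity no
    # longer covers the margined requests.
    return capacity_curr < sum_capacity_reqs(cpu_util, rt, dl)

# Running at capacity 600 with 500 of CFS utilization: the 25%
# margin pushes the request to 625, so the tick jumps to max.
print(tick_wants_max(600, 500, 0, 0))  # True
print(tick_wants_max(700, 500, 0, 0))  # False
```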
> +
> /*
> * This function gets called by the timer code, with HZ frequency.
> * We call it with interrupts disabled.
> @@ -2895,6 +2934,8 @@ void scheduler_tick(void)
> trigger_load_balance(rq);
> #endif
> rq_last_tick_reset(rq);
> +
> + sched_freq_tick(cpu);
We are not holding rq->lock anymore at this point, and this collides
with comment in update_cpu_capacity_request(). Can't you just move this
up before raw_spin_unlock(&rq->lock)?
Thanks,
- Juri
> }
>
> #ifdef CONFIG_NO_HZ_FULL
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 880ceee..4c49f76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4199,9 +4199,6 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> -static unsigned long capacity_orig_of(int cpu);
> -static int cpu_util(int cpu);
> -
> static void update_capacity_of(int cpu)
> {
> unsigned long req_cap;
> @@ -4601,15 +4598,6 @@ static unsigned long target_load(int cpu, int type)
> return max(rq->cpu_load[type-1], total);
> }
>
> -static unsigned long capacity_of(int cpu)
> -{
> - return cpu_rq(cpu)->cpu_capacity;
> -}
> -
> -static unsigned long capacity_orig_of(int cpu)
> -{
> - return cpu_rq(cpu)->cpu_capacity_orig;
> -}
>
> static unsigned long cpu_avg_load_per_task(int cpu)
> {
> @@ -4779,17 +4767,6 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
> #endif
>
> /*
> - * Returns the current capacity of cpu after applying both
> - * cpu and freq scaling.
> - */
> -static unsigned long capacity_curr_of(int cpu)
> -{
> - return cpu_rq(cpu)->cpu_capacity_orig *
> - arch_scale_freq_capacity(NULL, cpu)
> - >> SCHED_CAPACITY_SHIFT;
> -}
> -
> -/*
> * Detect M:N waker/wakee relationships via a switching-frequency heuristic.
> * A waker of many should wake a different task than the one last awakened
> * at a frequency roughly N times higher than one of its wakees. In order
> @@ -5033,40 +5010,6 @@ done:
> }
>
> /*
> - * cpu_util returns the amount of capacity of a CPU that is used by CFS
> - * tasks. The unit of the return value must be the one of capacity so we can
> - * compare the utilization with the capacity of the CPU that is available for
> - * CFS task (ie cpu_capacity).
> - *
> - * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
> - * recent utilization of currently non-runnable tasks on a CPU. It represents
> - * the amount of utilization of a CPU in the range [0..capacity_orig] where
> - * capacity_orig is the cpu_capacity available at the highest frequency
> - * (arch_scale_freq_capacity()).
> - * The utilization of a CPU converges towards a sum equal to or less than the
> - * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
> - * the running time on this CPU scaled by capacity_curr.
> - *
> - * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
> - * higher than capacity_orig because of unfortunate rounding in
> - * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
> - * the average stabilizes with the new running time. We need to check that the
> - * utilization stays within the range of [0..capacity_orig] and cap it if
> - * necessary. Without utilization capping, a group could be seen as overloaded
> - * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
> - * available capacity. We allow utilization to overshoot capacity_curr (but not
> - * capacity_orig) as it useful for predicting the capacity required after task
> - * migrations (scheduler-driven DVFS).
> - */
> -static int cpu_util(int cpu)
> -{
> - unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> - unsigned long capacity = capacity_orig_of(cpu);
> -
> - return (util >= capacity) ? capacity : util;
> -}
> -
> -/*
> * select_task_rq_fair: Select target runqueue for the waking task in domains
> * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index ad82274..90d5df6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1384,7 +1384,66 @@ unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> }
> #endif
>
> +#ifdef CONFIG_SMP
> +static inline unsigned long capacity_of(int cpu)
> +{
> + return cpu_rq(cpu)->cpu_capacity;
> +}
> +
> +static inline unsigned long capacity_orig_of(int cpu)
> +{
> + return cpu_rq(cpu)->cpu_capacity_orig;
> +}
> +
> +/*
> + * cpu_util returns the amount of capacity of a CPU that is used by CFS
> + * tasks. The unit of the return value must be the one of capacity so we can
> + * compare the utilization with the capacity of the CPU that is available for
> + * CFS task (ie cpu_capacity).
> + *
> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
> + * recent utilization of currently non-runnable tasks on a CPU. It represents
> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
> + * capacity_orig is the cpu_capacity available at the highest frequency
> + * (arch_scale_freq_capacity()).
> + * The utilization of a CPU converges towards a sum equal to or less than the
> + * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
> + * the running time on this CPU scaled by capacity_curr.
> + *
> + * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
> + * higher than capacity_orig because of unfortunate rounding in
> + * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
> + * the average stabilizes with the new running time. We need to check that the
> + * utilization stays within the range of [0..capacity_orig] and cap it if
> + * necessary. Without utilization capping, a group could be seen as overloaded
> + * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
> + * available capacity. We allow utilization to overshoot capacity_curr (but not
> + * capacity_orig) as it is useful for predicting the capacity required after task
> + * migrations (scheduler-driven DVFS).
> + */
> +static inline int cpu_util(int cpu)
> +{
> + unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> + unsigned long capacity = capacity_orig_of(cpu);
> +
> + return (util >= capacity) ? capacity : util;
> +}
> +
> +/*
> + * Returns the current capacity of cpu after applying both
> + * cpu and freq scaling.
> + */
> +static inline unsigned long capacity_curr_of(int cpu)
> +{
> + return cpu_rq(cpu)->cpu_capacity_orig *
> + arch_scale_freq_capacity(NULL, cpu)
> + >> SCHED_CAPACITY_SHIFT;
> +}
> +
> +#endif
> +
> #ifdef CONFIG_CPU_FREQ_GOV_SCHED
> +#define capacity_max SCHED_CAPACITY_SCALE
> extern unsigned int capacity_margin;
> extern struct static_key __sched_freq;
>
> --
> 2.4.10
>
On 08/12/15 22:19, Steve Muckle wrote:
> From: Vincent Guittot <[email protected]>
>
> RT tasks don't provide any running constraints like deadline ones
> except their running priority. The only current usable input to
> estimate the capacity needed by RT tasks is the rt_avg metric. We use
> it to estimate the CPU capacity needed for the RT scheduler class.
>
> In order to monitor the evolution of RT task load, we must
> periodically check it during the tick.
>
> Then, we use the estimated capacity of the last activity to estimate
> the next one, which may not be very accurate but is a good starting
> point without any impact on the wake-up path of RT tasks.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Signed-off-by: Steve Muckle <[email protected]>
> ---
> kernel/sched/rt.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 8ec86ab..9694204 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1426,6 +1426,41 @@ static void check_preempt_curr_rt(struct rq *rq, struct task_struct *p, int flag
> #endif
> }
>
> +#ifdef CONFIG_SMP
s/SMP/CPU_FREQ_GOV_SCHED/ ?
> +static void sched_rt_update_capacity_req(struct rq *rq)
> +{
> + u64 total, used, age_stamp, avg;
> + s64 delta;
> +
> + if (!sched_freq())
> + return;
> +
> + sched_avg_update(rq);
> + /*
> + * Since we're reading these variables without serialization make sure
> + * we read them once before doing sanity checks on them.
> + */
> + age_stamp = READ_ONCE(rq->age_stamp);
> + avg = READ_ONCE(rq->rt_avg);
> + delta = rq_clock(rq) - age_stamp;
> +
> + if (unlikely(delta < 0))
> + delta = 0;
> +
> + total = sched_avg_period() + delta;
> +
> + used = div_u64(avg, total);
> + if (unlikely(used > SCHED_CAPACITY_SCALE))
> + used = SCHED_CAPACITY_SCALE;
> +
> + set_rt_cpu_capacity(rq->cpu, 1, (unsigned long)(used));
Minor nitpick: s/1/true/ .
Thanks,
- Juri
> +}
> +#else
> +static inline void sched_rt_update_capacity_req(struct rq *rq)
> +{ }
> +
> +#endif
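The rt_avg-to-capacity conversion above can be sketched numerically as well (clock values are hypothetical nanoseconds; the 500ms averaging window is an assumption standing in for sched_avg_period()):

```python
SCHED_CAPACITY_SCALE = 1024

def rt_used_capacity(rt_avg, avg_period_ns, clock_ns, age_stamp_ns):
    # Mirrors sched_rt_update_capacity_req(): rt_avg is scaled by
    # the averaging window plus the not-yet-decayed delta, then
    # clamped to the full capacity scale.
    delta = max(0, clock_ns - age_stamp_ns)
    total = avg_period_ns + delta
    return min(rt_avg // total, SCHED_CAPACITY_SCALE)

# rt_avg worth 256 capacity units over a 0.5s window, no delta:
print(rt_used_capacity(256 * 500_000_000, 500_000_000, 0, 0))  # 256
# With 100ms of un-decayed delta the estimate dilutes to 213:
print(rt_used_capacity(256 * 500_000_000, 500_000_000,
                       600_000_000, 500_000_000))  # 213
```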
> +
> static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
> struct rt_rq *rt_rq)
> {
> @@ -1494,8 +1529,17 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
> if (prev->sched_class == &rt_sched_class)
> update_curr_rt(rq);
>
> - if (!rt_rq->rt_queued)
> + if (!rt_rq->rt_queued) {
> + /*
> + * The next task to be picked on this rq will have a lower
> + * priority than rt tasks so we can spend some time to update
> + * the capacity used by rt tasks based on the last activity.
> + * This value will then be used as an estimate of the next
> + * activity.
> + */
> + sched_rt_update_capacity_req(rq);
> return NULL;
> + }
>
> put_prev_task(rq, prev);
>
> @@ -2212,6 +2256,9 @@ static void task_tick_rt(struct rq *rq, struct task_struct *p, int queued)
>
> update_curr_rt(rq);
>
> + if (rq->rt.rt_nr_running)
> + sched_rt_update_capacity_req(rq);
> +
> watchdog(rq, p);
>
> /*
> --
> 2.4.10
>
On 11 December 2015 at 08:48, Luca Abeni <[email protected]> wrote:
> Hi Vincent,
>
> On 12/10/2015 05:11 PM, Vincent Guittot wrote:
> [...]
>>>
>>> If yes, I think your approach is safe (and easier to implement - modulo a
>>> small
>>> issue when a task terminates of switches to other scheduling policies; I
>>> think
>>> there already are some "XXX" comments in the current code). However, it
>>> allows to
>>> save less energy (or reclaim less CPU time). For example, if I create a
>>> SCHED_DEADLINE
>>> task with runtime 90ms and period 100ms it will not allow to scale the
>>> CPU
>>> frequency
>>> even if it never executes (because is always blocked).
>>
>>
>> Yes, i agree. If the task behavior is not aligned with its deadline
>> parameters, we will over provisioned CPU capacity to the deadline
>> scheduler.
>> This metric is not used to select the OPP but to provisioned some CPU
>> capacity to the deadline scheduler and to inform the CFS scheduler of
>> the remaining capacity.
>
> Ah, sorry; I missed this point.
>
>
>>>>> + /* This is the "average utilization" for this runqueue */
>>>>> + s64 avg_bw;
>>>>> };
>>>
>>>
>>> Small nit: why "average" utilization? I think a better name would be
>>> "runqueue utilization"
>>> or "local utilization", or something similar... If I understand correctly
>>> (sorry if I
>>> missed something), this is not an average, but the sum of the
>>> utilisations
>>> of the tasks
>>> on this runqueue... No?
>>
>>
>> I have used "average" because it doesn't reflect the active/actual
>> utilization of the run-queue but the forecasted average bandwidth of
>> the CPU that will be used by deadline task.
>
> Well, this is just nitpicking, so feel free to ignore (I just mentioned
> this point because I was initially confused by the "average" name). But I
> think this is "maximum", or "worst-case", not "average", because (as far
> as I can understand) this field indicates that SCHED_DEADLINE tasks will
> not be able to consume more than this fraction of CPU (if they try to
> consume more, the scheduler throttles them).
>
>> I'm open to change the name if another one makes more sense
>
> In real-time literature this is often called simply "utilization" (or
> "worst-case
> utilization" by someone): when a task can have a variable execution time,
> its
> utilization is defined as WCET (maximum execution time) / period.
ok. Let follow real-time literature wording and remove "average" to
keep only utilization.
so the variable will be named:
s64 util_bw;
Thanks
>
>
>
> Luca
On 12/14/2015 03:02 PM, Vincent Guittot wrote:
[...]
>>>> Small nit: why "average" utilization? I think a better name would be
>>>> "runqueue utilization"
>>>> or "local utilization", or something similar... If I understand correctly
>>>> (sorry if I
>>>> missed something), this is not an average, but the sum of the
>>>> utilisations
>>>> of the tasks
>>>> on this runqueue... No?
>>>
>>>
>>> I have used "average" because it doesn't reflect the active/actual
>>> utilization of the run-queue but the forecasted average bandwidth of
>>> the CPU that will be used by deadline task.
>>
>> Well, this is just nitpicking, so feel free to ignore (I just mentioned
>> this point because I was initially confused by the "average" name). But I
>> think this is "maximum", or "worst-case", not "average", because (as far
>> as I can understand) this field indicates that SCHED_DEADLINE tasks will
>> not be able to consume more than this fraction of CPU (if they try to
>> consume more, the scheduler throttles them).
>>
>>> I'm open to change the name if another one makes more sense
>>
>> In real-time literature this is often called simply "utilization" (or
>> "worst-case
>> utilization" by someone): when a task can have a variable execution time,
>> its
>> utilization is defined as WCET (maximum execution time) / period.
>
> ok. Let follow real-time literature wording and remove "average" to
> keep only utilization.
> so the variable will be named:
>
> s64 util_bw;
Well, "utilization" and "bandwidth" are often used to indicate the same
quantity, so "util_bw" sounds strange. I'd call it simply "utilization" or
"bandwidth" (otherwise, just leave the name as it is... I said this is just
nitpicking).
Luca
On Tue, Dec 08, 2015 at 10:19:30PM -0800, Steve Muckle wrote:
> From: Vincent Guittot <[email protected]>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 8b0a15e..9d9eb50 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -43,6 +43,24 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
> return !RB_EMPTY_NODE(&dl_se->rb_node);
> }
>
> +static void add_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + u64 se_bw = dl_se->dl_bw;
> +
> + dl_rq->avg_bw += se_bw;
> +}
> +
> +static void clear_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> +{
> + u64 se_bw = dl_se->dl_bw;
> +
> + dl_rq->avg_bw -= se_bw;
> + if (dl_rq->avg_bw < 0) {
> + WARN_ON(1);
> + dl_rq->avg_bw = 0;
> + }
> +}
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4c49f76..ce05f61 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6203,6 +6203,14 @@ static unsigned long scale_rt_capacity(int cpu)
>
> used = div_u64(avg, total);
>
> + /*
> + * deadline bandwidth is defined at system level so we must
> + * weight this bandwidth with the max capacity of the system.
> + * As a reminder, avg_bw is 20bits width and
> + * scale_cpu_capacity is 10 bits width
> + */
> + used += div_u64(rq->dl.avg_bw, arch_scale_cpu_capacity(NULL, cpu));
> +
> if (likely(used < SCHED_CAPACITY_SCALE))
> return SCHED_CAPACITY_SCALE - used;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 08858d1..e44c6be 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -519,6 +519,8 @@ struct dl_rq {
> #else
> struct dl_bw dl_bw;
> #endif
> + /* This is the "average utilization" for this runqueue */
> + s64 avg_bw;
> };
So I don't think this is right. AFAICT this projects the WCET as the
amount of time actually used by DL. This will, under many circumstances,
vastly overestimate the amount of time actually spent on it, and therefore
unduly pessimize the fair capacity of this CPU.
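The accounting under discussion is small enough to condense into a userspace sketch (struct names here are illustrative stand-ins for the quoted patch's dl_rq/sched_dl_entity; the 1 << 20 fixed-point scale matches the 20-bit avg_bw width mentioned in the fair.c comment):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace sketch of the quoted add/clear_average_bw accounting.
 * dl_bw is a task's admitted bandwidth, runtime/period scaled by
 * 1 << 20; avg_bw is signed so the underflow check is meaningful.
 */
struct dl_rq_sketch { int64_t avg_bw; };
struct dl_se_sketch { uint64_t dl_bw; };

static void add_average_bw(struct dl_se_sketch *se, struct dl_rq_sketch *rq)
{
        rq->avg_bw += (int64_t)se->dl_bw;
}

static void clear_average_bw(struct dl_se_sketch *se, struct dl_rq_sketch *rq)
{
        rq->avg_bw -= (int64_t)se->dl_bw;
        if (rq->avg_bw < 0)     /* accounting bug: clamp instead of wrapping */
                rq->avg_bw = 0;
}
```

Peter's objection is to what the sum means, not to the arithmetic: each dl_bw is a worst-case reservation, so the sum bounds the time DL may consume rather than estimating what it actually does.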
On 14 December 2015 at 16:17, Peter Zijlstra <[email protected]> wrote:
> On Tue, Dec 08, 2015 at 10:19:30PM -0800, Steve Muckle wrote:
>> From: Vincent Guittot <[email protected]>
>
>> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
>> index 8b0a15e..9d9eb50 100644
>> --- a/kernel/sched/deadline.c
>> +++ b/kernel/sched/deadline.c
>> @@ -43,6 +43,24 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
>> return !RB_EMPTY_NODE(&dl_se->rb_node);
>> }
>>
>> +static void add_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> + u64 se_bw = dl_se->dl_bw;
>> +
>> + dl_rq->avg_bw += se_bw;
>> +}
>> +
>> +static void clear_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
>> +{
>> + u64 se_bw = dl_se->dl_bw;
>> +
>> + dl_rq->avg_bw -= se_bw;
>> + if (dl_rq->avg_bw < 0) {
>> + WARN_ON(1);
>> + dl_rq->avg_bw = 0;
>> + }
>> +}
>
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4c49f76..ce05f61 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6203,6 +6203,14 @@ static unsigned long scale_rt_capacity(int cpu)
>>
>> used = div_u64(avg, total);
>>
>> + /*
>> + * deadline bandwidth is defined at system level so we must
>> + * weight this bandwidth with the max capacity of the system.
>> + * As a reminder, avg_bw is 20bits width and
>> + * scale_cpu_capacity is 10 bits width
>> + */
>> + used += div_u64(rq->dl.avg_bw, arch_scale_cpu_capacity(NULL, cpu));
>> +
>> if (likely(used < SCHED_CAPACITY_SCALE))
>> return SCHED_CAPACITY_SCALE - used;
>>
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 08858d1..e44c6be 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -519,6 +519,8 @@ struct dl_rq {
>> #else
>> struct dl_bw dl_bw;
>> #endif
>> + /* This is the "average utilization" for this runqueue */
>> + s64 avg_bw;
>> };
>
> So I don't think this is right. AFAICT this projects the WCET as the
> amount of time actually used by DL. This will, under many circumstances,
> vastly overestimate the amount of time actually spend on it. Therefore
> unduly pessimisme the fair capacity of this CPU.
I agree that if the WCET is far from reality, we will underestimate the
available capacity for CFS. Have you got some use case in mind which
overestimates the WCET?
If we can't rely on this parameter to evaluate the amount of capacity
used by the deadline scheduler on a core, this will imply that we can't
use it for requesting capacity from cpufreq either, and we should fall
back on a monitoring mechanism which reacts to a change instead of
anticipating it.
>
>
On 14/12/15 16:56, Vincent Guittot wrote:
> On 14 December 2015 at 16:17, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Dec 08, 2015 at 10:19:30PM -0800, Steve Muckle wrote:
> >> From: Vincent Guittot <[email protected]>
> >
> >> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> >> index 8b0a15e..9d9eb50 100644
> >> --- a/kernel/sched/deadline.c
> >> +++ b/kernel/sched/deadline.c
> >> @@ -43,6 +43,24 @@ static inline int on_dl_rq(struct sched_dl_entity *dl_se)
> >> return !RB_EMPTY_NODE(&dl_se->rb_node);
> >> }
> >>
> >> +static void add_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >> + u64 se_bw = dl_se->dl_bw;
> >> +
> >> + dl_rq->avg_bw += se_bw;
> >> +}
> >> +
> >> +static void clear_average_bw(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq)
> >> +{
> >> + u64 se_bw = dl_se->dl_bw;
> >> +
> >> + dl_rq->avg_bw -= se_bw;
> >> + if (dl_rq->avg_bw < 0) {
> >> + WARN_ON(1);
> >> + dl_rq->avg_bw = 0;
> >> + }
> >> +}
> >
> >
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 4c49f76..ce05f61 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6203,6 +6203,14 @@ static unsigned long scale_rt_capacity(int cpu)
> >>
> >> used = div_u64(avg, total);
> >>
> >> + /*
> >> + * deadline bandwidth is defined at system level so we must
> >> + * weight this bandwidth with the max capacity of the system.
> >> + * As a reminder, avg_bw is 20bits width and
> >> + * scale_cpu_capacity is 10 bits width
> >> + */
> >> + used += div_u64(rq->dl.avg_bw, arch_scale_cpu_capacity(NULL, cpu));
> >> +
> >> if (likely(used < SCHED_CAPACITY_SCALE))
> >> return SCHED_CAPACITY_SCALE - used;
> >>
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 08858d1..e44c6be 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -519,6 +519,8 @@ struct dl_rq {
> >> #else
> >> struct dl_bw dl_bw;
> >> #endif
> >> + /* This is the "average utilization" for this runqueue */
> >> + s64 avg_bw;
> >> };
> >
> > So I don't think this is right. AFAICT this projects the WCET as the
> > amount of time actually used by DL. This will, under many circumstances,
> > vastly overestimate the amount of time actually spend on it. Therefore
> > unduly pessimisme the fair capacity of this CPU.
>
> I agree that if the WCET is far from reality, we will underestimate
> available capacity for CFS. Have you got some use case in mind which
> overestimates the WCET ?
I guess simply the fact that one task can be admitted to the system, but
then in practice sleep, waiting for some event to happen.
> If we can't rely on this parameters to evaluate the amount of capacity
> used by deadline scheduler on a core, this will imply that we can't
> also use it for requesting capacity to cpufreq and we should fallback
> on a monitoring mechanism which reacts to a change instead of
> anticipating it.
>
There is at least one middle way: use the utilization of active
servers (as I think Luca was already mentioning). This solution should
remove some of the pessimism while still being safe for our needs. I should
be able to play with this alternative in the (hopefully) near future.
Thanks,
- Juri
On Mon, Dec 14, 2015 at 04:56:17PM +0100, Vincent Guittot wrote:
> I agree that if the WCET is far from reality, we will underestimate
> available capacity for CFS. Have you got some use case in mind which
> overestimates the WCET ?
Pretty much any 'correct' WCET is pessimistic. There's heaps of smart
people working on improving WCET bounds, but they're still out there.
This is mostly because of the .00001% tail cases that 'never' happen but
would make your tokamak burn a hole just when you're outside.
> If we can't rely on this parameters to evaluate the amount of capacity
> used by deadline scheduler on a core, this will imply that we can't
> also use it for requesting capacity to cpufreq and we should fallback
> on a monitoring mechanism which reacts to a change instead of
> anticipating it.
No, since the WCET can and _will_ happen, it's the best you can do with
cpufreq. If you were to set it lower, you might not be able to execute
correctly in your 'never' tail cases.
There 'might' be smart pants ways around this, where you run part of the
execution at lower speed and switch to a higher speed to 'catch' up if
you exceed some boundary, such that, on average, you run at the same
speed the WCET mandates, but I'm not sure that's worth it. Juri/Luca
might know.
On Mon, 14 Dec 2015 16:56:17 +0100
Vincent Guittot <[email protected]> wrote:
[...]
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 08858d1..e44c6be 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -519,6 +519,8 @@ struct dl_rq {
> >> #else
> >> struct dl_bw dl_bw;
> >> #endif
> >> + /* This is the "average utilization" for this runqueue */
> >> + s64 avg_bw;
> >> };
> >
> > So I don't think this is right. AFAICT this projects the WCET as the
> > amount of time actually used by DL. This will, under many
> > circumstances, vastly overestimate the amount of time actually
> > spend on it. Therefore unduly pessimisme the fair capacity of this
> > CPU.
>
> I agree that if the WCET is far from reality, we will underestimate
> available capacity for CFS. Have you got some use case in mind which
> overestimates the WCET ?
> If we can't rely on this parameters to evaluate the amount of capacity
> used by deadline scheduler on a core, this will imply that we can't
> also use it for requesting capacity to cpufreq and we should fallback
> on a monitoring mechanism which reacts to a change instead of
> anticipating it.
I think a more "theoretically sound" approach would be to track the
_active_ utilisation (informally speaking, the sum of the utilisations
of the tasks that are actually active on a core - the exact definition
of "active" is the trick here).
As done, for example, here:
https://github.com/lucabe72/linux-reclaiming/tree/track-utilisation-v2
(in particular, see
https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b
)
I understand this approach might look too complex... But I think it is
much less pessimistic while still being "safe".
If there is something that I can do to make that code more acceptable,
let me know.
Luca
On Mon, 14 Dec 2015 16:07:59 +0000
Juri Lelli <[email protected]> wrote:
[...]
> > I agree that if the WCET is far from reality, we will underestimate
> > available capacity for CFS. Have you got some use case in mind which
> > overestimates the WCET ?
>
> I guess simply the fact that one task can be admitted to the system,
> but then in practice sleep, waiting from some event to happen.
My favourite example (since 1998 :) is a video player (but every task
processing compressed video should work as an example): there is a
noticeable difference between the time needed to process large I frames
with a lot of movement (that is about the WCET) and the time needed to
process small B frames with not much movement. And if we want to avoid
too much jitter in the video playback we have to allocate the runtime
based on the maximum time needed to process a video frame.
> > If we can't rely on this parameters to evaluate the amount of
> > capacity used by deadline scheduler on a core, this will imply that
> > we can't also use it for requesting capacity to cpufreq and we
> > should fallback on a monitoring mechanism which reacts to a change
> > instead of anticipating it.
> >
>
> There is at least one way in the middle: use utilization of active
> servers (as I think Luca was already mentioning). This solution should
> remove some of the pessimism, but still be safe for our needs.
If you track the active utilisation as done by the GRUB algorithm
( http://retis.sssup.it/~lipari/papers/lipariBaruah2000.pdf ) and by my
patches, you can remove _all_ the pessimism :)
Luca
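For readers unfamiliar with GRUB-style tracking, here is a heavily simplified, hypothetical userspace sketch of the idea (the real implementation lives in Luca's linux-reclaiming branch and uses an hrtimer for the deferred decrease): a task's bandwidth joins the active utilisation at wake-up, but only leaves it at the 0-lag time, approximated below as now + remaining_runtime / U.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT 20     /* utilization is a fraction scaled by 1 << 20 */

struct grub_task {
        uint64_t bw;            /* runtime/period << BW_SHIFT */
        uint64_t remaining;     /* remaining runtime when blocking (ns) */
        uint64_t zero_lag;      /* absolute 0-lag time (ns) */
};

struct grub_rq { uint64_t active_bw; };

/* Task becomes active again (after its contribution was removed). */
static void grub_wakeup(struct grub_rq *rq, struct grub_task *t)
{
        rq->active_bw += t->bw;
}

/* Task blocks: remember when its contribution may be removed. */
static void grub_block(struct grub_task *t, uint64_t now)
{
        /* remaining / U = how long the reservation is still "owed" */
        t->zero_lag = now + ((t->remaining << BW_SHIFT) / t->bw);
}

/* Fired by the "inactive timer" at the 0-lag time, if still blocked. */
static void grub_zero_lag_fired(struct grub_rq *rq, struct grub_task *t)
{
        rq->active_bw -= t->bw;
}
```

If the task wakes again before its 0-lag time, the timer is cancelled and the contribution simply stays; that deferral is exactly why the decrease cannot happen at block time.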
On Mon, 14 Dec 2015 17:51:28 +0100
Peter Zijlstra <[email protected]> wrote:
> On Mon, Dec 14, 2015 at 04:56:17PM +0100, Vincent Guittot wrote:
> > I agree that if the WCET is far from reality, we will underestimate
> > available capacity for CFS. Have you got some use case in mind which
> > overestimates the WCET ?
>
> Pretty much any 'correct' WCET is pessimistic. There's heaps of smart
> people working on improving WCET bounds, but they're still out there.
> This is mostly because of the .00001% tail cases that 'never' happen
> but would make your tokamak burn a hole just when you're outside.
As I mentioned in a previous email, you do not even need to consider
these extreme cases... If a task has a highly variable execution time
(I always mention video players and compressed video processing, but
colleagues working on computer vision told me that some video tracking
algorithms have similar characteristics) you might want to allocate the
runtime based on the maximum execution time (or a time near the
maximum)... But the task will consume less than that most of the time.
> > If we can't rely on this parameters to evaluate the amount of
> > capacity used by deadline scheduler on a core, this will imply that
> > we can't also use it for requesting capacity to cpufreq and we
> > should fallback on a monitoring mechanism which reacts to a change
> > instead of anticipating it.
>
> No, since the WCET can and _will_ happen, its the best you can do with
> cpufreq. If you were to set it lower you could not be able to execute
> correctly in your 'never' tail cases.
>
> There 'might' be smart pants ways around this, where you run part of
> the execution at lower speed and switch to a higher speed to 'catch'
> up if you exceed some boundary, such that, on average, you run at the
> same speed the WCET mandates, but I'm not sure that's worth it.
> Juri/Luca might know.
Some previous works (see for example
https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf
) investigated the usage of the "active utilisation" for switching the
CPU frequency. This "active utilisation tracking" mechanism is the same
I mentioned in the previous email, and implemented here:
https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b .
I suspect the "inactive timer" I used to decrease the utilisation at
the so-called 0-lag time might be problematic, but I did not find any
way to implement (or approximate) the active utilisation tracking
without this timer... Anyway, if there is interest I am willing to
adapt/rework/modify my patches as needed.
Luca
Hi Juri,
Thanks for the review.
On 12/11/2015 03:04 AM, Juri Lelli wrote:
>> +config CPU_FREQ_GOV_SCHED
>> + bool "'sched' cpufreq governor"
>> + depends on CPU_FREQ
>
> We depend on IRQ_WORK as well, which in turn I think depends on SMP. As
> briefly discussed with Peter on IRC, we might want to use
> smp_call_function_single_async() instead to break this dependency
> chain (and be able to use this governor on UP as well).
FWIW I don't see an explicit dependency of IRQ_WORK on SMP
(init/Kconfig), nevertheless I'll take a look at moving to
smp_call_function_single_async() to reduce the dependency list of
sched-freq.
...
>> + /* avoid race with cpufreq_sched_stop */
>> + if (!down_write_trylock(&policy->rwsem))
>> + return;
>> +
>> + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
>> +
>> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
>
> As I think you proposed at Connect, we could use post frequency
> transition notifiers to implement throttling. Is this something that you
> already tried implementing/planning to experiment with?
I started to do this a while back and then decided to hold off. I think
(though I can't recall for sure) it may have been so I could
artificially throttle the rate of frequency change events further by
specifying an inflated frequency change time. That's useful to have as
we experiment with policy.
We probably want both of these mechanisms. Throttling at a minimum based
on transition end notifiers, and the option of throttling further for
policy purposes (at least for now, or as a debug option). Will look at
this again.
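The combination Steve describes can be sketched as two inputs to a single throttle deadline (all names here are illustrative, not the actual sched-freq fields):

```c
#include <assert.h>
#include <stdint.h>

struct gd_throttle {
        uint64_t throttle_until;        /* ns: no new requests before this */
        uint64_t extra_throttle_ns;     /* optional policy/debug delay */
};

/* Called from a frequency-transition-complete (POST_CHANGE) notification:
 * throttle at least until the transition ended, plus any artificially
 * inflated delay configured for policy experiments. */
static void gd_transition_done(struct gd_throttle *gd, uint64_t now_ns)
{
        gd->throttle_until = now_ns + gd->extra_throttle_ns;
}

static int gd_can_request(const struct gd_throttle *gd, uint64_t now_ns)
{
        return now_ns >= gd->throttle_until;
}
```

With extra_throttle_ns == 0 this degrades to pure transition-end throttling; a non-zero value gives the inflated change time Steve mentions for experimenting with policy.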
...
>> +static int cpufreq_sched_thread(void *data)
>> +{
>> + struct sched_param param;
>> + struct cpufreq_policy *policy;
>> + struct gov_data *gd;
>> + unsigned int new_request = 0;
>> + unsigned int last_request = 0;
>> + int ret;
>> +
>> + policy = (struct cpufreq_policy *) data;
>> + gd = policy->governor_data;
>> +
>> + param.sched_priority = 50;
>> + ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, ¶m);
>> + if (ret) {
>> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
>> + do_exit(-EINVAL);
>> + } else {
>> + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
>> + __func__, gd->task->pid);
>> + }
>> +
>> + do {
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + new_request = gd->requested_freq;
>> + if (new_request == last_request) {
>> + schedule();
>> + } else {
>
> Shouldn't we have to do the following here?
>
>
> @@ -125,9 +125,9 @@ static int cpufreq_sched_thread(void *data)
> }
>
> do {
> - set_current_state(TASK_INTERRUPTIBLE);
> new_request = gd->requested_freq;
> if (new_request == last_request) {
> + set_current_state(TASK_INTERRUPTIBLE);
> schedule();
> } else {
> /*
>
> Otherwise we set task to INTERRUPTIBLE state right after it has been
> woken up.
The state must be set to TASK_INTERRUPTIBLE before the data used to
decide whether to sleep or not is read (gd->requested_freq in this case).
If it is set after, then once gd->requested_freq is read but before the
state is set to TASK_INTERRUPTIBLE, the other side may update
gd->requested_freq and issue a wakeup on the freq thread. The wakeup
will have no effect since the freq thread would still be TASK_RUNNING at
that time. The freq thread would proceed to go to sleep and the update
would be lost.
thanks,
Steve
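The ordering requirement Steve explains can be demonstrated with a deterministic, single-threaded toy model (nothing here is kernel code: MARKED stands in for TASK_INTERRUPTIBLE, a wakeup flips the state back to RUNNING, and "schedule()" only really sleeps if the state is still MARKED):

```c
#include <assert.h>

enum kstate { RUNNING, MARKED };

struct model {
        enum kstate st;
        unsigned requested;     /* models gd->requested_freq */
        unsigned last;          /* models last_request */
};

/* Remote side: publish a new request, then wake the thread. The wakeup
 * only has an effect if the thread already marked itself sleeping. */
static void remote_update(struct model *m, unsigned req)
{
        m->requested = req;
        if (m->st == MARKED)
                m->st = RUNNING;
}

/*
 * One loop iteration, with the racing update injected right after the
 * request is sampled. Returns 1 if the thread goes to sleep (i.e. the
 * update was lost), 0 if it loops again and sees the update.
 */
static int iteration(struct model *m, int mark_first, unsigned racing_req)
{
        unsigned sampled;

        if (mark_first)
                m->st = MARKED;         /* set_current_state() before reading */
        sampled = m->requested;
        remote_update(m, racing_req);   /* concurrent writer runs here */
        if (!mark_first)
                m->st = MARKED;         /* buggy: marked after the wakeup */
        if (sampled == m->last)
                return m->st == MARKED; /* schedule(): sleeps only if MARKED */
        return 0;
}
```

With mark_first set, the wakeup lands on a MARKED thread, flips it back to RUNNING, and schedule() falls through so the loop sees the new request; with the reordered version the wakeup is a no-op and the thread sleeps on a stale value, which is exactly the lost update described above.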
Hi Juri,
On 12/11/2015 03:12 AM, Juri Lelli wrote:
>> @@ -2895,6 +2934,8 @@ void scheduler_tick(void)
>> > trigger_load_balance(rq);
>> > #endif
>> > rq_last_tick_reset(rq);
>> > +
>> > + sched_freq_tick(cpu);
> We are not holding rq->lock anymore at this points, and this collides
> with comment in update_cpu_capacity_request(). Can't you just move this
> up before raw_spin_unlock(&rq->lock)?
My thinking in putting it last was to have it after the possible
periodic load balance, so that we don't initiate a frequency change only
to have to modify the frequency again immediately afterwards.
Thinking more about it, the way we currently have the policy defined
there's no concern with having it earlier since sched_freq_tick only
causes the frequency to go to fmax (or do nothing). If we modify the
policy so that sched_freq_tick can cause arbitrary frequency changes
then I think this may need more thought.
thanks,
Steve
On 14 December 2015 at 17:51, Peter Zijlstra <[email protected]> wrote:
> On Mon, Dec 14, 2015 at 04:56:17PM +0100, Vincent Guittot wrote:
>> I agree that if the WCET is far from reality, we will underestimate
>> available capacity for CFS. Have you got some use case in mind which
>> overestimates the WCET ?
>
> Pretty much any 'correct' WCET is pessimistic. There's heaps of smart
> people working on improving WCET bounds, but they're still out there.
> This is mostly because of the .00001% tail cases that 'never' happen but
> would make your tokamak burn a hole just when you're outside.
>
>> If we can't rely on this parameters to evaluate the amount of capacity
>> used by deadline scheduler on a core, this will imply that we can't
>> also use it for requesting capacity to cpufreq and we should fallback
>> on a monitoring mechanism which reacts to a change instead of
>> anticipating it.
>
> No, since the WCET can and _will_ happen, its the best you can do with
> cpufreq. If you were to set it lower you could not be able to execute
> correctly in your 'never' tail cases.
In the context of frequency scaling, this means that we will never
reach low frequencies.
>
> There 'might' be smart pants ways around this, where you run part of the
> execution at lower speed and switch to a higher speed to 'catch' up if
> you exceed some boundary, such that, on average, you run at the same
> speed the WCET mandates, but I'm not sure that's worth it. Juri/Luca
> might know.
On 14 December 2015 at 22:12, Luca Abeni <[email protected]> wrote:
> On Mon, 14 Dec 2015 16:56:17 +0100
> Vincent Guittot <[email protected]> wrote:
> [...]
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> index 08858d1..e44c6be 100644
>> >> --- a/kernel/sched/sched.h
>> >> +++ b/kernel/sched/sched.h
>> >> @@ -519,6 +519,8 @@ struct dl_rq {
>> >> #else
>> >> struct dl_bw dl_bw;
>> >> #endif
>> >> + /* This is the "average utilization" for this runqueue */
>> >> + s64 avg_bw;
>> >> };
>> >
>> > So I don't think this is right. AFAICT this projects the WCET as the
>> > amount of time actually used by DL. This will, under many
>> > circumstances, vastly overestimate the amount of time actually
>> > spend on it. Therefore unduly pessimisme the fair capacity of this
>> > CPU.
>>
>> I agree that if the WCET is far from reality, we will underestimate
>> available capacity for CFS. Have you got some use case in mind which
>> overestimates the WCET ?
>> If we can't rely on this parameters to evaluate the amount of capacity
>> used by deadline scheduler on a core, this will imply that we can't
>> also use it for requesting capacity to cpufreq and we should fallback
>> on a monitoring mechanism which reacts to a change instead of
>> anticipating it.
> I think a more "theoretically sound" approach would be to track the
> _active_ utilisation (informally speaking, the sum of the utilisations
> of the tasks that are actually active on a core - the exact definition
> of "active" is the trick here).
The point is that we probably need 2 definitions of "active" tasks.
The 1st one would be used to scale the frequency. From a power saving
point of view, it has to reflect the minimum frequency needed at the
current time to handle all work without missing deadlines. This one
should be updated quite often, with the wake up and the sleep of tasks
as well as with throttling.
The 2nd definition is used to compute the remaining capacity for the
CFS scheduler. This one doesn't need to be updated at each wake/sleep
of a deadline task but should reflect the capacity used by deadline
tasks on a larger time scale. The latter will be used by the CFS
scheduler at the periodic load balance pace.
> As done, for example, here:
> https://github.com/lucabe72/linux-reclaiming/tree/track-utilisation-v2
> (in particular, see
> https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b
> )
> I understand this approach might look too complex... But I think it is
> much less pessimistic while still being "safe".
> If there is something that I can do to make that code more acceptable,
> let me know.
>
>
> Luca
On 12/15/2015 05:59 AM, Vincent Guittot wrote:
[...]
>>>> So I don't think this is right. AFAICT this projects the WCET as the
>>>> amount of time actually used by DL. This will, under many
>>>> circumstances, vastly overestimate the amount of time actually
>>>> spend on it. Therefore unduly pessimisme the fair capacity of this
>>>> CPU.
>>>
>>> I agree that if the WCET is far from reality, we will underestimate
>>> available capacity for CFS. Have you got some use case in mind which
>>> overestimates the WCET ?
>>> If we can't rely on this parameters to evaluate the amount of capacity
>>> used by deadline scheduler on a core, this will imply that we can't
>>> also use it for requesting capacity to cpufreq and we should fallback
>>> on a monitoring mechanism which reacts to a change instead of
>>> anticipating it.
>> I think a more "theoretically sound" approach would be to track the
>> _active_ utilisation (informally speaking, the sum of the utilisations
>> of the tasks that are actually active on a core - the exact definition
>> of "active" is the trick here).
>
> The point is that we probably need 2 definitions of "active" tasks.
Ok; thanks for clarifying. I do not know much about the remaining capacity
used by CFS; however, from what you write I guess CFS really needs an
"average" utilisation (while frequency scaling needs the active utilisation).
So, I suspect you really need to track 2 different things.
From a quick look at the code that is currently in mainline, it seems to
me that it does a reasonable thing for tracking the remaining capacity
used by CFS...
> The 1st one would be used to scale the frequency. From a power saving
> point of view, it have to reflect the minimum frequency needed at the
> current time to handle all works without missing deadline.
Right. And it can be computed as shown in the GRUB-PA paper I mentioned
in a previous mail (that is, by tracking the active utilisation, as done
by my patches).
> This one
> should be updated quite often with the wake up and the sleep of tasks
> as well as the throttling.
Strictly speaking, the active utilisation must be updated when a task
wakes up and when a task sleeps/terminates (but when a task sleeps/terminates
you cannot decrease the active utilisation immediately: you have to wait
some time because the task might already have used part of its "future
utilisation").
The active utilisation must not be updated when a task is throttled: a
task is throttled when its current runtime is 0, so it already used all
of its utilisation for the current period (think about two tasks with
runtime=50ms and period 100ms: they consume 100% of the time on a CPU,
and when the first task consumed all of its runtime, you cannot decrease
the active utilisation).
> The 2nd definition is used to compute the remaining capacity for the
> CFS scheduler. This one doesn't need to be updated at each wake/sleep
> of a deadline task but should reflect the capacity used by deadline in
> a larger time scale. The latter will be used by the CFS scheduler at
> the periodic load balance pace
Ok, so as I wrote above this really looks like an average utilisation.
My impression (but I do not know the CFS code too much) is that the mainline
kernel is currently doing the right thing to compute it, so maybe there is no
need to change the current code in this regard.
If the current code is not acceptable for some reason, an alternative would
be to measure the active utilisation for frequency scaling, and then apply a
low-pass filter to it for CFS.
Luca
>
>> As done, for example, here:
>> https://github.com/lucabe72/linux-reclaiming/tree/track-utilisation-v2
>> (in particular, see
>> https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b
>> )
>> I understand this approach might look too complex... But I think it is
>> much less pessimistic while still being "safe".
>> If there is something that I can do to make that code more acceptable,
>> let me know.
>>
>>
>> Luca
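Luca's closing suggestion, deriving one signal from the other, can be sketched as a fixed-point EWMA (the 1/8 weight is an arbitrary illustrative choice, not a value from the thread):

```c
#include <assert.h>
#include <stdint.h>

#define EWMA_SHIFT 3    /* each new sample weighted 1/8 */

struct dl_util {
        uint64_t active_bw;     /* instantaneous active utilisation: cpufreq */
        uint64_t filtered_bw;   /* low-pass-filtered copy: CFS capacity */
};

/* Periodically fold the current active utilisation into the filtered
 * value: filtered += (active - filtered) / 8. */
static void dl_util_sample(struct dl_util *u)
{
        int64_t delta = (int64_t)u->active_bw - (int64_t)u->filtered_bw;

        u->filtered_bw = (uint64_t)((int64_t)u->filtered_bw +
                                    delta / (1 << EWMA_SHIFT));
}
```

cpufreq would keep reading active_bw directly (so it still reacts immediately to wake-ups), while scale_rt_capacity-style code would read filtered_bw, smoothing out the per-wake/sleep churn Vincent wants to keep off the load-balance path.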
On 14/12/15 18:02, Steve Muckle wrote:
> Hi Juri,
>
> Thanks for the review.
>
> On 12/11/2015 03:04 AM, Juri Lelli wrote:
> >> +config CPU_FREQ_GOV_SCHED
> >> + bool "'sched' cpufreq governor"
> >> + depends on CPU_FREQ
> >
> > We depend on IRQ_WORK as well, which in turn I think depends on SMP. As
> > briefly discussed with Peter on IRC, we might want to use
> > smp_call_function_single_async() instead to break this dependecies
> > chain (and be able to use this governor on UP as well).
>
> FWIW I don't see an explicit dependency of IRQ_WORK on SMP
Oh, right. I seemed to remember that, but now I couldn't find this
dependency anymore.
> (init/Kconfig), nevertheless I'll take a look at moving to
> smp_call_function_single_async() to reduce the dependency list of
> sched-freq.
>
OK, great. I think there's still value in reducing the dependency list.
> ...
> >> + /* avoid race with cpufreq_sched_stop */
> >> + if (!down_write_trylock(&policy->rwsem))
> >> + return;
> >> +
> >> + __cpufreq_driver_target(policy, freq, CPUFREQ_RELATION_L);
> >> +
> >> + gd->throttle = ktime_add_ns(ktime_get(), gd->throttle_nsec);
> >
> > As I think you proposed at Connect, we could use post frequency
> > transition notifiers to implement throttling. Is this something that you
> > already tried implementing/planning to experiment with?
>
> I started to do this a while back and then decided to hold off. I think
> (though I can't recall for sure) it may have been so I could
> artificially throttle the rate of frequency change events further by
> specifying an inflated frequency change time. That's useful to have as
> we experiment with policy.
>
> We probably want both of these mechanisms. Throttling at a minimum based
> on transition end notifiers, and the option of throttling further for
> policy purposes (at least for now, or as a debug option). Will look at
> this again.
>
Yeah, looks good.
> ...
> >> +static int cpufreq_sched_thread(void *data)
> >> +{
> >> + struct sched_param param;
> >> + struct cpufreq_policy *policy;
> >> + struct gov_data *gd;
> >> + unsigned int new_request = 0;
> >> + unsigned int last_request = 0;
> >> + int ret;
> >> +
> >> + policy = (struct cpufreq_policy *) data;
> >> + gd = policy->governor_data;
> >> +
> >> + param.sched_priority = 50;
> >> + ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, ¶m);
> >> + if (ret) {
> >> + pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> >> + do_exit(-EINVAL);
> >> + } else {
> >> + pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> >> + __func__, gd->task->pid);
> >> + }
> >> +
> >> + do {
> >> + set_current_state(TASK_INTERRUPTIBLE);
> >> + new_request = gd->requested_freq;
> >> + if (new_request == last_request) {
> >> + schedule();
> >> + } else {
> >
> > Shouldn't we have to do the following here?
> >
> >
> > @@ -125,9 +125,9 @@ static int cpufreq_sched_thread(void *data)
> > }
> >
> > do {
> > - set_current_state(TASK_INTERRUPTIBLE);
> > new_request = gd->requested_freq;
> > if (new_request == last_request) {
> > + set_current_state(TASK_INTERRUPTIBLE);
> > schedule();
> > } else {
> > /*
> >
> > Otherwise we set task to INTERRUPTIBLE state right after it has been
> > woken up.
>
> The state must be set to TASK_INTERRUPTIBLE before the data used to
> decide whether to sleep or not is read (gd->requested_freq in this case).
>
> If it is set after, then once gd->requested_freq is read but before the
> state is set to TASK_INTERRUPTIBLE, the other side may update
> gd->requested_freq and issue a wakeup on the freq thread. The wakeup
> will have no effect since the freq thread would still be TASK_RUNNING at
> that time. The freq thread would proceed to go to sleep and the update
> would be lost.
>
Mmm, I suggested that because I was hitting this while testing:
[ 34.816158] ------------[ cut here ]------------
[ 34.816177] WARNING: CPU: 2 PID: 1712 at kernel/kernel/sched/core.c:7617 __might_sleep+0x90/0xa8()
[ 34.816188] do not call blocking ops when !TASK_RUNNING; state=1 set at [<c007c1f8>] cpufreq_sched_thread+0x80/0x2b0
[ 34.816198] Modules linked in:
[ 34.816207] CPU: 2 PID: 1712 Comm: kschedfreq:1 Not tainted 4.4.0-rc2+ #401
[ 34.816212] Hardware name: ARM-Versatile Express
[ 34.816229] [<c0018874>] (unwind_backtrace) from [<c0013f60>] (show_stack+0x20/0x24)
[ 34.816243] [<c0013f60>] (show_stack) from [<c0448c98>] (dump_stack+0x80/0xb4)
[ 34.816257] [<c0448c98>] (dump_stack) from [<c0029930>] (warn_slowpath_common+0x88/0xc0)
[ 34.816267] [<c0029930>] (warn_slowpath_common) from [<c0029a24>] (warn_slowpath_fmt+0x40/0x48)
[ 34.816278] [<c0029a24>] (warn_slowpath_fmt) from [<c0054764>] (__might_sleep+0x90/0xa8)
[ 34.816291] [<c0054764>] (__might_sleep) from [<c0578400>] (cpufreq_freq_transition_begin+0x6c/0x13c)
[ 34.816303] [<c0578400>] (cpufreq_freq_transition_begin) from [<c0578714>] (__cpufreq_driver_target+0x180/0x2c0)
[ 34.816314] [<c0578714>] (__cpufreq_driver_target) from [<c007c14c>] (cpufreq_sched_try_driver_target+0x48/0x74)
[ 34.816324] [<c007c14c>] (cpufreq_sched_try_driver_target) from [<c007c1e8>] (cpufreq_sched_thread+0x70/0x2b0)
[ 34.816336] [<c007c1e8>] (cpufreq_sched_thread) from [<c004ce30>] (kthread+0xf4/0x114)
[ 34.816347] [<c004ce30>] (kthread) from [<c000fdd0>] (ret_from_fork+0x14/0x24)
[ 34.816355] ---[ end trace 30e92db342678467 ]---
Maybe we could cope with what you are saying with an atomic flag
indicating that the kthread is currently servicing a request? Like
extending the finish_last_request thing to cover this case as well.
Best,
- Juri
On Tue, Dec 15, 2015 at 09:50:14AM +0100, Luca Abeni wrote:
> On 12/15/2015 05:59 AM, Vincent Guittot wrote:
> >The 2nd definition is used to compute the remaining capacity for the
> >CFS scheduler. This one doesn't need to be updated at each wake/sleep
> >of a deadline task but should reflect the capacity used by deadline in
> >a larger time scale. The latter will be used by the CFS scheduler at
> >the periodic load balance pace
> Ok, so as I wrote above this really looks like an average utilisation.
> My impression (but I do not know the CFS code too much) is that the mainline
> kernel is currently doing the right thing to compute it, so maybe there is no
> need to change the current code in this regard.
> If the current code is not acceptable for some reason, an alternative would
> be to measure the active utilisation for frequency scaling, and then apply a
> low-pass filter to it for CFS.
So CFS really only needs a 'vague' average idea on how much time it will
not get. It's best effort etc., so being a little wrong isn't a problem.
The current code suffices, but I think the reason it's been changed in
this series is that they want/need separate tracking for fifo/rr and
deadline in the next patch, and taking out deadline like proposed was
the easiest way of achieving that.
On Tue, Dec 15, 2015 at 09:50:14AM +0100, Luca Abeni wrote:
> Strictly speaking, the active utilisation must be updated when a task
> wakes up and when a task sleeps/terminates (but when a task sleeps/terminates
> you cannot decrease the active utilisation immediately: you have to wait
> some time because the task might already have used part of its "future
> utilisation").
> The active utilisation must not be updated when a task is throttled: a
> task is throttled when its current runtime is 0, so it already used all
> of its utilisation for the current period (think about two tasks with
> runtime=50ms and period 100ms: they consume 100% of the time on a CPU,
> and when the first task consumed all of its runtime, you cannot decrease
> the active utilisation).
Hehe, this reminds me of the lag tracking in EEVDF/WF2Q/BFQ etc., that
had similar issues.
On Mon, Dec 14, 2015 at 10:31:13PM +0100, Luca Abeni wrote:
> > There 'might' be smart pants ways around this, where you run part of
> > the execution at lower speed and switch to a higher speed to 'catch'
> > up if you exceed some boundary, such that, on average, you run at the
> > same speed the WCET mandates, but I'm not sure that's worth it.
> > Juri/Luca might know.
> Some previous works (see for example
> https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf
> ) investigated the usage of the "active utilisation" for switching the
> CPU frequency. This "active utilisation tracking" mechanism is the same
> I mentioned in the previous email, and implemented here:
> https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b .
I have stuck the various PDFs and commits you've linked into my todo
list ;-) Thanks!
> I suspect the "inactive timer" I used to decrease the utilisation at
> the so called 0-lag time might be problematic, but I did not find any
> way to implement (or approximate) the active utilisation tracking
> without this timer... Anyway, if there is interest I am willing to
> adapt/rework/modify my patches as needed.
So I remember something else from the BFQ code, which also had to track
entries for the 0-lag stuff, and I just had a quick peek at that code
again. And what they appear to do is keep inactive entries with a lag
deficit in a separate tree (the idle tree).
And every time they update the vtime, they also push fwd the idle tree
and expire entries on that.
Or that is what I can make of it in a quick few minutes staring at that
code -- look for bfq_forget_idle().
On Tue, Dec 15, 2015 at 05:43:44AM +0100, Vincent Guittot wrote:
> On 14 December 2015 at 17:51, Peter Zijlstra <[email protected]> wrote:
> > No, since the WCET can and _will_ happen, its the best you can do with
> > cpufreq. If you were to set it lower you could not be able to execute
> > correctly in your 'never' tail cases.
>
> In the context of frequency scaling, this means that we will never
> reach a low frequency
Only if you've stuffed your machine full of deadline tasks, if you take
Luca's example of the I/B frame decoder thingy, then even the WCET for
the I frames should not be very much (albeit significantly more than B
frames).
So while the WCET is pessimistic compared to the avg case, most CPUs can
do video decoding without much effort at all, so even the WCET for the
I-frames might allow us to drop to the lowest cpufreq.
Now, if you were to decode 10 streams at the same time, different story
of course ;-)
On 15 December 2015 at 09:50, Luca Abeni <[email protected]> wrote:
> On 12/15/2015 05:59 AM, Vincent Guittot wrote:
> [...]
>>>>>
>>>>> So I don't think this is right. AFAICT this projects the WCET as the
>>>>> amount of time actually used by DL. This will, under many
>>>>> circumstances, vastly overestimate the amount of time actually
> >>>>> spent on it. Therefore unduly pessimise the fair capacity of this
>>>>> CPU.
>>>>
>>>>
>>>> I agree that if the WCET is far from reality, we will underestimate
>>>> available capacity for CFS. Have you got some use case in mind which
>>>> overestimates the WCET ?
>>>> If we can't rely on this parameters to evaluate the amount of capacity
>>>> used by deadline scheduler on a core, this will imply that we can't
>>>> also use it for requesting capacity to cpufreq and we should fallback
>>>> on a monitoring mechanism which reacts to a change instead of
>>>> anticipating it.
>>>
>>> I think a more "theoretically sound" approach would be to track the
>>> _active_ utilisation (informally speaking, the sum of the utilisations
>>> of the tasks that are actually active on a core - the exact definition
>>> of "active" is the trick here).
>>
>>
>> The point is that we probably need 2 definitions of "active" tasks.
>
> Ok; thanks for clarifying. I do not know much about the remaining capacity
> used by CFS; however, from what you write I guess CFS really need an
> "average"
> utilisation (while frequency scaling needs the active utilisation).
Yes, this patch is only about the "average" utilization.
> So, I suspect you really need to track 2 different things.
> From a quick look at the code that is currently in mainline, it seems to
> me that it does a reasonable thing for tracking the remaining capacity
> used by CFS...
>
>> The 1st one would be used to scale the frequency. From a power saving
>> point of view, it have to reflect the minimum frequency needed at the
>> current time to handle all works without missing deadline.
>
> Right. And it can be computed as shown in the GRUB-PA paper I mentioned
> in a previous mail (that is, by tracking the active utilisation, as done
> by my patches).
I fully trust you on that part.
>
>> This one
>> should be updated quite often with the wake up and the sleep of tasks
>> as well as the throttling.
>
> Strictly speaking, the active utilisation must be updated when a task
> wakes up and when a task sleeps/terminates (but when a task
> sleeps/terminates
> you cannot decrease the active utilisation immediately: you have to wait
> some time because the task might already have used part of its "future
> utilisation").
> The active utilisation must not be updated when a task is throttled: a
> task is throttled when its current runtime is 0, so it already used all
> of its utilisation for the current period (think about two tasks with
> runtime=50ms and period 100ms: they consume 100% of the time on a CPU,
> and when the first task consumed all of its runtime, you cannot decrease
> the active utilisation).
I haven't read the paper you pointed to in the previous email, but it's
on my todo list. Does GRUB-PA take into account the frequency
transition when selecting the best frequency?
>
>> The 2nd definition is used to compute the remaining capacity for the
>> CFS scheduler. This one doesn't need to be updated at each wake/sleep
>> of a deadline task but should reflect the capacity used by deadline in
>> a larger time scale. The latter will be used by the CFS scheduler at
>> the periodic load balance pace
>
> Ok, so as I wrote above this really looks like an average utilisation.
> My impression (but I do not know the CFS code too much) is that the mainline
> kernel is currently doing the right thing to compute it, so maybe there is
> no
> need to change the current code in this regard.
> If the current code is not acceptable for some reason, an alternative would
> be to measure the active utilisation for frequency scaling, and then apply a
> low-pass filter to it for CFS.
>
>
> Luca
>
>
>>
>>> As done, for example, here:
>>> https://github.com/lucabe72/linux-reclaiming/tree/track-utilisation-v2
>>> (in particular, see
>>>
>>> https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b
>>> )
>>> I understand this approach might look too complex... But I think it is
>>> much less pessimistic while still being "safe".
>>> If there is something that I can do to make that code more acceptable,
>>> let me know.
>>>
>>>
>>> Luca
>
>
On 15 December 2015 at 13:20, Peter Zijlstra <[email protected]> wrote:
> On Tue, Dec 15, 2015 at 09:50:14AM +0100, Luca Abeni wrote:
>> On 12/15/2015 05:59 AM, Vincent Guittot wrote:
>
>> >The 2nd definition is used to compute the remaining capacity for the
>> >CFS scheduler. This one doesn't need to be updated at each wake/sleep
>> >of a deadline task but should reflect the capacity used by deadline in
>> >a larger time scale. The latter will be used by the CFS scheduler at
>> >the periodic load balance pace
>
>> Ok, so as I wrote above this really looks like an average utilisation.
>> My impression (but I do not know the CFS code too much) is that the mainline
>> kernel is currently doing the right thing to compute it, so maybe there is no
>> need to change the current code in this regard.
>> If the current code is not acceptable for some reason, an alternative would
>> be to measure the active utilisation for frequency scaling, and then apply a
>> low-pass filter to it for CFS.
>
> So CFS really only needs a 'vague' average idea on how much time it will
> not get. Its best effort etc., so being a little wrong isn't a problem.
>
> The current code suffices, but I think the reason its been changed in
> this series is that they want/need separate tracking for fifo/rr and
> deadline in the next patch, and taking out deadline like proposed was
> the easiest way of achieving that.
Yes, you're right. The goal was to minimize the overhead of tracking
fifo/rr and deadline separately.
>
>
On 15 December 2015 at 13:41, Peter Zijlstra <[email protected]> wrote:
> On Tue, Dec 15, 2015 at 05:43:44AM +0100, Vincent Guittot wrote:
>> On 14 December 2015 at 17:51, Peter Zijlstra <[email protected]> wrote:
>
>> > No, since the WCET can and _will_ happen, its the best you can do with
>> > cpufreq. If you were to set it lower you could not be able to execute
>> > correctly in your 'never' tail cases.
>>
>> In the context of frequency scaling, this means that we will never
>> reach a low frequency
>
> Only if you've stuffed your machine full of deadline tasks, if you take
> Luca's example of the I/B frame decoder thingy, then even the WCET for
> the I frames should not be very much (albeit significantly more than B
> frames).
But in this case, the impact of the deadline scheduler on the remaining
capacity for CFS should not be that large either. It will not prevent a
CFS task from running; it will only make its share a bit smaller than
the others'.
>
> So while the WCET is pessimistic compared to the avg case, most CPUs can
> do video decoding without much effort at all, so even the WCET for the
> I-frames might allow us to drop to the lowest cpufreq.
>
> Now, if you were to decode 10 streams at the same time, different story
> of course ;-)
On 15 December 2015 at 09:50, Luca Abeni <[email protected]> wrote:
> On 12/15/2015 05:59 AM, Vincent Guittot wrote:
> [...]
>>>>>
>>>>> So I don't think this is right. AFAICT this projects the WCET as the
>>>>> amount of time actually used by DL. This will, under many
>>>>> circumstances, vastly overestimate the amount of time actually
>>>>> spent on it. Therefore unduly pessimise the fair capacity of this
>>>>> CPU.
>>>>
>>>>
[snip]
>> The 2nd definition is used to compute the remaining capacity for the
>> CFS scheduler. This one doesn't need to be updated at each wake/sleep
>> of a deadline task but should reflect the capacity used by deadline in
>> a larger time scale. The latter will be used by the CFS scheduler at
>> the periodic load balance pace
>
> Ok, so as I wrote above this really looks like an average utilisation.
> My impression (but I do not know the CFS code too much) is that the mainline
> kernel is currently doing the right thing to compute it, so maybe there is
> no
> need to change the current code in this regard.
> If the current code is not acceptable for some reason, an alternative would
> be to measure the active utilisation for frequency scaling, and then apply a
> low-pass filter to it for CFS.
In this case, it's probably easier to split what is already done into
an rt_avg metric and a dl_avg metric.
Vincent
>
>
> Luca
>
>
>>
>>> As done, for example, here:
>>> https://github.com/lucabe72/linux-reclaiming/tree/track-utilisation-v2
>>> (in particular, see
>>>
>>> https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b
>>> )
>>> I understand this approach might look too complex... But I think it is
>>> much less pessimistic while still being "safe".
>>> If there is something that I can do to make that code more acceptable,
>>> let me know.
>>>
>>>
>>> Luca
>
>
On 12/15/2015 01:20 PM, Peter Zijlstra wrote:
> On Tue, Dec 15, 2015 at 09:50:14AM +0100, Luca Abeni wrote:
>> On 12/15/2015 05:59 AM, Vincent Guittot wrote:
>
>>> The 2nd definition is used to compute the remaining capacity for the
>>> CFS scheduler. This one doesn't need to be updated at each wake/sleep
>>> of a deadline task but should reflect the capacity used by deadline in
>>> a larger time scale. The latter will be used by the CFS scheduler at
>>> the periodic load balance pace
>
>> Ok, so as I wrote above this really looks like an average utilisation.
>> My impression (but I do not know the CFS code too much) is that the mainline
>> kernel is currently doing the right thing to compute it, so maybe there is no
>> need to change the current code in this regard.
>> If the current code is not acceptable for some reason, an alternative would
>> be to measure the active utilisation for frequency scaling, and then apply a
>> low-pass filter to it for CFS.
>
> So CFS really only needs a 'vague' average idea on how much time it will
> not get. Its best effort etc., so being a little wrong isn't a problem.
>
> The current code suffices, but I think the reason its been changed in
> this series is that they want/need separate tracking for fifo/rr and
> deadline in the next patch, and taking out deadline like proposed was
> the easiest way of achieving that.
Ah, ok. Thanks for explaining.
So, I agree that this patch is not a good idea for estimating the average
utilisation needed by CFS.
Luca
On 12/15/2015 01:23 PM, Peter Zijlstra wrote:
> On Tue, Dec 15, 2015 at 09:50:14AM +0100, Luca Abeni wrote:
>> Strictly speaking, the active utilisation must be updated when a task
>> wakes up and when a task sleeps/terminates (but when a task sleeps/terminates
>> you cannot decrease the active utilisation immediately: you have to wait
>> some time because the task might already have used part of its "future
>> utilisation").
>> The active utilisation must not be updated when a task is throttled: a
>> task is throttled when its current runtime is 0, so it already used all
>> of its utilisation for the current period (think about two tasks with
>> runtime=50ms and period 100ms: they consume 100% of the time on a CPU,
>> and when the first task consumed all of its runtime, you cannot decrease
>> the active utilisation).
>
> Hehe, this reminds me of the lag tracking in EEVDF/WF2Q/BFQ etc., that
> had similar issues.
Yes, I remember EEVDF and similar algorithms also needed to keep track of
the current lag (and to reset it at a later time with respect to the "task is
blocking" event). I do not remember the exact details, but I "borrowed" the
"0-lag time" name from one of those papers :)
Luca
On 12/15/2015 01:38 PM, Peter Zijlstra wrote:
> On Mon, Dec 14, 2015 at 10:31:13PM +0100, Luca Abeni wrote:
>
>>> There 'might' be smart pants ways around this, where you run part of
>>> the execution at lower speed and switch to a higher speed to 'catch'
>>> up if you exceed some boundary, such that, on average, you run at the
>>> same speed the WCET mandates, but I'm not sure that's worth it.
>>> Juri/Luca might know.
>
>> Some previous works (see for example
>> https://www.researchgate.net/profile/Giuseppe_Lipari/publication/220800940_Using_resource_reservation_techniques_for_power-aware_scheduling/links/09e41513639b2703fc000000.pdf
>> ) investigated the usage of the "active utilisation" for switching the
>> CPU frequency. This "active utilisation tracking" mechanism is the same
>> I mentioned in the previous email, and implemented here:
>> https://github.com/lucabe72/linux-reclaiming/commit/49fc786a1c453148625f064fa38ea538470df55b .
>
> I have stuck the various PDFs and commits you've linked into my todo
> list ;-) Thanks!
You are welcome :)
>> I suspect the "inactive timer" I used to decrease the utilisation at
>> the so called 0-lag time might be problematic, but I did not find any
>> way to implement (or approximate) the active utilisation tracking
>> without this timer... Anyway, if there is interest I am willing to
>> adapt/rework/modify my patches as needed.
>
> So I remember something else from the BFQ code, which also had to track
> entries for the 0-lag stuff, and I just had a quick peek at that code
> again. And what they appear to do is keep inactive entries with a lag
> deficit in a separate tree (the idle tree).
>
> And every time they update the vtime, they also push fwd the idle tree
> and expire entries on that.
I am not sure if I understand correctly the idea (I do not know the BFQ
code; I'll have a look), but I think I tried something similar:
- When a task blocks, instead of arming the inactive timer I can insert
the task in an "active non contending" tree (to use GRUB terminology)
- So, when some sched deadline function is invoked, I check the "0-lag
time" of the first task in the "active non contending" tree, and if
that time is passed I remove the task from the tree and adjust the
active utilisation
The resulting code ended up being more complex (basically, I needed to
handle the "active non contending" tree and to check it in task_tick_dl()
and update_curr_dl()). But maybe I did it wrong... I'll try this approach
again, after looking at the BFQ code.
Thanks,
Luca
>
> Or that is what I can make of it in a quick few minutes staring at that
> code -- look for bfq_forget_idle().
>
On 12/15/2015 01:43 PM, Vincent Guittot wrote:
[...]
>>>>> I agree that if the WCET is far from reality, we will underestimate
>>>>> available capacity for CFS. Have you got some use case in mind which
>>>>> overestimates the WCET ?
>>>>> If we can't rely on this parameters to evaluate the amount of capacity
>>>>> used by deadline scheduler on a core, this will imply that we can't
>>>>> also use it for requesting capacity to cpufreq and we should fallback
>>>>> on a monitoring mechanism which reacts to a change instead of
>>>>> anticipating it.
>>>>
>>>> I think a more "theoretically sound" approach would be to track the
>>>> _active_ utilisation (informally speaking, the sum of the utilisations
>>>> of the tasks that are actually active on a core - the exact definition
>>>> of "active" is the trick here).
>>>
>>>
>>> The point is that we probably need 2 definitions of "active" tasks.
>>
>> Ok; thanks for clarifying. I do not know much about the remaining capacity
>> used by CFS; however, from what you write I guess CFS really need an
>> "average"
>> utilisation (while frequency scaling needs the active utilisation).
>
> Yes, this patch is only about the "average" utilization.
Ok; so, I think that the approach of this patch is too pessimistic (it uses
the "worst-case" utilisation as an estimation of the average one).
>>> This one
>>> should be updated quite often with the wake up and the sleep of tasks
>>> as well as the throttling.
>>
>> Strictly speaking, the active utilisation must be updated when a task
>> wakes up and when a task sleeps/terminates (but when a task
>> sleeps/terminates
>> you cannot decrease the active utilisation immediately: you have to wait
>> some time because the task might already have used part of its "future
>> utilisation").
>> The active utilisation must not be updated when a task is throttled: a
>> task is throttled when its current runtime is 0, so it already used all
>> of its utilisation for the current period (think about two tasks with
>> runtime=50ms and period 100ms: they consume 100% of the time on a CPU,
>> and when the first task consumed all of its runtime, you cannot decrease
>> the active utilisation).
>
> I haven't read the paper you pointed in the previous email but it's
> on my todo list. Does the GRUB-PA take into account the frequency
> transition when selecting the best frequency ?
I do not know...
As far as I understand, the GRUB-PA approach is simple: if the active
utilisation of SCHED_DEADLINE tasks is Ua, then the CPU frequency can be
reduced to the maximum possible frequency multiplied by Ua (of course,
this must be adjusted a little bit, because the original GRUB-PA paper
only considered real-time/SCHED_DEADLINE tasks... To leave some CPU time
for other tasks, you have to increase Ua a little bit).
Some time ago one of the authors of the GRUB-PA paper told me that they
evaluated the performance of the algorithm by simulating the behaviour of
a real CPU, but I do not know the details, and I do not know if they took
the frequency transition into account.
Luca
On 12/15/2015 01:58 PM, Vincent Guittot wrote:
> On 15 December 2015 at 09:50, Luca Abeni <[email protected]> wrote:
>> On 12/15/2015 05:59 AM, Vincent Guittot wrote:
>> [...]
>>>>>>
>>>>>> So I don't think this is right. AFAICT this projects the WCET as the
>>>>>> amount of time actually used by DL. This will, under many
>>>>>> circumstances, vastly overestimate the amount of time actually
>>>>>> spent on it. Therefore unduly pessimise the fair capacity of this
>>>>>> CPU.
>>>>>
>>>>>
>
> [snip]
>
>>> The 2nd definition is used to compute the remaining capacity for the
>>> CFS scheduler. This one doesn't need to be updated at each wake/sleep
>>> of a deadline task but should reflect the capacity used by deadline in
>>> a larger time scale. The latter will be used by the CFS scheduler at
>>> the periodic load balance pace
>>
>> Ok, so as I wrote above this really looks like an average utilisation.
>> My impression (but I do not know the CFS code too much) is that the mainline
>> kernel is currently doing the right thing to compute it, so maybe there is
>> no
>> need to change the current code in this regard.
>> If the current code is not acceptable for some reason, an alternative would
>> be to measure the active utilisation for frequency scaling, and then apply a
>> low-pass filter to it for CFS.
>
> In this case, it's probably easier to split what is already done into
> a rt_avg metric and a dl_avg metric
Yes, I think this could be the best approach for what concerns the average
utilisation used by CFS.
Luca
On Tue, Dec 15, 2015 at 02:30:07PM +0100, Luca Abeni wrote:
> >So I remember something else from the BFQ code, which also had to track
> >entries for the 0-lag stuff, and I just had a quick peek at that code
> >again. And what they appear to do is keep inactive entries with a lag
> >deficit in a separate tree (the idle tree).
> >
> >And every time they update the vtime, they also push fwd the idle tree
> >and expire entries on that.
> I am not sure if I understand correctly the idea (I do not know the BFQ
> code; I'll have a look), but I think I tried something similar:
> - When a task blocks, instead of arming the inactive timer I can insert
> the task in an "active non contending" tree (to use GRUB terminology)
> - So, when some sched deadline function is invoked, I check the "0-lag
> time" of the first task in the "active non contending" tree, and if
> that time is passed I remove the task from the tree and adjust the
> active utilisation
>
> The resulting code ended up being more complex (basically, I needed to
> handle the "active non contending" tree and to check it in task_tick_dl()
> and update_curr_dl()). But maybe I did it wrong... I'll try this approach
> again, after looking at the BFQ code.
That sounds about right.
I've no idea if it's more or less work. I just had vague memories of an
alternative approach to the timer.
Feel free to stick with the timer if that works better, just wanted to
mention there are indeed alternative solutions.
On Tue, 15 Dec 2015 14:42:29 +0100
Peter Zijlstra <[email protected]> wrote:
> On Tue, Dec 15, 2015 at 02:30:07PM +0100, Luca Abeni wrote:
>
> > >So I remember something else from the BFQ code, which also had to
> > >track entries for the 0-lag stuff, and I just had a quick peek at
> > >that code again. And what they appear to do is keep inactive
> > >entries with a lag deficit in a separate tree (the idle tree).
> > >
> > >And every time they update the vtime, they also push fwd the idle
> > >tree and expire entries on that.
> > I am not sure if I understand correctly the idea (I do not know the
> > BFQ code; I'll have a look), but I think I tried something similar:
> > - When a task blocks, instead of arming the inactive timer I can
> > insert the task in an "active non contending" tree (to use GRUB
> > terminology)
> > - So, when some sched deadline function is invoked, I check the
> > "0-lag time" of the first task in the "active non contending" tree,
> > and if that time is passed I remove the task from the tree and
> > adjust the active utilisation
> >
> > The resulting code ended up being more complex (basically, I needed
> > to handle the "active non contending" tree and to check it in
> > task_tick_dl() and update_curr_dl()). But maybe I did it wrong...
> > I'll try this approach again, after looking ad the BFQ code.
>
> That sounds about right.
>
> I've no idea if its more or less work. I just had vague memories on an
> alternative approach to the timer.
>
> Feel free to stick with the timer if that works better, just wanted to
> mention there are indeed alternative solutions.
Ok; I'll try to implement this alternative approach again, after
looking at BFQ, to see if it turns out to be simpler or more complex
than the timer-based approach.
If there is interest, I'll send an RFC with these patches after some
testing.
Thanks,
Luca
On 12/15/2015 02:31 AM, Juri Lelli wrote:
>>>> + do {
>>>> > >> + set_current_state(TASK_INTERRUPTIBLE);
>>>> > >> + new_request = gd->requested_freq;
>>>> > >> + if (new_request == last_request) {
>>>> > >> + schedule();
>>>> > >> + } else {
>>> > >
>>> > > Shouldn't we have to do the following here?
>>> > >
>>> > >
>>> > > @@ -125,9 +125,9 @@ static int cpufreq_sched_thread(void *data)
>>> > > }
>>> > >
>>> > > do {
>>> > > - set_current_state(TASK_INTERRUPTIBLE);
>>> > > new_request = gd->requested_freq;
>>> > > if (new_request == last_request) {
>>> > > + set_current_state(TASK_INTERRUPTIBLE);
>>> > > schedule();
>>> > > } else {
>>> > > /*
>>> > >
>>> > > Otherwise we set task to INTERRUPTIBLE state right after it has been
>>> > > woken up.
>> >
>> > The state must be set to TASK_INTERRUPTIBLE before the data used to
>> > decide whether to sleep or not is read (gd->requested_freq in this case).
>> >
>> > If it is set after, then once gd->requested_freq is read but before the
>> > state is set to TASK_INTERRUPTIBLE, the other side may update
>> > gd->requested_freq and issue a wakeup on the freq thread. The wakeup
>> > will have no effect since the freq thread would still be TASK_RUNNING at
>> > that time. The freq thread would proceed to go to sleep and the update
>> > would be lost.
>> >
> Mmm, I suggested that because I was hitting this while testing:
>
> [ 34.816158] ------------[ cut here ]------------
> [ 34.816177] WARNING: CPU: 2 PID: 1712 at kernel/kernel/sched/core.c:7617 __might_sleep+0x90/0xa8()
> [ 34.816188] do not call blocking ops when !TASK_RUNNING; state=1 set at [<c007c1f8>] cpufreq_sched_thread+0x80/0x2b0
> [ 34.816198] Modules linked in:
> [ 34.816207] CPU: 2 PID: 1712 Comm: kschedfreq:1 Not tainted 4.4.0-rc2+ #401
> [ 34.816212] Hardware name: ARM-Versatile Express
> [ 34.816229] [<c0018874>] (unwind_backtrace) from [<c0013f60>] (show_stack+0x20/0x24)
> [ 34.816243] [<c0013f60>] (show_stack) from [<c0448c98>] (dump_stack+0x80/0xb4)
> [ 34.816257] [<c0448c98>] (dump_stack) from [<c0029930>] (warn_slowpath_common+0x88/0xc0)
> [ 34.816267] [<c0029930>] (warn_slowpath_common) from [<c0029a24>] (warn_slowpath_fmt+0x40/0x48)
> [ 34.816278] [<c0029a24>] (warn_slowpath_fmt) from [<c0054764>] (__might_sleep+0x90/0xa8)
> [ 34.816291] [<c0054764>] (__might_sleep) from [<c0578400>] (cpufreq_freq_transition_begin+0x6c/0x13c)
> [ 34.816303] [<c0578400>] (cpufreq_freq_transition_begin) from [<c0578714>] (__cpufreq_driver_target+0x180/0x2c0)
> [ 34.816314] [<c0578714>] (__cpufreq_driver_target) from [<c007c14c>] (cpufreq_sched_try_driver_target+0x48/0x74)
> [ 34.816324] [<c007c14c>] (cpufreq_sched_try_driver_target) from [<c007c1e8>] (cpufreq_sched_thread+0x70/0x2b0)
> [ 34.816336] [<c007c1e8>] (cpufreq_sched_thread) from [<c004ce30>] (kthread+0xf4/0x114)
> [ 34.816347] [<c004ce30>] (kthread) from [<c000fdd0>] (ret_from_fork+0x14/0x24)
> [ 34.816355] ---[ end trace 30e92db342678467 ]---
>
> Maybe we could cope with what you are saying with an atomic flag
> indicating that the kthread is currently servicing a request? Like
> extending the finish_last_request thing to cover this case as well.
Ah. I should be able to just set_current_state(TASK_RUNNING) at the top
of the else clause. Will include this change next time.
thanks,
Steve
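To make the proposed fix concrete, the amended loop would look roughly like this (a sketch only, not the actual respin; the loop terminator is assumed since it was not part of the quoted hunk):

```c
do {
	set_current_state(TASK_INTERRUPTIBLE);
	new_request = gd->requested_freq;
	if (new_request == last_request) {
		schedule();
	} else {
		/*
		 * We are going to do work rather than sleep, so restore
		 * the task state before calling anything that might
		 * block (avoids the might_sleep warning above).
		 */
		set_current_state(TASK_RUNNING);
		if (finish_last_request(gd))
			continue;
		last_request = new_request;
		cpufreq_sched_try_driver_target(policy, new_request);
	}
} while (!kthread_should_stop());
```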
Hi Steve,
On Tue, Dec 08, 2015 at 10:19:24PM -0800, Steve Muckle wrote:
[...]
> +static int cpufreq_sched_thread(void *data)
> +{
> +	struct sched_param param;
> +	struct cpufreq_policy *policy;
> +	struct gov_data *gd;
> +	unsigned int new_request = 0;
> +	unsigned int last_request = 0;
> +	int ret;
> +
> +	policy = (struct cpufreq_policy *) data;
> +	gd = policy->governor_data;
> +
> +	param.sched_priority = 50;
> +	ret = sched_setscheduler_nocheck(gd->task, SCHED_FIFO, &param);
> +	if (ret) {
> +		pr_warn("%s: failed to set SCHED_FIFO\n", __func__);
> +		do_exit(-EINVAL);
> +	} else {
> +		pr_debug("%s: kthread (%d) set to SCHED_FIFO\n",
> +				__func__, gd->task->pid);
> +	}
> +
> +	do {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		new_request = gd->requested_freq;
> +		if (new_request == last_request) {
> +			schedule();
> +		} else {
> +			/*
> +			 * if the frequency thread sleeps while waiting to be
> +			 * unthrottled, start over to check for a newer request
> +			 */
> +			if (finish_last_request(gd))
> +				continue;
> +			last_request = new_request;
> +			cpufreq_sched_try_driver_target(policy, new_request);
> +		}
I also think "set_current_state(TASK_INTERRUPTIBLE)" will introduce a
logic error when the flow runs into the "else" block. The reason is
that after you set the state to TASK_INTERRUPTIBLE, if any scheduling
happens within cpufreq_sched_try_driver_target(), the thread will be
removed from the rq. But generally we assume the thread stays on the
rq and can continue to run after the next tick.
Juri's suggestion can fix this issue. And we could use atomic_t to
access gd->requested_freq safely.
[...]
Thanks,
Leo Yan
Hi Luca,
On 15/12/15 22:24, Luca Abeni wrote:
> On Tue, 15 Dec 2015 14:42:29 +0100
> Peter Zijlstra <[email protected]> wrote:
> > On Tue, Dec 15, 2015 at 02:30:07PM +0100, Luca Abeni wrote:
> >
> > > >So I remember something else from the BFQ code, which also had to
> > > >track entries for the 0-lag stuff, and I just had a quick peek at
> > > >that code again. And what they appear to do is keep inactive
> > > >entries with a lag deficit in a separate tree (the idle tree).
> > > >
> > > >And every time they update the vtime, they also push fwd the idle
> > > >tree and expire entries on that.
> > > I am not sure if I understand correctly the idea (I do not know the
> > > BFQ code; I'll have a look), but I think I tried something similar:
> > > - When a task blocks, instead of arming the inactive timer I can
> > > insert the task in an "active non contending" tree (to use GRUB
> > > terminology)
> > > - So, when some sched deadline function is invoked, I check the
> > > "0-lag time" of the first task in the "active non contending" tree,
> > > and if that time is passed I remove the task from the tree and
> > > adjust the active utilisation
> > >
> > > The resulting code ended up being more complex (basically, I needed
> > > to handle the "active non contending" tree and to check it in
> > > task_tick_dl() and update_curr_dl()). But maybe I did it wrong...
> > > I'll try this approach again, after looking ad the BFQ code.
> >
> > That sounds about right.
> >
> > I've no idea if its more or less work. I just had vague memories on an
> > alternative approach to the timer.
> >
> > Feel free to stick with the timer if that works better, just wanted to
> > mention there are indeed alternative solutions.
> Ok; I'll try to implement this alternative approach again, after
> looking at BFQ, to see if it turns out to be simpler or more complex
> than the timer-based approach.
>
> If there is interest, I'll send an RFC with these patches after some
> testing.
>
I think there's definitely interest, as the next step will be to start
using the new API for freq selection from DL as well.
Thanks a lot for your time and efforts!
Best,
- Juri
Hi Leo,
On 12/15/2015 07:48 PM, Leo Yan wrote:
> I also think "set_current_state(TASK_INTERRUPTIBLE)" will introduce a
> logic error when the flow runs into the "else" block. The reason is
> that after you set the state to TASK_INTERRUPTIBLE, if any scheduling
> happens within cpufreq_sched_try_driver_target(), the thread will be
> removed from the rq. But generally we assume the thread stays on the
> rq and can continue to run after the next tick.
> Juri's suggestion can fix this issue. And we could use atomic_t to
> access gd->requested_freq safely.
I agree, it's incorrect. As I replied earlier I believe setting the task
state back to TASK_RUNNING at the top of the else block is the easiest fix.
thanks,
Steve
Hi Steve,
On Wed, Dec 16, 2015 at 05:24:56PM -0800, Steve Muckle wrote:
> Hi Leo,
>
> On 12/15/2015 07:48 PM, Leo Yan wrote:
> > I also think "set_current_state(TASK_INTERRUPTIBLE)" will introduce a
> > logic error when the flow runs into the "else" block. The reason is
> > that after you set the state to TASK_INTERRUPTIBLE, if any scheduling
> > happens within cpufreq_sched_try_driver_target(), the thread will be
> > removed from the rq. But generally we assume the thread stays on the
> > rq and can continue to run after the next tick.
> > Juri's suggestion can fix this issue. And we could use atomic_t to
> > access gd->requested_freq safely.
>
> I agree, it's incorrect. As I replied earlier I believe setting the task
> state back to TASK_RUNNING at the top of the else block is the easiest fix.
Could you check whether the corner case below introduces a logic error?
The task will still be removed from the rq if a timer tick is triggered
between the two set_current_state() calls.

	set_current_state(TASK_INTERRUPTIBLE);
		  `-------> timer_tick and
			    schedule();
	do_something...
	set_current_state(TASK_RUNNING);

The combination of set_current_state()/schedule() with
wake_up_process() would be safe:

	Thread_A:				Thread_B:

	set_current_state(TASK_INTERRUPTIBLE);
		  `-------> timer_tick and
			    schedule();
						....
						wake_up_process(Thread_A);
		 <---------------------/
	schedule();

The first schedule(), caused by the timer tick, will remove the task
from the rq; the second schedule() will be equivalent to yield().
Thanks,
Leo Yan
Hi Leo,
On 12/16/2015 11:17 PM, Leo Yan wrote:
> Could you check whether the corner case below introduces a logic error?
> The task will still be removed from the rq if a timer tick is triggered
> between the two set_current_state() calls.
>
> 	set_current_state(TASK_INTERRUPTIBLE);
> 		  `-------> timer_tick and
> 			    schedule();
> 	do_something...
> 	set_current_state(TASK_RUNNING);
>
> The combination of set_current_state()/schedule() with
> wake_up_process() would be safe:
>
> 	Thread_A:				Thread_B:
>
> 	set_current_state(TASK_INTERRUPTIBLE);
> 		  `-------> timer_tick and
> 			    schedule();
> 						....
> 						wake_up_process(Thread_A);
> 		 <---------------------/
> 	schedule();
>
> The first schedule(), caused by the timer tick, will remove the task
> from the rq; the second schedule() will be equivalent to yield().
I was initially concerned about preemption while task state =
TASK_INTERRUPTIBLE as well, but a task with state TASK_INTERRUPTIBLE is
not dequeued if it is preempted. See core.c:__schedule():
	if (!preempt && prev->state) {
		if (unlikely(signal_pending_state(prev->state, prev))) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, DEQUEUE_SLEEP);
			prev->on_rq = 0;
I knew this had to be the case, because this design pattern is used in
many other places in the kernel, so many things would be very broken if
this were a problem.
thanks,
Steve
Hi Steve,
On Fri, Dec 18, 2015 at 11:15:01AM -0800, Steve Muckle wrote:
> On 12/16/2015 11:17 PM, Leo Yan wrote:
> > Could you check whether the corner case below introduces a logic error?
> > The task will still be removed from the rq if a timer tick is triggered
> > between the two set_current_state() calls.
> >
> > 	set_current_state(TASK_INTERRUPTIBLE);
> > 		  `-------> timer_tick and
> > 			    schedule();
> > 	do_something...
> > 	set_current_state(TASK_RUNNING);
> >
> > The combination of set_current_state()/schedule() with
> > wake_up_process() would be safe:
> >
> > 	Thread_A:				Thread_B:
> >
> > 	set_current_state(TASK_INTERRUPTIBLE);
> > 		  `-------> timer_tick and
> > 			    schedule();
> > 						....
> > 						wake_up_process(Thread_A);
> > 		 <---------------------/
> > 	schedule();
> >
> > The first schedule(), caused by the timer tick, will remove the task
> > from the rq; the second schedule() will be equivalent to yield().
>
> I was initially concerned about preemption while task state =
> TASK_INTERRUPTIBLE as well, but a task with state TASK_INTERRUPTIBLE is
> not dequeued if it is preempted. See core.c:__schedule():
>
> 	if (!preempt && prev->state) {
> 		if (unlikely(signal_pending_state(prev->state, prev))) {
> 			prev->state = TASK_RUNNING;
> 		} else {
> 			deactivate_task(rq, prev, DEQUEUE_SLEEP);
> 			prev->on_rq = 0;
>
> I knew this had to be the case, because this design pattern is used in
> many other places in the kernel, so many things would be very broken if
> this were a problem.
You are right. I went through the code again: the sched tick IRQ will
call preempt_schedule_irq(), which calls __schedule(true), so the
"preempt" parameter ends up true. Sorry for the noise :p
---8<---
arch/arm64/kernel/entry.S:

#ifdef CONFIG_PREEMPT
el1_preempt:
	mov	x24, lr
1:	bl	preempt_schedule_irq		// irq en/disable is done inside
	ldr	x0, [tsk, #TI_FLAGS]		// get new tasks TI_FLAGS
	tbnz	x0, #TIF_NEED_RESCHED, 1b	// needs rescheduling?
	ret	x24
#endif
Thanks,
Leo Yan