We propose Window-Assisted Load Tracking (WALT) as an alternative or
additional load tracking scheme to PELT, one that in our estimation better
tracks task demand and CPU utilization, especially for use cases on mobile
devices. WALT was conceived by Srivatsa Vaddagiri to provide better
perf-per-watt on asymmetric CPU implementations (asymmetric in frequency
and/or IPC, specifically Qualcomm Snapdragon chipsets running Android), and
its metrics have been used to guide task placement and p-state selection
(load balancing in CFS still uses PELT statistics). WALT is now present in
the android-common kernel as well. This RFC is not an attempt to supplant
PELT statistics; it is intended to start a discussion, perhaps one that
results in changes to PELT that address the requirements listed below.
Background
__________
1) Classifying a task as "heavy" faster helps mobile use cases such as
scrolling a UI or browsing a web page, where tasks exhibit sporadically
heavy load; it allows, for instance, faster migration to a more capable CPU
(assuming the scheduler uses task util and CPU capacity as inputs during
task placement decisions, e.g. EAS). Reclassification of a light task as
heavy is also important - for example, a rendering thread may change its
demand depending on what is being shown on-screen. We believe that quickly
associating a task with its necessary resources requires a more dynamic
demand or utilization signal than PELT provides, which is subject to the
geometric series and does not differentiate between task ramp-up and
ramp-down. E.g., with a 32ms half-life, PELT takes ~74ms to reach 80%
nice-0-load_avg/util_avg for a continuously executing 100% task, and ~139ms
to reach 95% (this and the figures in points 2 and 4 follow from the
geometric series; see the sketch after point 4).
2) Historical task demand (even if a task sleeps for, say, 300ms due to
network delay) must also be retained so that CPU and frequency resources can
be restored quickly enough to meet performance and throughput requirements.
PELT with a 32ms 'half-life' takes only ~213ms to decay utilization/load to
(effectively) zero, forgetting that task's history and requiring the task to
re-execute its workload to build up that load_avg/util_avg again.
3) There is a need to ramp up CPU frequency quickly in response to heavy CPU
utilization, and to reduce frequency whenever possible to save power.
Consider as an example a single 100% thread running continuously on one core
using schedutil with a 10ms transition latency. On an i7 8-core SMP machine
with SMT, PELT takes 60-70ms to ramp from FMIN to FMAX.
4) PELT's blocked util decay implies that the underlying cpufreq governor would
have to delay dropping frequency for longer than necessary (especially for
mobile world usecases). PELT with a 32ms 'half-life' takes ~108ms to reduce
the runqueue utilization/nice-0-load from max to 10%. Blocked load/util tracking
is of course a conservative approach recognizing that load might return to
the runqueue, but one that we feel is inflexible when it comes to certain real
world use cases.
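
For reference, the ramp and decay times quoted in points 1, 2 and 4 above
follow directly from PELT's geometric series with a 32ms half-life. The
standalone sketch below is an illustration only (not kernel code); the
in-kernel implementation accumulates in discrete ~1ms periods at 1024 scale,
so the measured figures differ from these by a millisecond or two:

#include <math.h>
#include <stdio.h>

#define HALFLIFE_MS 32.0

/* time (ms) for util to ramp from 0 towards max for a 100% running task */
static double ramp_ms(double frac)
{
        return HALFLIFE_MS * log2(1.0 / (1.0 - frac));
}

/* time (ms) for util to decay from max down to 'frac' while sleeping */
static double decay_ms(double frac)
{
        return HALFLIFE_MS * log2(1.0 / frac);
}

int main(void)
{
        printf("ramp to 80%%:  ~%.0fms\n", ramp_ms(0.80));  /* ~74ms  */
        printf("ramp to 95%%:  ~%.0fms\n", ramp_ms(0.95));  /* ~138ms */
        printf("decay to 10%%: ~%.0fms\n", decay_ms(0.10)); /* ~106ms */
        printf("decay to ~0%%: ~%.0fms\n", decay_ms(0.01)); /* ~213ms */
        return 0;
}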
WALT vs PELT
____________
WALT retains the per-entity properties of PELT; demand and utilization are
accounted for at the task level, and a CPU rq's utilization contribution is
the sum of its children's contributions. Wallclock time, however, is divided
into a series of windows. To be clear about what exactly WALT tracks, we
define the terms "task demand" and "CPU busy time" here and show, for each
of the four points above, the benefits over PELT:
1) Task Demand - The contribution of a task's running and runnable time to a
window, averaged over N sliding windows. Note that a window has to be
*completed* before its contribution is considered. A task's demand is the
maximum of its contribution to the most recently completed window and its
average demand over the past N windows. This metric can roughly be thought of
as an unweighted load_avg without blocked load (a sketch follows point 4
below).
WALT can take as little as one window of execution to report max task demand,
although with freq and capacity invariance, this tends to be about 3 windows
with a good frequency governor (so that's about 30ms with a 10ms window). This
allows faster and more accurate task placement decisions.
2) WALT "forgets" blocked time entirely, i.e. the sliding windows exist only
when the task is on the runqueue or running. This allows rapid reclassification
of the same task as heavy after a short sleep. Thus a task does not need to
re-execute to gain its demand once more and can be migrated up to a big CPU
immediately.
3) CPU busy time - The sum of execution times of all tasks in the most recently
completed window. For the same machine described above and with schedutil, WALT
with a 10ms window takes around 30-50ms to ramp to FMAX, and with a 20ms window
takes between 60-100ms (note that it is assumed that schedutil has a transition
latency equal to or greater than the window size).
4) WALT "forgets" cpu utilization as soon as tasks are taken off of the
runqueue, and thus the cpufreq governor can choose to drop frequency after just
one window, potentially saving power as well as allowing the governor to choose
a hysteresis setting as necessary based on more accurate utilization data.
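
To make the task demand definition in point 1 concrete, below is a
simplified userspace sketch of the bookkeeping described there. The names
(NUM_WINDOWS, sum_history, update_demand) are assumptions made for this
sketch and simplify what the full out-of-tree implementation does:

#include <stdint.h>

#define NUM_WINDOWS 5                   /* N most recent *completed* windows */

struct task_demand {
        uint64_t sum_history[NUM_WINDOWS];      /* contribution per window */
        uint64_t demand;                        /* reported task demand */
};

/* Called once per completed window with the task's contribution to it. */
static void update_demand(struct task_demand *t, uint64_t window_contrib)
{
        uint64_t avg, sum = 0;
        int i;

        /* shift the history and record the newly completed window */
        for (i = NUM_WINDOWS - 1; i > 0; i--)
                t->sum_history[i] = t->sum_history[i - 1];
        t->sum_history[0] = window_contrib;

        for (i = 0; i < NUM_WINDOWS; i++)
                sum += t->sum_history[i];
        avg = sum / NUM_WINDOWS;

        /* demand = max(most recent completed window, average over N windows) */
        t->demand = window_contrib > avg ? window_contrib : avg;
}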
RFC patch
_________
We realize that WALT has been commercialized along with several other
out-of-tree changes to the Linux scheduler, and evaluating the code on top of
all those changes is not trivial. Therefore, this RFC is an attempt to provide
a simple, limited implementation of WALT - only the CPU utilization/frequency
guidance portion - that can be run on the X86 architecture as well. It provides
the basic building blocks in terms of the windowed view of time and how WALT
hooks into various scheduling points. The WALT CPU utilization metric is fed to
the schedutil governor which then sets the appropriate p-state. The intent
here is to show that we don't harm server or desktop workload performance
while still providing potential power savings.
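
For this limited RFC, converting the previous window's busy time into the
utilization value consumed by schedutil is a straight scaling, mirroring the
cpu_walt_util() helper added in patch 2 of this series. The sketch below
uses stdint types so it stands alone; the in-kernel helper operates on the
rq directly:

#include <stdint.h>

/*
 * util = prev_runnable_sum * cpu_capacity_orig / walt_ravg_window
 *
 * e.g. 5ms of (frequency/capacity invariant) busy time in a completed
 * 10ms window on a CPU of capacity 1024 reports util = 512, i.e. 50%.
 */
static inline unsigned long walt_cpu_util_sketch(uint64_t prev_runnable_sum,
                                                 unsigned long capacity_orig,
                                                 unsigned int window_ns)
{
        return (unsigned long)(prev_runnable_sum * capacity_orig / window_ns);
}

schedutil then applies its own headroom and clamps the result to the
available frequency table, as it does with PELT's util_avg today.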
The full WALT implementation can be browsed through at:
https://android.googlesource.com/kernel/common/+/android-4.4/kernel/sched/walt.c
This RFC patch has been tested on live X86 machines with the following sanity
and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick Bellasi
for initial code reviews):
(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti graphics,
with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
baseline. WALT window size was 10ms. Only deltas above 3% are considered
non-noise. Power measured with Intel RAPL counters)
+--------------------+----------------+----------------+----------+------------+
| | Schedutil+WALT | Schedutil+PELT | Ondemand | Delta |
+--------------------+----------------+----------------+----------+------------+
| Chrome power test | | | | WALT saves |
| 1080p 60fps video | 922 Joules | 967 J | 1011 J | 4.6% |
+--------------------+----------------+----------------+----------+------------+
| Firefox power test | | | | WALT saves |
| 1080p 60fps video | 1079 Joules | 1210 J | 1248 J | 10.82% |
+--------------------+----------------+----------------+----------+------------+
| perf sched bench | | | | |
| messaging | | | | |
| -g 50 -l 5000 | 16.94s | 16.9326s | 16.9318s | No diff |
+--------------------+----------------+----------------+----------+------------+
| phoronix apache | 26177 | | | |
| benchmark | requests/s | 26088 | 26170 | No diff |
+--------------------+----------------+----------------+----------+------------+
| phoronix pgbench | 295 | | | |
| database r/w test | transactions/s | 300 | 302 | No diff |
+--------------------+----------------+----------------+----------+------------+
| phoronix | | | | |
| unigine-valley     | 49fps          | 49fps          | 49fps    | No diff    |
+--------------------+----------------+----------------+----------+------------+
+-----------------+-------+--------+--------+-------------+-----------+--------+
| cpufreq-bench (from tools/power/cpupower/bench/) |
| measures % of work done against the performance governor                    |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| Schedutil + WALT| round | load | sleep | performance | schedutil | % |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 0 | 50000 | 50000 | 50255 | 74574 | 67.389 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 1 | 100000 | 100000 | 99931 | 125724 | 80.367 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 2 | 150000 | 150000 | 150496 | 175244 | 85.878 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 3 | 200000 | 200000 | 200772 | 224903 | 89.270 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 4 | 250000 | 250000 | 250716 | 275955 | 90.854 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | | | | | | |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| Schedutil + PELT| round | load | sleep | performance | schedutil | % |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 0 | 50000 | 50000 | 50055 | 65408 | 76.528 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 1 | 100000 | 100000 | 100494 | 131842 | 76.223 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 2 | 150000 | 150000 | 150423 | 182352 | 82.490 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 3 | 200000 | 200000 | 200605 | 234065 | 85.705 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
| | 4 | 250000 | 250000 | 251724 | 284186 | 88.577 |
+-----------------+-------+--------+--------+-------------+-----------+--------+
ARM data with EAS (Energy Aware Scheduler) and sched-freq
_________________________________________________________
The following is data collected on a Snapdragon 820 device. The WALT
implementation in this kernel is used by the EAS scheduler coupled with the
sched-freq governor and is a complete one with both task demand and cpu
utilization tracking (as opposed to the limited patch in this RFC). Power
is measured at battery. EAS uses cfs_rq->avg.util_avg and se->avg.util_avg
when using PELT and rq->prev_runnable_sum and p->ravg.demand when using
WALT metrics.
The "Jankbench" and power data are real world usecases where WALT provides the
most benefit, while the benchmarks are often used as a metric to compare OEM
devices. WALT uses a 10ms window in all of the following data.
+---------------------------------+---------+----------+----------------------+
| ARM data (ARMv8 based Snapdragon 820) |
| Two clusters with 2 Kryo CPUs each, 1.8Ghz and 2.4GHz FMAX each. |
| Running Android with EAS + WALT and EAS + PELT, with the sched-freq governor|
+---------------------------------+---------+----------+----------------------+
| | WALT | PELT | |
+---------------------------------+---------+----------+----------------------+
| Geekbench 3 Single Core | No difference |
+---------------------------------+---------+----------+----------------------+
| Geekbench 3 Multicore | No difference |
+---------------------------------+---------+----------+----------------------+
| Geekbench 4 Single Core | No difference |
+---------------------------------+---------+----------+----------------------+
| Geekbench 4 Multicore | No difference |
+---------------------------------+---------+----------+----------------------+
| MT Linpack Native | 525 | 483 | WALT is 8.6% better |
+---------------------------------+---------+----------+----------------------+
| MT Linpack Java | 682 | 710 | PELT is 4.1% better |
+---------------------------------+---------+----------+----------------------+
| Threadbench | 200 | 103 | WALT is 94.1% better |
+---------------------------------+---------+----------+----------------------+
| sysbench | 230 | 259 | PELT is 12% better |
+---------------------------------+---------+----------+----------------------+
| Parsec | 287 | 220 | WALT is 30% better |
+---------------------------------+---------+----------+----------------------+
| IPC (Android Binder) | 158 | 167 | PELT is 5.6% better |
+---------------------------------+---------+----------+----------------------+
| | | | |
+---------------------------------+---------+----------+----------------------+
| Jankbench - Xth frame percentile indicates X % of frames rendered in Y ms |
| or less. Lower is better. |
| This test scrolls text and graphics on a mobile screen in Android. |
| (list view, imagelist view subtests) |
+---------------------------------+---------+----------+----------------------+
| | WALT | PELT | |
+---------------------------------+---------+----------+----------------------+
| 90th percentile | 6.32ms | 6.50ms | WALT is 3% better |
+---------------------------------+---------+----------+----------------------+
| 95th percentile                 | 7.20ms  | 7.50ms   | WALT is 4% better    |
+---------------------------------+---------+----------+----------------------+
| 99th percentile | 8.50ms | 8.53ms | No diff |
+---------------------------------+---------+----------+----------------------+
| Power (avg, inst. current) | 115.1mA | 116.8mA | No difference |
+---------------------------------+---------+----------+----------------------+
| | | | |
+---------------------------------+---------+----------+----------------------+
| Jankbench - Xth frame percentile indicates X % of frames rendered in Y ms |
| or less. Lower is better. |
| This test scrolls text and graphics on a mobile screen in Android. |
| (high bit-rate text render, edit text input, overdraw subtests) |
+---------------------------------+---------+----------+----------------------+
| 90th percentile | 17.45ms | 18.23ms | WALT is 4.2% better |
+---------------------------------+---------+----------+----------------------+
| 95th percentile | 19.20ms | 19.68ms | No diff |
+---------------------------------+---------+----------+----------------------+
| 99th percentile | 38.4ms | 41.44ms | WALT is 7.33% better |
+---------------------------------+---------+----------+----------------------+
| Power (avg, inst. current) | 378.0mA | 380.0mA | No difference |
+---------------------------------+---------+----------+----------------------+
| | | | |
+---------------------------------+---------+----------+----------------------+
| Video 1080p (big buck bunny) 60 seconds of playback, sampling rate 1000/s |
+---------------------------------+---------+----------+----------------------+
| Avg Instantaneous Current | 103mA | 114mA | WALT is 9.6% better |
+---------------------------------+---------+----------+----------------------+
| | | | |
+---------------------------------+---------+----------+----------------------+
| Video 2160p (big buck bunny) 60 seconds of playback, sampling rate 1000/s |
+---------------------------------+---------+----------+----------------------+
| Avg Instantaneous Current | 160mA | 164.5mA | WALT is 3% better |
+---------------------------------+---------+----------+----------------------+
| | | | |
+---------------------------------+---------+----------+----------------------+
This patch series is based on v4.8.2
Srivatsa Vaddagiri (3):
sched: Introduce structures necessary for WALT
sched: Introduce Window-Assisted CPU utilization Tracking
sched: Introduce WALT hooks into core and scheduling classes
include/linux/sched.h | 35 ++++++
include/linux/sched/sysctl.h | 1 +
init/Kconfig | 9 ++
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 29 ++++-
kernel/sched/cputime.c | 10 +-
kernel/sched/deadline.c | 7 ++
kernel/sched/debug.c | 10 ++
kernel/sched/fair.c | 11 +-
kernel/sched/rt.c | 6 +
kernel/sched/sched.h | 12 ++
kernel/sched/walt.c | 433 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/walt.h | 69 ++++++++++
kernel/sysctl.c | 11 ++
14 files changed, 638 insertions(+), 6 deletions(-)
Some references:
Steve Muckle's 2014 presentation on Qualcomm's HMP scheduler (with WALT):
http://www.slideshare.net/linaroorg/lcu14-406-a-quick-take-on-energyaware-scheduling
https://www.youtube.com/watch?v=2xb0vOV-E6E
2016 presentation of Qualcomm data on WALT (was called WinLT then!):
http://www.slideshare.net/linaroorg/bkk16208-eas
https://www.youtube.com/watch?v=g4srUao5ahg
Qualcomm data on WALT vs PELT CPU utilization on X86:
https://www.youtube.com/watch?v=hc0aufpfWvY
Steve Muckle's presentation including WALT+schedutil data on a HiKey Board:
http://www.slideshare.net/linaroorg/las16307-benchmarking-schedutil-in-android
https://www.youtube.com/watch?v=Wxu5ApDRYH4
The WALT wiki page
https://wiki.codeaurora.org/xwiki/bin/QKERNEL/Window+Assisted+Load+Tracking/
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
From: Srivatsa Vaddagiri <[email protected]>
Add the per-task and per-runqueue data structures that
will later be used by Window Assisted Load Tracking (WALT)
to estimate task demand and CPU utilization.
Move cap_scale into sched.h as that will be needed by WALT
as well to implement frequency and capacity invariance.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: Vikram Mulukutla <[email protected]>
---
include/linux/sched.h | 39 +++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 2 --
kernel/sched/sched.h | 8 ++++++++
3 files changed, 47 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..64f8bec 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -315,6 +315,21 @@ extern char ___assert_task_state[1 - 2*!!(
/* Task command name length */
#define TASK_COMM_LEN 16
+/*
+ * These events may be replaced with a combination of existing scheduler flags
+ * provided that that doesn't make the implementation too fragile.
+ */
+enum task_event {
+ PUT_PREV_TASK = 0,
+ PICK_NEXT_TASK = 1,
+ TASK_WAKE = 2,
+ TASK_MIGRATE = 3,
+ TASK_UPDATE = 4,
+ IRQ_UPDATE = 5,
+};
+
+extern char *task_event_names[];
+
#include <linux/spinlock.h>
/*
@@ -1320,6 +1335,25 @@ struct sched_statistics {
};
#endif
+#ifdef CONFIG_SCHED_WALT
+
+/* ravg represents capacity scaled cpu-usage of tasks */
+struct ravg {
+ /*
+ * 'mark_start' marks the most recent event for a task
+ *
+ * 'curr_window' represents task's cpu usage in its most recent
+ * window
+ *
+ * 'prev_window' represents task's cpu usage in the window prior
+ * to the one represented by 'curr_window'
+ */
+ u64 mark_start;
+ u32 curr_window, prev_window;
+};
+#endif
+
+
struct sched_entity {
struct load_weight load; /* for load-balancing */
struct rb_node run_node;
@@ -1480,6 +1514,11 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
+
+#ifdef CONFIG_SCHED_WALT
+ struct ravg ravg;
+#endif
+
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce8b244..39c826d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2674,8 +2674,6 @@ static u32 __compute_runnable_contrib(u64 n)
return contrib + runnable_avg_yN_sum[n];
}
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
/*
* We can represent the historical contribution to runnable average as the
* coefficients of a geometric series. To do this we sub-divide our runnable
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c64fc51..9bf6925 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -65,6 +65,8 @@ static inline void cpu_load_update_active(struct rq *this_rq) { }
# define scale_load_down(w) (w)
#endif
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
/*
* Task weight (visible to users) and its load (invisible to users) have
* independent resolution, but they should be well calibrated. We use
@@ -664,6 +666,12 @@ struct rq {
u64 max_idle_balance_cost;
#endif
+#ifdef CONFIG_SCHED_WALT
+ u64 window_start;
+ u64 curr_runnable_sum;
+ u64 prev_runnable_sum;
+#endif /* CONFIG_SCHED_WALT */
+
#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
#endif
--
TheMan
From: Srivatsa Vaddagiri <[email protected]>
This patch implements an alternative window-based CPU utilization
tracking mechanism in the scheduler. Per task and per CPU counters are
updated with utilization statistics using a synchronized (across CPUs)
time source and a single statistic (prev_runnable_sum) is fed to schedutil
in an upcoming patch. A windowed view of time (window size determined by
walt_ravg_window) is used to determine CPU utilization.
There are two per-CPU-rq quantities maintained by Window Assisted Load Tracking
(WALT), both normalized to the max possible frequency and the max efficiency
(IPC) of that CPU (provided the arch supports the invariance APIs):
curr_runnable_sum: aggregate utilization of all tasks that executed
during the current (not yet completed) window
prev_runnable_sum: aggregate utilization of all tasks that executed
during the most recent completed window
prev_runnable_sum is the primary statistic used to guide CPU frequency in
lieu of PELT's cfs_rq->avg.util_avg. No additional policy is imposed on this
statistic, the assumption being that the consumer (e.g., schedutil) will
perform appropriate policy decisions (e.g., headroom) before deciding the
next P-state.
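
As an illustration only (nothing in this series imposes it): the kind of
headroom a consumer applies is roughly schedutil's "util + util/4" margin
before mapping utilization to a frequency, along the lines of this sketch:

/*
 * Illustrative consumer-side policy, not part of this patch: apply a
 * 25% margin to the raw utilization and scale the maximum frequency
 * by it. The governor would then clamp to the cpufreq table.
 */
static unsigned long pick_next_freq(unsigned long util, unsigned long max_cap,
                                    unsigned long max_freq)
{
        return (max_freq + (max_freq >> 2)) * util / max_cap;
}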
Corresponding to the aggregate statistics, WALT also maintains the
following stats per task:
curr_window - represents the cpu utilization of the task in its most
recently tracked window
prev_window - represents cpu utilization of task in the window prior
to the one being tracked by curr_window
WALT statistic updates are event driven, with updates occurring in
scheduler_tick, pick_next_task and put_prev_task (i.e., in context_switch),
on task wakeup and during task migration. Migration simply involves removing
a task's curr_window and prev_window from the source CPU's curr_runnable_sum
and prev_runnable_sum, and adding the per-task counters to the destination
CPU's aggregate counters. Execution time in an IRQ handler is accounted in a
CPU's curr_runnable_sum statistic, provided that the CPU was executing the
idle task for the duration of the interrupt handler.
Idle task handling is modified by walt_io_is_busy; when set to 1, if a CPU
rq has tasks blocked on IO, idle-task execution is accounted in per-task
and per-CPU counters. Setting walt_io_is_busy will also cause interrupt
handlers in the idle task to update counters as if the idle task was
executing (instead of just the interrupt handler execution time).
The major tunable provided by WALT is walt_ravg_window, which represents the
window size (in nanoseconds) and defaults to 20ms. walt_io_is_busy
(described above) is set to 0 by default.
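
For example, to select at boot the 10ms window used for the X86 results in
the cover letter, the early_param added by this patch can be set on the
kernel command line (values outside the supported 5ms-1s range simply
disable WALT):

    walt_ravg_window=10000000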
Potential upcoming changes/improvements include: the use of sched_clock
instead of ktime_get as a time source and support for an unsynchronized
(across CPUs) time source. The event codes (PUT_PREV_TASK etc.) may also
potentially be replaced by combinations of existing flags, provided that
doesn't make the implementation too fragile.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: Vikram Mulukutla <[email protected]>
---
include/linux/sched/sysctl.h | 1 +
init/Kconfig | 9 +
kernel/sched/Makefile | 1 +
kernel/sched/cputime.c | 10 +-
kernel/sched/walt.c | 428 +++++++++++++++++++++++++++++++++++++++++++
kernel/sched/walt.h | 69 +++++++
kernel/sysctl.c | 11 ++
7 files changed, 528 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/walt.c
create mode 100644 kernel/sched/walt.h
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 22db1e6..7007815 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -31,6 +31,7 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_sched_use_walt_metrics;
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
diff --git a/init/Kconfig b/init/Kconfig
index cac3f09..1e629ac 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -390,6 +390,15 @@ config IRQ_TIME_ACCOUNTING
If in doubt, say N here.
+config SCHED_WALT
+ bool "Support window based load tracking"
+ depends on SMP
+ help
+ This feature will allow the scheduler to maintain a tunable window
+ based set of metrics for tasks and runqueues. These metrics can be
+ used to guide task placement as well as task frequency requirements
+ for cpufreq governors.
+
config BSD_PROCESS_ACCT
bool "BSD Process Accounting"
depends on MULTIUSER
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 5e59b83..41ada04 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,6 +19,7 @@ obj-y += core.o loadavg.o clock.o cputime.o
obj-y += idle_task.o fair.o rt.o deadline.o stop_task.o
obj-y += wait.o swait.o completion.o idle.o
obj-$(CONFIG_SMP) += cpupri.o cpudeadline.o
+obj-$(CONFIG_SCHED_WALT) += walt.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index a846cf8..704a6f6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -51,12 +51,15 @@ void irqtime_account_irq(struct task_struct *curr)
{
s64 delta;
int cpu;
+ u64 wallclock;
+ bool account = true;
if (!sched_clock_irqtime)
return;
cpu = smp_processor_id();
- delta = sched_clock_cpu(cpu) - __this_cpu_read(irq_start_time);
+ wallclock = sched_clock_cpu(cpu);
+ delta = wallclock - __this_cpu_read(irq_start_time);
__this_cpu_add(irq_start_time, delta);
irq_time_write_begin();
@@ -70,6 +73,11 @@ void irqtime_account_irq(struct task_struct *curr)
__this_cpu_add(cpu_hardirq_time, delta);
else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
__this_cpu_add(cpu_softirq_time, delta);
+ else
+ account = false;
+
+ if (account && is_idle_task(curr))
+ walt_account_irqtime(cpu, curr, delta, wallclock);
irq_time_write_end();
}
diff --git a/kernel/sched/walt.c b/kernel/sched/walt.c
new file mode 100644
index 0000000..ed9fde1
--- /dev/null
+++ b/kernel/sched/walt.c
@@ -0,0 +1,428 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ *
+ * Window Assisted Load Tracking (WALT) implementation credits:
+ * Srivatsa Vaddagiri, Steve Muckle, Syed Rameez Mustafa, Joonwoo Park,
+ * Pavan Kumar Kondeti, Olav Haugan
+ *
+ * 2016-03-06: Integration with EAS/refactoring by Vikram Mulukutla
+ * and Todd Kjos
+ * 2016-08-31: Integration with mainline by Srivatsa Vaddagiri
+ * and Vikram Mulukutla
+ */
+
+#include <linux/syscore_ops.h>
+#include <linux/cpufreq.h>
+#include "sched.h"
+#include "walt.h"
+
+
+char *task_event_names[] = {"PUT_PREV_TASK", "PICK_NEXT_TASK",
+ "TASK_WAKE", "TASK_MIGRATE", "TASK_UPDATE",
+ "IRQ_UPDATE"};
+
+__read_mostly unsigned int sysctl_sched_use_walt_metrics = 1;
+
+static __read_mostly unsigned int walt_freq_account_wait_time;
+static __read_mostly unsigned int walt_io_is_busy;
+
+/* 1 -> use PELT based load stats, 0 -> use window-based load stats */
+static unsigned int __read_mostly walt_disabled;
+
+/* Window size (in ns) */
+__read_mostly unsigned int walt_ravg_window = 20000000;
+
+/* Min window size (in ns) = 5ms */
+#define MIN_SCHED_RAVG_WINDOW 5000000
+
+/* Max window size (in ns) = 1s */
+#define MAX_SCHED_RAVG_WINDOW 1000000000
+
+static unsigned int sync_cpu;
+static ktime_t ktime_last;
+static bool walt_ktime_suspended;
+
+u64 walt_ktime_clock(void)
+{
+ if (unlikely(walt_ktime_suspended))
+ return ktime_to_ns(ktime_last);
+ return ktime_get_ns();
+}
+
+static void walt_resume(void)
+{
+ walt_ktime_suspended = false;
+}
+
+static int walt_suspend(void)
+{
+ ktime_last = ktime_get();
+ walt_ktime_suspended = true;
+ return 0;
+}
+
+static struct syscore_ops walt_syscore_ops = {
+ .resume = walt_resume,
+ .suspend = walt_suspend
+};
+
+static int __init walt_init_ops(void)
+{
+ register_syscore_ops(&walt_syscore_ops);
+ return 0;
+}
+late_initcall(walt_init_ops);
+
+static int __init set_walt_ravg_window(char *str)
+{
+ get_option(&str, &walt_ravg_window);
+
+ walt_disabled = (walt_ravg_window < MIN_SCHED_RAVG_WINDOW ||
+ walt_ravg_window > MAX_SCHED_RAVG_WINDOW);
+ return 0;
+}
+
+early_param("walt_ravg_window", set_walt_ravg_window);
+
+static void
+update_window_start(struct rq *rq, u64 wallclock)
+{
+ s64 delta;
+ int nr_windows;
+ u64 prev_sum = 0;
+
+ delta = wallclock - rq->window_start;
+ BUG_ON(delta < 0);
+ if (delta < walt_ravg_window)
+ return;
+
+ nr_windows = div64_u64(delta, walt_ravg_window);
+ if (nr_windows == 1)
+ prev_sum = rq->curr_runnable_sum;
+
+ rq->prev_runnable_sum = prev_sum;
+ rq->curr_runnable_sum = 0;
+
+ rq->window_start += (u64)nr_windows * (u64)walt_ravg_window;
+}
+
+static u64 scale_exec_time(u64 delta, struct rq *rq)
+{
+ unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu_of(rq));
+ unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu_of(rq));
+ u64 scaled_delta = cap_scale(delta, scale_freq);
+
+ return cap_scale(scaled_delta, scale_cpu);
+}
+
+static int cpu_is_waiting_on_io(struct rq *rq)
+{
+ if (!walt_io_is_busy)
+ return 0;
+
+ return atomic_read(&rq->nr_iowait);
+}
+
+static int account_cpu_busy_time(struct rq *rq, struct task_struct *p,
+ u64 irqtime, int event)
+{
+ if (is_idle_task(p)) {
+ /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
+ if (event == PICK_NEXT_TASK)
+ return 0;
+
+ /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
+ return irqtime || cpu_is_waiting_on_io(rq);
+ }
+
+ if (event == TASK_WAKE)
+ return 0;
+
+ if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
+ event == TASK_UPDATE)
+ return 1;
+
+ /* Only TASK_MIGRATE && PICK_NEXT_TASK left */
+ return walt_freq_account_wait_time;
+}
+
+/*
+ * Account cpu activity in its busy time counters (rq->curr/prev_runnable_sum)
+ */
+static void update_cpu_busy_time(struct task_struct *p, struct rq *rq,
+ int event, u64 wallclock, u64 irqtime)
+{
+ int new_window, nr_full_windows = 0;
+ u64 mark_start = p->ravg.mark_start;
+ u64 window_start = rq->window_start;
+ u32 window_size = walt_ravg_window;
+ u64 delta;
+
+ new_window = mark_start < window_start;
+ if (new_window)
+ nr_full_windows = div64_u64((window_start - mark_start),
+ window_size);
+
+ /* Handle window rollover */
+ if (new_window) {
+ if (!is_idle_task(p)) {
+ u32 curr_window = 0;
+
+ if (!nr_full_windows)
+ curr_window = p->ravg.curr_window;
+
+ p->ravg.prev_window = curr_window;
+ p->ravg.curr_window = 0;
+ }
+ }
+
+ if (!account_cpu_busy_time(rq, p, irqtime, event))
+ return;
+
+ if (!new_window) {
+ if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq))
+ delta = wallclock - mark_start;
+ else
+ delta = irqtime;
+ delta = scale_exec_time(delta, rq);
+ rq->curr_runnable_sum += delta;
+ if (!is_idle_task(p))
+ p->ravg.curr_window += delta;
+
+ return;
+ }
+
+ if (!irqtime || !is_idle_task(p) || cpu_is_waiting_on_io(rq)) {
+ if (!nr_full_windows) {
+ /*
+ * A full window hasn't elapsed, account partial
+ * contribution to previous completed window.
+ */
+ delta = scale_exec_time(window_start - mark_start, rq);
+ p->ravg.prev_window += delta;
+ } else {
+ /*
+ * Since at least one full window has elapsed,
+ * the contribution to the previous window is the
+ * full window (window_size).
+ */
+ delta = scale_exec_time(window_size, rq);
+ p->ravg.prev_window = delta;
+ }
+ rq->prev_runnable_sum += delta;
+
+ /* Account piece of busy time in the current window. */
+ delta = scale_exec_time(wallclock - window_start, rq);
+ rq->curr_runnable_sum += delta;
+ p->ravg.curr_window = delta;
+
+ return;
+ }
+
+ if (irqtime) {
+ /* IRQ busy time start = wallclock - irqtime */
+ mark_start = wallclock - irqtime;
+
+ if (mark_start > window_start) {
+ rq->curr_runnable_sum += scale_exec_time(irqtime, rq);
+ return;
+ }
+
+ /*
+ * IRQ busy time spanned multiple windows. Process the
+ * busy time preceding the current window first
+ */
+ delta = window_start - mark_start;
+ if (delta > window_size)
+ delta = window_size;
+ delta = scale_exec_time(delta, rq);
+ rq->prev_runnable_sum += delta;
+
+ /* Process the remaining IRQ busy time in the current window. */
+ delta = wallclock - window_start;
+ rq->curr_runnable_sum += scale_exec_time(delta, rq);
+
+ return;
+ }
+
+ BUG();
+}
+
+/* Reflect task activity on its demand and cpu's busy time statistics */
+void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+ enum task_event event, u64 wallclock, u64 irqtime)
+{
+ if (walt_disabled || !rq->window_start)
+ return;
+
+ lockdep_assert_held(&rq->lock);
+
+ update_window_start(rq, wallclock);
+
+ if (!p->ravg.mark_start)
+ goto done;
+
+ update_cpu_busy_time(p, rq, event, wallclock, irqtime);
+
+done:
+ p->ravg.mark_start = wallclock;
+}
+
+void walt_mark_task_starting(struct task_struct *p)
+{
+ u64 wallclock;
+ struct rq *rq = task_rq(p);
+
+ if (!rq->window_start)
+ return;
+
+ wallclock = walt_ktime_clock();
+ p->ravg.mark_start = wallclock;
+}
+
+void walt_set_window_start(struct rq *rq)
+{
+ int cpu = cpu_of(rq);
+ struct rq *sync_rq = cpu_rq(sync_cpu);
+ unsigned long flags;
+
+ if (rq->window_start || walt_ktime_clock() < walt_ravg_window)
+ return;
+
+ if (cpu == sync_cpu) {
+ raw_spin_lock_irqsave(&rq->lock, flags);
+ rq->window_start = walt_ktime_clock();
+ rq->curr_runnable_sum = rq->prev_runnable_sum = 0;
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+ } else {
+ local_irq_save(flags);
+ double_rq_lock(rq, sync_rq);
+ rq->window_start = cpu_rq(sync_cpu)->window_start;
+ rq->curr_runnable_sum = rq->prev_runnable_sum = 0;
+ double_rq_unlock(rq, sync_rq);
+ local_irq_restore(flags);
+ }
+}
+
+void walt_migrate_sync_cpu(int cpu)
+{
+ if (cpu == sync_cpu)
+ sync_cpu = smp_processor_id();
+}
+
+void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, bool locked)
+{
+ u64 wallclock;
+ unsigned long flags;
+
+ if (!p->on_rq && p->state != TASK_WAKING)
+ return;
+
+ if (locked == false)
+ raw_spin_lock_irqsave(&dest_rq->lock, flags);
+
+ lockdep_assert_held(&dest_rq->lock);
+
+ wallclock = walt_ktime_clock();
+
+ /* Update counters on destination CPU */
+ walt_update_task_ravg(dest_rq->curr, dest_rq,
+ TASK_UPDATE, wallclock, 0);
+
+ /* We may be in a new window. Update task counters */
+ walt_update_task_ravg(p, dest_rq, TASK_MIGRATE, wallclock, 0);
+
+ if (p->ravg.curr_window) {
+ if (!dest_rq->window_start) {
+ p->ravg.curr_window = 0;
+ p->ravg.mark_start = 0;
+ }
+ dest_rq->curr_runnable_sum += p->ravg.curr_window;
+ }
+
+ if (p->ravg.prev_window) {
+ if (!dest_rq->window_start)
+ p->ravg.prev_window = 0;
+ dest_rq->prev_runnable_sum += p->ravg.prev_window;
+ }
+
+ if (locked == false)
+ raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
+}
+
+void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, bool locked)
+{
+ u64 wallclock;
+ unsigned long flags;
+
+ if (!p->on_rq && p->state != TASK_WAKING)
+ return;
+
+ if (locked == false)
+ raw_spin_lock_irqsave(&src_rq->lock, flags);
+
+ lockdep_assert_held(&src_rq->lock);
+
+ /* Note that same wallclock reference is used for all 3 events below */
+ wallclock = walt_ktime_clock();
+
+ /* Update counters on source CPU */
+ walt_update_task_ravg(task_rq(p)->curr, task_rq(p),
+ TASK_UPDATE, wallclock, 0);
+
+ /* Update task's counters */
+ walt_update_task_ravg(p, task_rq(p), TASK_MIGRATE, wallclock, 0);
+
+ /* Fixup busy time */
+ if (p->ravg.curr_window)
+ src_rq->curr_runnable_sum -= p->ravg.curr_window;
+
+ if (p->ravg.prev_window)
+ src_rq->prev_runnable_sum -= p->ravg.prev_window;
+
+ if ((s64)src_rq->prev_runnable_sum < 0) {
+ src_rq->prev_runnable_sum = 0;
+ WARN_ON(1);
+ }
+ if ((s64)src_rq->curr_runnable_sum < 0) {
+ src_rq->curr_runnable_sum = 0;
+ WARN_ON(1);
+ }
+
+ if (locked == false)
+ raw_spin_unlock_irqrestore(&src_rq->lock, flags);
+}
+
+void walt_account_irqtime(int cpu, struct task_struct *curr,
+ u64 delta, u64 wallclock)
+{
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&rq->lock, flags);
+
+ /*
+ * cputime (wallclock) uses sched_clock so use the same here for
+ * consistency.
+ */
+ delta += sched_clock_cpu(cpu) - wallclock;
+
+ walt_update_task_ravg(curr, rq, IRQ_UPDATE, walt_ktime_clock(), delta);
+
+ raw_spin_unlock_irqrestore(&rq->lock, flags);
+}
+
+void walt_init_new_task_load(struct task_struct *p)
+{
+ memset(&p->ravg, 0, sizeof(struct ravg));
+}
diff --git a/kernel/sched/walt.h b/kernel/sched/walt.h
new file mode 100644
index 0000000..6dad2e1
--- /dev/null
+++ b/kernel/sched/walt.h
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2016, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef __WALT_H
+#define __WALT_H
+
+#ifdef CONFIG_SCHED_WALT
+
+void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+ enum task_event event, u64 wallclock, u64 irqtime);
+void walt_prepare_migrate(struct task_struct *p, struct rq *rq, bool locked);
+void walt_finish_migrate(struct task_struct *p, struct rq *rq, bool locked);
+void walt_init_new_task_load(struct task_struct *p);
+void walt_mark_task_starting(struct task_struct *p);
+void walt_set_window_start(struct rq *rq);
+void walt_migrate_sync_cpu(int cpu);
+u64 walt_ktime_clock(void);
+void walt_account_irqtime(int cpu, struct task_struct *curr, u64 delta,
+ u64 wallclock);
+
+extern unsigned int sysctl_sched_use_walt_metrics;
+extern unsigned int walt_ravg_window;
+
+/* Fold into cpu_util */
+static inline unsigned long cpu_walt_util(struct rq *rq)
+{
+ if (!sysctl_sched_use_walt_metrics)
+ return rq->cfs.avg.util_avg;
+
+ return (rq->prev_runnable_sum * rq->cpu_capacity_orig) /
+ walt_ravg_window;
+}
+
+#else /* CONFIG_SCHED_WALT */
+
+static inline void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
+ int event, u64 wallclock, u64 irqtime) { }
+static inline void walt_prepare_migrate(struct task_struct *p, struct rq *rq,
+ bool locked) { }
+static inline void walt_finish_migrate(struct task_struct *p, struct rq *rq,
+ bool locked) { }
+static inline void walt_init_new_task_load(struct task_struct *p) { }
+static inline void walt_mark_task_starting(struct task_struct *p) { }
+static inline void walt_set_window_start(struct rq *rq) { }
+static inline void walt_migrate_sync_cpu(int cpu) { }
+static inline u64 walt_ktime_clock(void) { return 0; }
+static inline void walt_account_irqtime(int cpu, struct task_struct *curr,
+ u64 delta, u64 wallclock) { }
+
+static inline unsigned long cpu_walt_util(struct rq *rq)
+{
+ return rq->cfs.avg.util_avg;
+}
+
+#endif /* CONFIG_SCHED_WALT */
+
+extern unsigned int walt_ravg_window;
+
+#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a13bbda..e0f61db 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -439,6 +439,17 @@ static struct ctl_table kern_table[] = {
.extra2 = &one,
},
#endif
+#ifdef CONFIG_SCHED_WALT
+ {
+ .procname = "sched_use_walt_metrics",
+ .data = &sysctl_sched_use_walt_metrics,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+#endif /* CONFIG_SCHED_WALT */
#ifdef CONFIG_CFS_BANDWIDTH
{
.procname = "sched_cfs_bandwidth_slice_us",
--
TheMan
From: Srivatsa Vaddagiri <[email protected]>
Add the necessary hooks to core and the various scheduling
classes that will allow WALT to track CPU utilization and
handle task migration between CPUs as well.
With CONFIG_SCHED_WALT enabled, schedutil will use WALT's cpu
utilization metric by default. This can be switched to PELT's
util_avg at runtime by the following command:
echo 0 > /proc/sys/kernel/sched_use_walt_metrics
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
Signed-off-by: Vikram Mulukutla <[email protected]>
---
kernel/sched/core.c | 29 ++++++++++++++++++++++++++++-
kernel/sched/deadline.c | 7 +++++++
kernel/sched/debug.c | 9 +++++++++
kernel/sched/fair.c | 9 +++++++--
kernel/sched/rt.c | 6 ++++++
5 files changed, 57 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44817c6..3b7f67d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -91,6 +91,8 @@
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>
+#include "walt.h"
+
DEFINE_MUTEX(sched_domains_mutex);
DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
@@ -991,6 +993,7 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
p->on_rq = TASK_ON_RQ_MIGRATING;
dequeue_task(rq, p, 0);
+ walt_prepare_migrate(p, rq, true);
set_task_cpu(p, new_cpu);
raw_spin_unlock(&rq->lock);
@@ -998,6 +1001,7 @@ static struct rq *move_queued_task(struct rq *rq, struct task_struct *p, int new
raw_spin_lock(&rq->lock);
BUG_ON(task_cpu(p) != new_cpu);
+ walt_finish_migrate(p, rq, true);
enqueue_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
@@ -1257,7 +1261,9 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(src_rq, p, 0);
+ walt_prepare_migrate(p, src_rq, true);
set_task_cpu(p, cpu);
+ walt_finish_migrate(p, dst_rq, true);
activate_task(dst_rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(dst_rq, p, 0);
@@ -2072,13 +2078,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
*/
smp_cond_load_acquire(&p->on_cpu, !VAL);
+ raw_spin_lock(&task_rq(p)->lock);
+ walt_update_task_ravg(p, task_rq(p), TASK_WAKE, walt_ktime_clock(), 0);
+ raw_spin_unlock(&task_rq(p)->lock);
+
p->sched_contributes_to_load = !!task_contributes_to_load(p);
p->state = TASK_WAKING;
cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
+ walt_prepare_migrate(p, task_rq(p), false);
set_task_cpu(p, cpu);
+ walt_finish_migrate(p, cpu_rq(cpu), false);
}
#endif /* CONFIG_SMP */
@@ -2129,8 +2141,10 @@ static void try_to_wake_up_local(struct task_struct *p, struct pin_cookie cookie
trace_sched_waking(p);
- if (!task_on_rq_queued(p))
+ if (!task_on_rq_queued(p)) {
+ walt_update_task_ravg(p, rq, TASK_WAKE, walt_ktime_clock(), 0);
ttwu_activate(rq, p, ENQUEUE_WAKEUP);
+ }
ttwu_do_wakeup(rq, p, 0, cookie);
if (schedstat_enabled())
@@ -2196,6 +2210,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.nr_migrations = 0;
p->se.vruntime = 0;
INIT_LIST_HEAD(&p->se.group_node);
+ walt_init_new_task_load(p);
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
@@ -2570,6 +2585,8 @@ void wake_up_new_task(struct task_struct *p)
rq = __task_rq_lock(p, &rf);
post_init_entity_util_avg(&p->se);
+ walt_mark_task_starting(p);
+
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
trace_sched_wakeup_new(p);
@@ -3071,6 +3088,7 @@ void scheduler_tick(void)
update_rq_clock(rq);
curr->sched_class->task_tick(rq, curr, 0);
cpu_load_update_active(rq);
+ walt_update_task_ravg(rq->curr, rq, TASK_UPDATE, walt_ktime_clock(), 0);
calc_global_load_tick(rq);
raw_spin_unlock(&rq->lock);
@@ -3322,6 +3340,7 @@ static void __sched notrace __schedule(bool preempt)
struct pin_cookie cookie;
struct rq *rq;
int cpu;
+ u64 wallclock;
cpu = smp_processor_id();
rq = cpu_rq(cpu);
@@ -3385,6 +3404,9 @@ static void __sched notrace __schedule(bool preempt)
update_rq_clock(rq);
next = pick_next_task(rq, prev, cookie);
+ wallclock = walt_ktime_clock();
+ walt_update_task_ravg(prev, rq, PUT_PREV_TASK, wallclock, 0);
+ walt_update_task_ravg(next, rq, PICK_NEXT_TASK, wallclock, 0);
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
rq->clock_skip_update = 0;
@@ -7284,6 +7306,8 @@ static void sched_rq_cpu_starting(unsigned int cpu)
{
struct rq *rq = cpu_rq(cpu);
+ walt_set_window_start(rq);
+
rq->calc_load_update = calc_load_update;
update_max_interval();
}
@@ -7304,6 +7328,9 @@ int sched_cpu_dying(unsigned int cpu)
/* Handle pending wakeups and then migrate everything off */
sched_ttwu_pending();
raw_spin_lock_irqsave(&rq->lock, flags);
+
+ walt_migrate_sync_cpu(cpu);
+
if (rq->rd) {
BUG_ON(!cpumask_test_cpu(cpu, rq->rd->span));
set_rq_offline(rq);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1ce8867..0dd3c1f 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -15,6 +15,7 @@
* Fabio Checconi <[email protected]>
*/
#include "sched.h"
+#include "walt.h"
#include <linux/slab.h>
@@ -278,7 +279,9 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
* By now the task is replenished and enqueued; migrate it.
*/
deactivate_task(rq, p, 0);
+ walt_prepare_migrate(p, rq, true);
set_task_cpu(p, later_rq->cpu);
+ walt_finish_migrate(p, later_rq, true);
activate_task(later_rq, p, 0);
if (!fallback)
@@ -1512,7 +1515,9 @@ retry:
}
deactivate_task(rq, next_task, 0);
+ walt_prepare_migrate(next_task, rq, true);
set_task_cpu(next_task, later_rq->cpu);
+ walt_finish_migrate(next_task, later_rq, true);
activate_task(later_rq, next_task, 0);
ret = 1;
@@ -1600,7 +1605,9 @@ static void pull_dl_task(struct rq *this_rq)
resched = true;
deactivate_task(src_rq, p, 0);
+ walt_prepare_migrate(p, src_rq, true);
set_task_cpu(p, this_cpu);
+ walt_finish_migrate(p, this_rq, true);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 2a0a999..ab10031 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -607,6 +607,15 @@ do { \
P(nr_switches);
P(nr_load_updates);
P(nr_uninterruptible);
+#ifdef CONFIG_SMP
+ P(cpu_capacity_orig);
+ P(cpu_capacity);
+#ifdef CONFIG_SCHED_WALT
+ P(window_start);
+ P(curr_runnable_sum);
+ P(prev_runnable_sum);
+#endif
+#endif
PN(next_balance);
SEQ_printf(m, " .%-30s: %ld\n", "curr->pid", (long)(task_pid_nr(rq->curr)));
PN(clock);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39c826d..182dcd3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -34,6 +34,7 @@
#include <trace/events/sched.h>
#include "sched.h"
+#include "walt.h"
/*
* Targeted preemption latency for CPU-bound tasks:
@@ -2885,6 +2886,7 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
if (cpu == smp_processor_id() && &rq->cfs == cfs_rq) {
unsigned long max = rq->cpu_capacity_orig;
+ unsigned long util = cpu_walt_util(rq);
/*
* There are a few boundary cases this might miss but it should
@@ -2902,8 +2904,8 @@ static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq)
*
* See cpu_util().
*/
- cpufreq_update_util(rq_clock(rq),
- min(cfs_rq->avg.util_avg, max), max);
+
+ cpufreq_update_util(rq_clock(rq), min(util, max), max);
}
}
@@ -6205,7 +6207,9 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
p->on_rq = TASK_ON_RQ_MIGRATING;
deactivate_task(env->src_rq, p, 0);
+ walt_prepare_migrate(p, env->src_rq, true);
set_task_cpu(p, env->dst_cpu);
+ /* update WALT later under the dest rq's lock */
}
/*
@@ -6337,6 +6341,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
lockdep_assert_held(&rq->lock);
BUG_ON(task_rq(p) != rq);
+ walt_finish_migrate(p, rq, true);
activate_task(rq, p, 0);
p->on_rq = TASK_ON_RQ_QUEUED;
check_preempt_curr(rq, p, 0);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d5690b7..130040c 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -8,6 +8,8 @@
#include <linux/slab.h>
#include <linux/irq_work.h>
+#include "walt.h"
+
int sched_rr_timeslice = RR_TIMESLICE;
static int do_sched_rt_period_timer(struct rt_bandwidth *rt_b, int overrun);
@@ -1843,7 +1845,9 @@ retry:
}
deactivate_task(rq, next_task, 0);
+ walt_prepare_migrate(next_task, rq, true);
set_task_cpu(next_task, lowest_rq->cpu);
+ walt_finish_migrate(next_task, lowest_rq, true);
activate_task(lowest_rq, next_task, 0);
ret = 1;
@@ -2097,7 +2101,9 @@ static void pull_rt_task(struct rq *this_rq)
resched = true;
deactivate_task(src_rq, p, 0);
+ walt_prepare_migrate(p, src_rq, true);
set_task_cpu(p, this_cpu);
+ walt_finish_migrate(p, this_rq, true);
activate_task(this_rq, p, 0);
/*
* We continue with the search, just in
--
TheMan
On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>
> We propose Window-Assisted Load Tracking (WALT) as an alternative or additional
> load tracking scheme in lieu of or along with PELT, one that in our estimation
> better tracks task demand and CPU utilization especially for use cases on
> mobile devices. WALT was conceived by Srivatsa Vaddagiri to provide better
> perf-per-watt numbers on asymmetric CPU (frequency and/or IPC) implementations,
> (specifically on Qualcomm Snapgdragon chipsets running Android) and its metrics
> have been used to guide task placement and p-state selection (load balancing
> in CFS still uses PELT statistics). WALT is now present in the android-common
> kernel as well.
And how come we only learn of it after its already shipping? Isn't that
arse backwards?
On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +static int cpu_is_waiting_on_io(struct rq *rq)
> +{
> + if (!walt_io_is_busy)
> + return 0;
> +
> + return atomic_read(&rq->nr_iowait);
> +}
This is just drug induced nonsense. The per-cpu nr_iowait number is
completely without meaning.
On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +u64 walt_ktime_clock(void)
> +{
> + if (unlikely(walt_ktime_suspended))
> + return ktime_to_ns(ktime_last);
> + return ktime_get_ns();
> +}
> +static int walt_suspend(void)
> +{
> + ktime_last = ktime_get();
> + walt_ktime_suspended = true;
> + return 0;
> +}
No, ktime_get() will not be used in the scheduler. Imagine the joy if
that thing ends up being the HPET.
On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq, bool locked)
> +{
> + u64 wallclock;
> + unsigned long flags;
> +
> + if (!p->on_rq && p->state != TASK_WAKING)
> + return;
> +
> + if (locked == false)
> + raw_spin_lock_irqsave(&dest_rq->lock, flags);
> +
> +
> + if (locked == false)
> + raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
> +}
> +
> +void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq, bool locked)
> +{
> + u64 wallclock;
> + unsigned long flags;
> +
> + if (!p->on_rq && p->state != TASK_WAKING)
> + return;
> +
> + if (locked == false)
> + raw_spin_lock_irqsave(&src_rq->lock, flags);
> +
> +
> + if (locked == false)
> + raw_spin_unlock_irqrestore(&src_rq->lock, flags);
> +}
Seriously bad style that. Please, less bonghits before writing code.
On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
> This RFC patch has been tested on live X86 machines with the following sanity
> and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick Bellasi
> for initial code reviews):
>
> (Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti graphics,
> with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
> baseline. WALT window size was 10ms. Only deltas above 3% are considered
> non-noise.Power measured with Intel RAPL counters)
Was this comparison done using the use_walt_metric sysctl knob?
On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
> +static int account_cpu_busy_time(struct rq *rq, struct task_struct *p,
> + u64 irqtime, int event)
> +{
> + if (is_idle_task(p)) {
> + /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
> + if (event == PICK_NEXT_TASK)
> + return 0;
> +
> + /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
> + return irqtime || cpu_is_waiting_on_io(rq);
> + }
> +
> + if (event == TASK_WAKE)
> + return 0;
> +
> + if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
> + event == TASK_UPDATE)
> + return 1;
> +
> + /* Only TASK_MIGRATE && PICK_NEXT_TASK left */
> + return walt_freq_account_wait_time;
> +}
switch() wasn't an option?
On 2016-10-28 00:29, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>>
>> We propose Window-Assisted Load Tracking (WALT) as an alternative or
>> additional
>> load tracking scheme in lieu of or along with PELT, one that in our
>> estimation
>> better tracks task demand and CPU utilization especially for use cases
>> on
>> mobile devices. WALT was conceived by Srivatsa Vaddagiri to provide
>> better
>> perf-per-watt numbers on asymmetric CPU (frequency and/or IPC)
>> implementations,
>> (specifically on Qualcomm Snapgdragon chipsets running Android) and
>> its metrics
>> have been used to guide task placement and p-state selection (load
>> balancing
>> in CFS still uses PELT statistics). WALT is now present in the
>> android-common
>> kernel as well.
>
> And how come we only learn of it after its already shipping? Isn't that
> arse backwards?
Yes, but also we were not confident that it would be close to being
acceptable upstream since it was intricately tied to our HMP scheduler.
However, now that more parties including the folks at ARM are interested,
and given that EAS exists and schedutil was merged into mainline, we felt it
the right time to try and introduce the concept. In general we are seriously
trying to get more things upstream and converge.
On 2016-10-28 00:49, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>> This RFC patch has been tested on live X86 machines with the following
>> sanity
>> and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick
>> Bellasi
>> for initial code reviews):
>>
>> (Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
>> graphics,
>> with the same frequency list as above. Running Ubuntu 16.04 on a
>> v4.8.2
>> baseline. WALT window size was 10ms. Only deltas above 3% are
>> considered
>> non-noise.Power measured with Intel RAPL counters)
>
> Was this comparison done using the use_walt_metric sysctl knob?
Yes, it was. You will want to see numbers against a pure 4.8.2 without any
of the WALT code, correct?
On Fri, Oct 28, 2016 at 12:10:42AM -0700, Vikram Mulukutla wrote:
> @@ -2072,13 +2078,19 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> */
> smp_cond_load_acquire(&p->on_cpu, !VAL);
>
> + raw_spin_lock(&task_rq(p)->lock);
> + walt_update_task_ravg(p, task_rq(p), TASK_WAKE, walt_ktime_clock(), 0);
> + raw_spin_unlock(&task_rq(p)->lock);
> +
Lol. funny that.
No.
On 2016-10-28 00:43, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
>> +u64 walt_ktime_clock(void)
>> +{
>> + if (unlikely(walt_ktime_suspended))
>> + return ktime_to_ns(ktime_last);
>> + return ktime_get_ns();
>> +}
>
>> +static int walt_suspend(void)
>> +{
>> + ktime_last = ktime_get();
>> + walt_ktime_suspended = true;
>> + return 0;
>> +}
>
> No, ktime_get() will not be used in the scheduler. Imagine the joy if
> that thing ends up being the HPET.
Agreed, this is an artifact from the full implementation that feeds into the
interactive governor, and thus both needed to use the same time source. It
shall go away.
On 2016-10-28 00:46, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:10:41AM -0700, Vikram Mulukutla wrote:
>> +void walt_finish_migrate(struct task_struct *p, struct rq *dest_rq,
>> bool locked)
>> +{
>> + u64 wallclock;
>> + unsigned long flags;
>> +
>> + if (!p->on_rq && p->state != TASK_WAKING)
>> + return;
>> +
>> + if (locked == false)
>> + raw_spin_lock_irqsave(&dest_rq->lock, flags);
>> +
>
>> +
>> + if (locked == false)
>> + raw_spin_unlock_irqrestore(&dest_rq->lock, flags);
>> +}
>> +
>> +void walt_prepare_migrate(struct task_struct *p, struct rq *src_rq,
>> bool locked)
>> +{
>> + u64 wallclock;
>> + unsigned long flags;
>> +
>> + if (!p->on_rq && p->state != TASK_WAKING)
>> + return;
>> +
>> + if (locked == false)
>> + raw_spin_lock_irqsave(&src_rq->lock, flags);
>> +
>
>> +
>> + if (locked == false)
>> + raw_spin_unlock_irqrestore(&src_rq->lock, flags);
>> +}
>
> Seriously bad style that. Please, less bonghits before writing code.
This was my bad personal attempt at eliminating double-locking from the
original code. This was pointed out earlier and shall go away once I can
come up with a way to merge this into enqueue/dequeue sans bonghits :-)
On Fri, Oct 28, 2016 at 12:57:05AM -0700, Vikram Mulukutla wrote:
> On 2016-10-28 00:49, Peter Zijlstra wrote:
> >On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
> >>This RFC patch has been tested on live X86 machines with the following
> >>sanity
> >>and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick
> >>Bellasi
> >>for initial code reviews):
> >>
> >>(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
> >>graphics,
> >>with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
> >>baseline. WALT window size was 10ms. Only deltas above 3% are considered
> >>non-noise.Power measured with Intel RAPL counters)
> >
> >Was this comparison done using the use_walt_metric sysctl knob?
>
> Yes, it was. You will want to see numbers against a pure 4.8.2 without any
> of the WALT code, correct?
Yep, because with the sysctl we still run all the accounting code. So
esp things like the hackbench run are meaningless (note that even the
CONFIG thing doesn't take out everything).
Also, I think it makes sense to always (also) compare against the
"performance" governor. That way you can see the drop in absolute
performance etc..
On Fri, Oct 28, 2016 at 12:55:52AM -0700, Vikram Mulukutla wrote:
> On 2016-10-28 00:29, Peter Zijlstra wrote:
> >On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
> >>
> >>We propose Window-Assisted Load Tracking (WALT) as an alternative or
> >>additional load tracking scheme in lieu of or along with PELT, one
> >>that in our estimation better tracks task demand and CPU utilization
> >>especially for use cases on mobile devices. WALT was conceived by
> >>Srivatsa Vaddagiri to provide better perf-per-watt numbers on
> >>asymmetric CPU (frequency and/or IPC) implementations, (specifically
> >>on Qualcomm Snapgdragon chipsets running Android) and its metrics
> >>have been used to guide task placement and p-state selection (load
> >>balancing in CFS still uses PELT statistics). WALT is now present in
> >>the android-common kernel as well.
> >
> >And how come we only learn of it after its already shipping? Isn't that
> >arse backwards?
>
> Yes, but also we were not confident that it would be close to being
> acceptable
> upstream since it was intricately tied to our HMP scheduler. However now
> that
> more parties including the folks at ARM are interested, and given that EAS
> exists and schedutil was merged into mainline, we felt it the right time
> to try and introduce the concept. In general we are seriously trying to get
> more things upstream and converge.
That's good to hear, great!
That said, I would much prefer to modify/augment existing PELT stuff
than to add an entirely new set of accounting on the side of it. But
yes, I (and others) are aware of the pain points you mentioned with
using the PELT signal for schedutil.
We'll have to see what we can come up with.
On 2016-10-28 01:49, Peter Zijlstra wrote:
> On Fri, Oct 28, 2016 at 12:57:05AM -0700, Vikram Mulukutla wrote:
>> On 2016-10-28 00:49, Peter Zijlstra wrote:
>> >On Fri, Oct 28, 2016 at 12:10:39AM -0700, Vikram Mulukutla wrote:
>> >>This RFC patch has been tested on live X86 machines with the following
>> >>sanity
>> >>and benchmark results (thanks to Juri Lelli, Dietmar Eggeman, Patrick
>> >>Bellasi
>> >>for initial code reviews):
>> >>
>> >>(Tested on an Intel i7 2nd generation CPU, 8GB RAM, Nvidia GTX950Ti
>> >>graphics,
>> >>with the same frequency list as above. Running Ubuntu 16.04 on a v4.8.2
>> >>baseline. WALT window size was 10ms. Only deltas above 3% are considered
>> >>non-noise.Power measured with Intel RAPL counters)
>> >
>> >Was this comparison done using the use_walt_metric sysctl knob?
>>
>> Yes, it was. You will want to see numbers against a pure 4.8.2 without
>> any
>> of the WALT code, correct?
>
> Yep, because with the sysctl we still run all the accounting code. So
> esp things like the hackbench run are meaningless (note that even the
> CONFIG thing doesn't take out everything).
>
> Also, I think it makes sense to always (also) compare against the
> "performance" governor. That way you can see the drop in absolute
> performance etc..
Ok, here are some numbers. I should be able to get the rest during the week.
The averages are pretty close, so I figured I would include some percentile
numbers. PELT and WALT numbers are with schedutil. On average it seems we're
introducing about 0.5% overhead with the current additional accounting.
Test: perf bench sched messaging -g 50 -l 5000
+-------------+-------------+----------+----------+-----------+
| 4.8.2-walt | Performance | Ondemand | PELT | WALT(10ms)|
+-------------+-------------+----------+----------+-----------+
| | | | | |
| 90th | 17.077 | 17.088 | 17.088 | 17.159 |
| | | | | |
| 96th | 17.117 | 17.421 | 17.198 | 17.343 |
| | | | | |
| Average | 16.910 | 16.916 | 16.937 | 16.962 |
| | | | | |
+-------------+-------------+----------+----------+-----------+
| 4.8.2-raw | Performance | Ondemand | PELT | WALT(10ms)|
+-------------+-------------+----------+----------+-----------+
| | | | | |
| 90th | 16.919 | 17.1694 | 16.986 | 0 |
| | | | | |
| 96th | 16.980 | 17.364 | 17.052 | 0 |
| | | | | |
| Average | 16.841 | 16.902 | 16.860 | 0 |
+-------------+-------------+----------+----------+-----------+