This is a respin of:
https://lkml.org/lkml/2017/11/9/546
which has been rebased on v4.15-rc2 so that util_est now works on top
of PeterZ's recent:
[PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
The aim of this series is to improve some PELT behaviors to make it a
better fit for the scheduling of tasks common in embedded mobile
use-cases, without affecting other classes of workloads.
A complete description of these behaviors has been presented in the
previous RFC [1] and further discussed during the last OSPM Summit [2]
as well as during the last two LPCs.
This series presents an implementation which improves on the initial RFC's
prototype. Specifically, this new implementation has been verified not to
noticeably impact the performance of:
perf bench sched messaging --pipe --thread --group 8 --loop 50000
when running 30 iterations on a dual-socket Intel(R) Xeon(R) CPU E5-2690 v2
@ 3.00GHz, with 10 cores (20 threads) per socket, and with
sched_feat(UTIL_EST) set to False.
With this feature enabled, the measured overhead is in the range of ~1%
on the same HW/SW test configuration.
That's the main reason why this sched feature is disabled by default.
A possible improvement could be the addition of a Kconfig option to toggle
the sched_feat default value on systems where a 1% overhead on hackbench is
not a concern, e.g. mobile systems, especially considering the benefits that
estimated utilization brings to the workloads of interest.
From a functional standpoint, this implementation shows a more stable
utilization signal, compared to mainline, when running synthetic benchmarks
describing a set of interesting target use-cases.
This allows for a better selection of the target CPU as well as a
faster selection of the most appropriate OPP.
A detailed description of the functional tests used has already been
covered in the previous RFC [1].
This series is based on v4.15-rc2 and is composed of four patches:
1) a small refactoring preparing the ground
2) introducing the required data structures to track util_est of both
TASKs and CPUs
3) make use of util_est in the wakeup and load balance paths
4) make use of util_est in schedutil for frequency selection
Cheers Patrick
.:: References
==============
[1] https://lkml.org/lkml/2017/8/25/195
[2] slides: http://retis.sssup.it/ospm-summit/Downloads/OSPM_PELT_DecayClampingVsUtilEst.pdf
video: http://youtu.be/adnSHPBGS-w
Changes v1->v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
Patrick Bellasi (4):
sched/fair: always use unsigned long for utilization
sched/fair: add util_est on top of PELT
sched/fair: use util_est in LB and WU paths
sched/cpufreq_schedutil: use util_est for OPP selection
include/linux/sched.h | 21 +++++
kernel/sched/cpufreq_schedutil.c | 6 +-
kernel/sched/debug.c | 4 +
kernel/sched/fair.c | 184 ++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 5 ++
kernel/sched/sched.h | 1 +
6 files changed, 209 insertions(+), 12 deletions(-)
--
2.14.1
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.
The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, a CPU which is actually running a big task can be temporarily
represented as almost empty. This leads to a race condition where other
tasks can potentially be allocated on a CPU which has just started to run a
big task that slept for a relatively long period.
Moreover, the PELT utilization of a task can be updated every millisecond,
thus making it a continuously changing value for certain longer-running
tasks. This means that the instantaneous PELT utilization of a RUNNING task
is not really meaningful for properly supporting scheduler decisions.
For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.
This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:
util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))
This allows us to remember how big a task has been reported to be by PELT
in its previous activations, via the function f(task::util_avg@dequeue_times).
If a task changes its behavior and runs even longer in a new activation,
after a certain time its util_est will just track the original PELT signal
(i.e. task::util_avg).
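As a rough, self-contained user-space sketch of the aggregation described
above (the sample values are purely illustrative; the helper mirrors the
fixed-point update introduced later in this patch):

#include <stdio.h>

#define UTIL_EST_WEIGHT_SHIFT	2	/* new sample weight w = 1/4 */

/*
 * One EWMA step, in the same fixed-point form used by the patch:
 *   ewma(t) = w * sample + (1 - w) * ewma(t-1)
 */
static unsigned long ewma_update(unsigned long ewma, unsigned long sample)
{
	ewma = sample + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
	return ewma >> UTIL_EST_WEIGHT_SHIFT;
}

int main(void)
{
	/*
	 * util_avg sampled at each dequeue: a task which has been big
	 * (~600) in past activations and whose PELT signal has just
	 * decayed to ~120 after a long sleep.
	 */
	unsigned long samples[] = { 600, 590, 610, 120 };
	unsigned long ewma = 0, last;
	int i;

	for (i = 0; i < 4; i++) {
		last = samples[i];
		ewma = ewma ? ewma_update(ewma, last) : last;
		/* util_est = max(last, ewma) stays "big" across the dip */
		printf("last=%lu ewma=%lu util_est=%lu\n",
		       last, ewma, last > ewma ? last : ewma);
	}
	return 0;
}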
The estimated utilization of a cfs_rq is defined only for root cfs_rqs.
That's because the only sensible consumers of this signal are the scheduler
and schedutil, when looking at the overall CPU utilization due to FAIR
tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:
util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)
where:
cfs_rq::util_est_runnable = sum(util_est(task))
for each RUNNABLE task on that root cfs_rq
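In kernel-style C (as if added to kernel/sched/fair.c) this definition boils
down to something like the following sketch; the actual accessors are the
ones introduced by this and the following patches:

static inline unsigned long cpu_util_est_sketch(struct cfs_rq *cfs_rq)
{
	/*
	 * max of the PELT utilization and the sum of the util_est of
	 * the tasks currently RUNNABLE on this root cfs_rq
	 */
	return max(cfs_rq->avg.util_avg, cfs_rq->util_est_runnable);
}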
It's worth noting that the estimated utilization is tracked only for
entities of interest, specifically:
- Tasks: to better support task placement decisions
- root cfs_rqs: to better support both task placement decisions as
  well as frequency selection
Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes v1->v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
include/linux/sched.h | 21 ++++++++++
kernel/sched/debug.c | 4 ++
kernel/sched/fair.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 5 +++
kernel/sched/sched.h | 1 +
5 files changed, 132 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21991d668d35..b01c0dc75ef5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -338,6 +338,21 @@ struct sched_avg {
unsigned long util_avg;
};
+/**
+ * Estimated utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is
+ * chosen so that the EWMA will be relatively insensitive to transient changes
+ * to the task's workload.
+ */
+struct util_est {
+ unsigned long last;
+ unsigned long ewma;
+#define UTIL_EST_WEIGHT_SHIFT 2
+};
+
struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
@@ -562,6 +577,12 @@ struct task_struct {
const struct sched_class *sched_class;
struct sched_entity se;
+ /*
+ * Since we use se.avg.util_avg to update util_est fields,
+ * this last can benefit from being close to se which
+ * also defines se.avg as cache aligned.
+ */
+ struct util_est util_est;
struct sched_rt_entity rt;
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..5ffa8234524a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -567,6 +567,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %lu\n", "util_est_runnable",
+ cfs_rq->util_est_runnable);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -1018,6 +1020,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_load_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
+ P(util_est.ewma);
+ P(util_est.last);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ad21550d008c..d8f3ed71010b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -732,6 +732,12 @@ void init_entity_runnable_average(struct sched_entity *se)
se->runnable_weight = se->load.weight;
/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
+
+ /* Utilization estimation */
+ if (entity_is_task(se)) {
+ task_of(se)->util_est.ewma = 0;
+ task_of(se)->util_est.last = 0;
+ }
}
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
@@ -5153,6 +5159,20 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
+static inline unsigned long task_util(struct task_struct *p);
+static inline unsigned long task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Update root cfs_rq's estimated utilization */
+ cfs_rq->util_est_runnable += task_util_est(p);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5205,9 +5225,84 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);
+ util_est_enqueue(p);
hrtick_update(rq);
}
+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+ unsigned long util_last = task_util(p);
+ bool sleep = flags & DEQUEUE_SLEEP;
+ unsigned long ewma;
+ long util_est;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /*
+ * Update root cfs_rq's estimated utilization
+ *
+ * If *p is the last task then the root cfs_rq's estimated utilization
+ * of a CPU is 0 by definition.
+ *
+ * Otherwise, in removing *p's util_est from its cfs_rq's
+ * util_est_runnable we should account for cases where this last
+ * activation of *p was longer then the previous ones.
+ * Also in these cases we need to set 0 the estimated utilization for
+ * the CPU.
+ */
+ if (cfs_rq->nr_running > 0) {
+ util_est = cfs_rq->util_est_runnable;
+ util_est -= task_util_est(p);
+ if (util_est < 0)
+ util_est = 0;
+ cfs_rq->util_est_runnable = util_est;
+ } else {
+ cfs_rq->util_est_runnable = 0;
+ }
+
+ /*
+ * Skip update of task's estimated utilization when the task has not
+ * yet completed an activation, e.g. being migrated.
+ */
+ if (!sleep)
+ return;
+
+ /*
+ * Skip update of task's estimated utilization when its EWMA is already
+ * ~1% close to its last activation value.
+ */
+ util_est = p->util_est.ewma;
+ if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
+ return;
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * about the task size. This is done by storing the last PELT value
+ * for this task and using this value to load another sample in the
+ * exponential weighted moving average:
+ *
+ * ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
+ * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+ * = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
+ *
+ * Where 'w' is the weight of new samples, which is configured to be
+ * 0.25, thus making w=1/4
+ */
+ p->util_est.last = util_last;
+ ewma = p->util_est.ewma;
+ if (likely(ewma != 0)) {
+ ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
+ ewma >>= UTIL_EST_WEIGHT_SHIFT;
+ } else {
+ ewma = util_last;
+ }
+ p->util_est.ewma = ewma;
+}
+
static void set_next_buddy(struct sched_entity *se);
/*
@@ -5264,6 +5359,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);
+ util_est_dequeue(p, flags);
hrtick_update(rq);
}
@@ -5721,7 +5817,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return affine;
}
-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6216,6 +6311,11 @@ static inline unsigned long task_util(struct task_struct *p)
return p->se.avg.util_avg;
}
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+ return max(p->util_est.ewma, p->util_est.last);
+}
+
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..e9f312acc0d3 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b19552a212de..8371839075fa 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -444,6 +444,7 @@ struct cfs_rq {
* CFS load tracking
*/
struct sched_avg avg;
+ unsigned long util_est_runnable;
#ifndef CONFIG_64BIT
u64 load_last_update_time_copy;
#endif
--
2.14.1
When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.
For example, since the task utilization is decayed at wakeup time, a
long-sleeping big task which has just been enqueued does not immediately
add a significant contribution to the target CPU. This introduces some
latency before schedutil will be able to detect the best frequency required
by that task.
Moreover, the PELT signal build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time increases
even more and further delays the selection of the frequency which best
serves the task's requirements.
In order to reduce this kind of latency, this patch integrates the use of
the CPU's estimated utilization in the sugov_get_util function.
The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization of each
currently RUNNABLE task on that CPU.
This allows schedutil to properly represent the expected utilization of a
CPU which, for example, has just got a big task running after a long sleep
period, and ultimately to select the best frequency to run a task right
after it wakes up.
Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes v1->v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
kernel/sched/cpufreq_schedutil.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 2f52ec0f1539..465430d99440 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
cfs_max = arch_scale_cpu_capacity(NULL, cpu);
- *util = min(rq->cfs.avg.util_avg, cfs_max);
+ *util = rq->cfs.avg.util_avg;
+ if (sched_feat(UTIL_EST))
+ *util = max(*util, rq->cfs.util_est_runnable);
+ *util = min(*util, cfs_max);
+
*max = cfs_max;
}
--
2.14.1
Utilization and capacity are tracked as unsigned long; however, some
functions using them return an int, which is ultimately assigned back to
unsigned long variables.
Since there is no scope for using a different, signed type, this patch
consolidates the signatures of the functions returning utilization to
always use the native type.
As well as improving code consistency, this is also expected to benefit
code paths where utilization should be clamped, by avoiding further type
conversions or ugly type casts.
Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Chris Redpath <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
---
Changes v1->v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
kernel/sched/fair.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4037e19bbca2..ad21550d008c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5721,8 +5721,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return affine;
}
-static inline int task_util(struct task_struct *p);
-static int cpu_util_wake(int cpu, struct task_struct *p);
+static inline unsigned long task_util(struct task_struct *p);
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
{
@@ -6203,7 +6203,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* capacity_orig) as it useful for predicting the capacity required after task
* migrations (scheduler-driven DVFS).
*/
-static int cpu_util(int cpu)
+static unsigned long cpu_util(int cpu)
{
unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);
@@ -6211,7 +6211,7 @@ static int cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}
-static inline int task_util(struct task_struct *p)
+static inline unsigned long task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
}
@@ -6220,7 +6220,7 @@ static inline int task_util(struct task_struct *p)
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
*/
-static int cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
unsigned long util, capacity;
--
2.14.1
When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.
For example, since the task utilization is decayed at wakeup time, when a
long-sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed on
the same CPU while it is still considered relatively empty.
In order to reduce this kind of race condition, this patch introduces the
required support to integrate the use of the CPU's estimated utilization
in cpu_util_wake as well as in update_sg_lb_stats.
The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization of the
tasks currently RUNNABLE on that CPU.
This allows us to properly represent the expected utilization of a CPU
which, for example, has just got a big task running after a long sleep
period.
Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes v1->v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 68 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8f3ed71010b..373d631efa91 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6306,6 +6306,41 @@ static unsigned long cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}
+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows to properly represent the expected utilization of a CPU which
+ * has just got a big task running since a long sleep period. At the same time
+ * however it preserves the benefits of the "blocked load" in describing the
+ * potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+ unsigned long util, util_est;
+ unsigned long capacity;
+ struct cfs_rq *cfs_rq;
+
+ if (!sched_feat(UTIL_EST))
+ return cpu_util(cpu);
+
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = cfs_rq->avg.util_avg;
+ util_est = cfs_rq->util_est_runnable;
+ util_est = max(util, util_est);
+
+ capacity = capacity_orig_of(cpu);
+ util_est = min(util_est, capacity);
+
+ return util_est;
+}
+
static inline unsigned long task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
@@ -6322,16 +6357,43 @@ static inline unsigned long task_util_est(struct task_struct *p)
*/
static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
- unsigned long util, capacity;
+ long util, util_est;
/* Task has no contribution or is new */
if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
- return cpu_util(cpu);
+ return cpu_util_est(cpu);
- capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ /* Discount task's blocked util from CPU's util */
+ util = cpu_util(cpu) - task_util(p);
+ util = max(util, 0L);
- return (util >= capacity) ? capacity : util;
+ if (!sched_feat(UTIL_EST))
+ return util;
+
+ /*
+ * These are the main cases covered:
+ * - if *p is the only task sleeping on this CPU, then:
+ * cpu_util (== task_util) > util_est (== 0)
+ * and thus we return:
+ * cpu_util_wake = (cpu_util - task_util) = 0
+ *
+ * - if other tasks are SLEEPING on the same CPU, which is just waking
+ * up, then:
+ * cpu_util >= task_util
+ * cpu_util > util_est (== 0)
+ * and thus we discount *p's blocked utilization to return:
+ * cpu_util_wake = (cpu_util - task_util) >= 0
+ *
+ * - if other tasks are RUNNABLE on that CPU and
+ * util_est > cpu_util
+ * then we use util_est since it returns a more restrictive
+ * estimation of the spare capacity on that CPU, by just considering
+ * the expected utilization of tasks already runnable on that CPU.
+ */
+ util_est = cpu_rq(cpu)->cfs.util_est_runnable;
+ util = max(util, util_est);
+
+ return util;
}
/*
@@ -7857,7 +7919,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);
sgs->group_load += load;
- sgs->group_util += cpu_util(i);
+ sgs->group_util += cpu_util_est(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
nr_running = rq->nr_running;
--
2.14.1
On 5 December 2017 at 18:10, Patrick Bellasi <[email protected]> wrote:
> Utilization and capacity are tracked as unsigned long, however some
> functions using them return an int which is ultimately assigned back to
> unsigned long variables.
>
> Since there is not scope on using a different and signed type, this
> consolidate the signature of functions returning utilization to always
> use the native type.
> As well as improving code consistency this is expected also benefits
> code paths where utilizations should be clamped by avoiding further type
> conversions or ugly type casts.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Reviewed-by: Chris Redpath <[email protected]>
> Reviewed-by: Brendan Jackman <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: [email protected]
Acked-by: Vincent Guittot <[email protected]>
>
> ---
> Changes v1->v2:
> - rebase on top of v4.15-rc2
> - tested that overhauled PELT code does not affect the util_est
> ---
> kernel/sched/fair.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4037e19bbca2..ad21550d008c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5721,8 +5721,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> return affine;
> }
>
> -static inline int task_util(struct task_struct *p);
> -static int cpu_util_wake(int cpu, struct task_struct *p);
> +static inline unsigned long task_util(struct task_struct *p);
> +static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
>
> static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> {
> @@ -6203,7 +6203,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> * capacity_orig) as it useful for predicting the capacity required after task
> * migrations (scheduler-driven DVFS).
> */
> -static int cpu_util(int cpu)
> +static unsigned long cpu_util(int cpu)
> {
> unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
> unsigned long capacity = capacity_orig_of(cpu);
> @@ -6211,7 +6211,7 @@ static int cpu_util(int cpu)
> return (util >= capacity) ? capacity : util;
> }
>
> -static inline int task_util(struct task_struct *p)
> +static inline unsigned long task_util(struct task_struct *p)
> {
> return p->se.avg.util_avg;
> }
> @@ -6220,7 +6220,7 @@ static inline int task_util(struct task_struct *p)
> * cpu_util_wake: Compute cpu utilization with any contributions from
> * the waking task p removed.
> */
> -static int cpu_util_wake(int cpu, struct task_struct *p)
> +static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
> {
> unsigned long util, capacity;
>
> --
> 2.14.1
>
On Tue, Dec 05, 2017 at 05:10:14PM +0000, Patrick Bellasi wrote:
> With this feature enabled, the measured overhead is in the range of ~1%
> on the same HW/SW test configuration.
That's quite a lot; did you look where that comes from?
On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> + if (cfs_rq->nr_running > 0) {
> + util_est = cfs_rq->util_est_runnable;
> + util_est -= task_util_est(p);
> + if (util_est < 0)
> + util_est = 0;
> + cfs_rq->util_est_runnable = util_est;
> + } else {
I'm thinking that's an explicit load-store to avoid intermediate values
landing in cfs_rq->util_est_runnable, right?
That would need READ_ONCE() / WRITE_ONCE() I think, without that the
compiler is free to munge the lot together.
On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> +static inline void util_est_dequeue(struct task_struct *p, int flags)
> +{
> + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> + unsigned long util_last = task_util(p);
> + bool sleep = flags & DEQUEUE_SLEEP;
> + unsigned long ewma;
> + long util_est;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /*
> + * Update root cfs_rq's estimated utilization
> + *
> + * If *p is the last task then the root cfs_rq's estimated utilization
> + * of a CPU is 0 by definition.
> + *
> + * Otherwise, in removing *p's util_est from its cfs_rq's
> + * util_est_runnable we should account for cases where this last
> + * activation of *p was longer then the previous ones.
> + * Also in these cases we need to set 0 the estimated utilization for
> + * the CPU.
> + */
> + if (cfs_rq->nr_running > 0) {
> + util_est = cfs_rq->util_est_runnable;
> + util_est -= task_util_est(p);
> + if (util_est < 0)
> + util_est = 0;
> + cfs_rq->util_est_runnable = util_est;
> + } else {
> + cfs_rq->util_est_runnable = 0;
> + }
> +
> + /*
> + * Skip update of task's estimated utilization when the task has not
> + * yet completed an activation, e.g. being migrated.
> + */
> + if (!sleep)
> + return;
> +
> + /*
> + * Skip update of task's estimated utilization when its EWMA is already
> + * ~1% close to its last activation value.
> + */
> + util_est = p->util_est.ewma;
> + if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> + return;
Isn't that computation almost as expensive as the stuff you're trying to
avoid?
> + /*
> + * Update Task's estimated utilization
> + *
> + * When *p completes an activation we can consolidate another sample
> + * about the task size. This is done by storing the last PELT value
> + * for this task and using this value to load another sample in the
> + * exponential weighted moving average:
> + *
> + * ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
> + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> + * = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
> + *
> + * Where 'w' is the weight of new samples, which is configured to be
> + * 0.25, thus making w=1/4
> + */
> + p->util_est.last = util_last;
> + ewma = p->util_est.ewma;
> + if (likely(ewma != 0)) {
Why special case 0? Yes it helps with the initial ramp-on, but would not
an asymmetric IIR (with a consistent upward bias) be better?
> + ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> + } else {
> + ewma = util_last;
> + }
> + p->util_est.ewma = ewma;
> +}
On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> @@ -562,6 +577,12 @@ struct task_struct {
>
> const struct sched_class *sched_class;
> struct sched_entity se;
> + /*
> + * Since we use se.avg.util_avg to update util_est fields,
> + * this last can benefit from being close to se which
> + * also defines se.avg as cache aligned.
> + */
> + struct util_est util_est;
> struct sched_rt_entity rt;
> #ifdef CONFIG_CGROUP_SCHED
> struct task_group *sched_task_group;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b19552a212de..8371839075fa 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -444,6 +444,7 @@ struct cfs_rq {
> * CFS load tracking
> */
> struct sched_avg avg;
> + unsigned long util_est_runnable;
> #ifndef CONFIG_64BIT
> u64 load_last_update_time_copy;
> #endif
So you put the util_est in task_struct (not sched_entity) but the
util_est_runnable in cfs_rq (not rq). Seems inconsistent.
On 13-Dec 17:03, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:14PM +0000, Patrick Bellasi wrote:
> > With this feature enabled, the measured overhead is in the range of ~1%
> > on the same HW/SW test configuration.
>
> That's quite a lot; did you look where that comes from?
I've tracked it down to util_est_dequeue() introduced by PATCH 2/4, mainly
due to the EWMA update. Initially the running average was implemented using
the library function provided in:
include/linux/average.h::DECLARE_EWMA
but that solution generated even more overhead.
That's why we switched to an "inline custom" implementation.
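For reference, the library-based variant looked roughly like the following
(a reconstruction, not the code which was actually benchmarked; the
precision/weight parameters here are illustrative):

#include <linux/average.h>
#include <linux/kernel.h>

/* 2 bits of fractional precision, new samples weighted by w = 1/4 */
DECLARE_EWMA(task_util, 2, 4)

struct util_est_lib {
	struct ewma_task_util	ewma;
	unsigned long		last;
};

/* called when a task completes an activation, i.e. at dequeue */
static inline void util_est_sample(struct util_est_lib *ue,
				   unsigned long util_last)
{
	ue->last = util_last;
	ewma_task_util_add(&ue->ewma, util_last);
}

static inline unsigned long util_est_read(struct util_est_lib *ue)
{
	return max(ue->last, ewma_task_util_read(&ue->ewma));
}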
Hackbench is quite stressful for that path and we also have a few branches
which can play a role. One, for example, has been added to avoid the EWMA
update when the rolling average is "close enough" to the current PELT value.
All that considered, that's why the sched_feat is disabled by default, in
which case we have zero overhead... in a !SCHED_DEBUG kernel the code is
actually removed by the compiler.
In mobile systems (i.e. non-hackbench scenarios) the additional benefits in
task placement and OPP selection are likely still worth the overhead.
Do you think the idea of having a Kconfig option to enable this feature
only on systems which do not care about the possible overhead is a viable
solution?
--
#include <best/regards.h>
Patrick Bellasi
On 13-Dec 17:19, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > @@ -562,6 +577,12 @@ struct task_struct {
> >
> > const struct sched_class *sched_class;
> > struct sched_entity se;
> > + /*
> > + * Since we use se.avg.util_avg to update util_est fields,
> > + * this last can benefit from being close to se which
> > + * also defines se.avg as cache aligned.
> > + */
> > + struct util_est util_est;
> > struct sched_rt_entity rt;
> > #ifdef CONFIG_CGROUP_SCHED
> > struct task_group *sched_task_group;
>
>
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index b19552a212de..8371839075fa 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -444,6 +444,7 @@ struct cfs_rq {
> > * CFS load tracking
> > */
> > struct sched_avg avg;
> > + unsigned long util_est_runnable;
> > #ifndef CONFIG_64BIT
> > u64 load_last_update_time_copy;
> > #endif
>
>
> So you put the util_est in task_struct (not sched_entity) but the
> util_est_runnable in cfs_rq (not rq). Seems inconsistent.
One goal was to keep the util_est variables close to the util_avg used to
load the filter, for cache affinity's sake.
The other goal was to have util_est data only for tasks and CPUs' RQs,
thus avoiding unused data for TGs' RQs and SEs.
Unfortunately the first goal does not allow us to completely achieve the
second and, you're right, the solution looks a bit inconsistent.
Do you think we should rather disregard cache proximity and move
util_est_runnable to rq?
--
#include <best/regards.h>
Patrick Bellasi
On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:19, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > @@ -562,6 +577,12 @@ struct task_struct {
> > >
> > > const struct sched_class *sched_class;
> > > struct sched_entity se;
> > > + /*
> > > + * Since we use se.avg.util_avg to update util_est fields,
> > > + * this last can benefit from being close to se which
> > > + * also defines se.avg as cache aligned.
> > > + */
> > > + struct util_est util_est;
The thing is, since sched_entity has a member with cacheline alignment,
the whole structure must have cacheline alignment, and this util_est
_will_ start on a new line.
See also:
$ pahole -EC task_struct defconfig/kernel/sched/core.o
...
struct sched_avg {
/* typedef u64 */ long long unsigned int last_update_time; /* 576 8 */
/* typedef u64 */ long long unsigned int load_sum; /* 584 8 */
/* typedef u32 */ unsigned int util_sum; /* 592 4 */
/* typedef u32 */ unsigned int period_contrib; /* 596 4 */
long unsigned int load_avg; /* 600 8 */
long unsigned int util_avg; /* 608 8 */
} avg; /* 576 40 */
/* --- cacheline 6 boundary (384 bytes) --- */
} se; /* 192 448 */
/* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
struct util_est {
long unsigned int last; /* 640 8 */
long unsigned int ewma; /* 648 8 */
} util_est; /* 640 16 */
...
The thing is somewhat confused on which cacheline is which, but you'll
see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line
#10).
> > > struct sched_rt_entity rt;
> > > #ifdef CONFIG_CGROUP_SCHED
> > > struct task_group *sched_task_group;
> One goal was to keep util_est variables close to the util_avg used to
> load the filter, for caches affinity sakes.
>
> The other goal was to have util_est data only for Tasks and CPU's
> RQ, thus avoiding unused data for TG's RQ and SE.
>
> Unfortunately the first goal does not allow to achieve completely the
> second and, you right, the solution looks a bit inconsistent.
>
> Do you think we should better disregard cache proximity and move
> util_est_runnable to rq?
proximity is likely important; I'd suggest moving util_est into
sched_entity.
On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> This is a respin of:
> https://lkml.org/lkml/2017/11/9/546
> which has been rebased on v4.15-rc2 to have util_est now working on top
> of the recent PeterZ's:
> [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
>
> The aim of this series is to improve some PELT behaviors to make it a
> better fit for the scheduling of tasks common in embedded mobile
> use-cases, without affecting other classes of workloads.
I thought perhaps this patch set would improve the below behavior, but
alas it does not.  That's 3 instances of firefox playing youtube clips
being shoved into a corner by hogs sitting on 7 of 8 runqueues.  PELT
serializes the threaded desktop, making that threading kinda pointless,
and CFS not all that fair.
6569 root 20 0 4048 704 628 R 100.0 0.004 5:10.48 7 cpuhog
6573 root 20 0 4048 712 636 R 100.0 0.004 5:07.47 5 cpuhog
6581 root 20 0 4048 696 620 R 100.0 0.004 5:07.36 1 cpuhog
6585 root 20 0 4048 812 736 R 100.0 0.005 5:08.14 4 cpuhog
6589 root 20 0 4048 712 636 R 100.0 0.004 5:06.42 6 cpuhog
6577 root 20 0 4048 720 644 R 99.80 0.005 5:06.52 3 cpuhog
6593 root 20 0 4048 728 652 R 99.60 0.005 5:04.25 0 cpuhog
6755 mikeg 20 0 2714788 885324 179196 S 19.96 5.544 2:14.36 2 Web Content
6620 mikeg 20 0 2318348 312336 145044 S 8.383 1.956 0:51.51 2 firefox
3190 root 20 0 323944 71704 42368 S 3.194 0.449 0:11.90 2 Xorg
3718 root 20 0 3009580 67112 49256 S 0.599 0.420 0:02.89 2 kwin_x11
3761 root 20 0 769760 90740 62048 S 0.399 0.568 0:03.46 2 konsole
3845 root 9 -11 791224 20132 14236 S 0.399 0.126 0:03.00 2 pulseaudio
3722 root 20 0 3722308 172568 88088 S 0.200 1.081 0:04.35 2 plasmashel
------------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Sum delay ms | Maximum delay at |
------------------------------------------------------------------------------------------------------------------------------------
Web Content:6755 | 2864.862 ms | 7314 | avg: 0.299 ms | max: 40.374 ms | sum: 2189.472 ms | max at: 375.769240 |
Compositor:6680 | 1889.847 ms | 4672 | avg: 0.531 ms | max: 29.092 ms | sum: 2478.559 ms | max at: 375.759405 |
MediaPl~back #3:(13) | 3269.777 ms | 7853 | avg: 0.218 ms | max: 19.451 ms | sum: 1711.635 ms | max at: 391.123970 |
MediaPl~back #4:(10) | 1472.986 ms | 8189 | avg: 0.236 ms | max: 18.653 ms | sum: 1933.886 ms | max at: 376.124211 |
MediaPl~back #1:(9) | 601.788 ms | 6598 | avg: 0.247 ms | max: 17.823 ms | sum: 1627.852 ms | max at: 401.122567 |
firefox:6620 | 303.181 ms | 6232 | avg: 0.111 ms | max: 15.602 ms | sum: 691.865 ms | max at: 385.078558 |
Socket Thread:6639 | 667.537 ms | 4806 | avg: 0.069 ms | max: 12.638 ms | sum: 329.387 ms | max at: 380.827323 |
MediaPD~oder #1:6835 | 154.737 ms | 1592 | avg: 0.700 ms | max: 10.139 ms | sum: 1113.688 ms | max at: 392.575370 |
MediaTimer #1:6828 | 42.660 ms | 5250 | avg: 0.575 ms | max: 9.845 ms | sum: 3018.994 ms | max at: 380.823677 |
MediaPD~oder #2:6840 | 150.822 ms | 1583 | avg: 0.703 ms | max: 9.639 ms | sum: 1112.962 ms | max at: 380.823741 |
...
On 13-Dec 18:03, Peter Zijlstra wrote:
> On Wed, Dec 13, 2017 at 04:36:53PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:19, Peter Zijlstra wrote:
> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > > @@ -562,6 +577,12 @@ struct task_struct {
> > > >
> > > > const struct sched_class *sched_class;
> > > > struct sched_entity se;
> > > > + /*
> > > > + * Since we use se.avg.util_avg to update util_est fields,
> > > > + * this last can benefit from being close to se which
> > > > + * also defines se.avg as cache aligned.
> > > > + */
> > > > + struct util_est util_est;
>
> The thing is, since sched_entity has a member with cacheline alignment,
> the whole structure must have cacheline alignment, and this util_est
> _will_ start on a new line.
Right, I was not considering that "aligned" also affects the start of the
following data.
> See also:
>
> $ pahole -EC task_struct defconfig/kernel/sched/core.o
>
> ...
> struct sched_avg {
> /* typedef u64 */ long long unsigned int last_update_time; /* 576 8 */
> /* typedef u64 */ long long unsigned int load_sum; /* 584 8 */
> /* typedef u32 */ unsigned int util_sum; /* 592 4 */
> /* typedef u32 */ unsigned int period_contrib; /* 596 4 */
> long unsigned int load_avg; /* 600 8 */
> long unsigned int util_avg; /* 608 8 */
> } avg; /* 576 40 */
> /* --- cacheline 6 boundary (384 bytes) --- */
> } se; /* 192 448 */
> /* --- cacheline 8 boundary (512 bytes) was 24 bytes ago --- */
> struct util_est {
> long unsigned int last; /* 640 8 */
> long unsigned int ewma; /* 648 8 */
> } util_est; /* 640 16 */
> ...
>
> The thing is somewhat confused on which cacheline is which, but you'll
> see sched_avg landing at 576 (cacheline #9) and util_est at 640 (line
> #10).
>
> > > > struct sched_rt_entity rt;
> > > > #ifdef CONFIG_CGROUP_SCHED
> > > > struct task_group *sched_task_group;
>
> > One goal was to keep util_est variables close to the util_avg used to
> > load the filter, for caches affinity sakes.
> >
> > The other goal was to have util_est data only for Tasks and CPU's
> > RQ, thus avoiding unused data for TG's RQ and SE.
> >
> > Unfortunately the first goal does not allow to achieve completely the
> > second and, you right, the solution looks a bit inconsistent.
> >
> > Do you think we should better disregard cache proximity and move
> > util_est_runnable to rq?
>
> proximity is likely important; I'd suggest moving util_est into
> sched_entity.
So, by moving util_est right after sched_avg, here is what we get (with some
lines to better highlight 64B boundaries):
const struct sched_class * sched_class; /* 152 8 */
struct sched_entity {
[...]
---[ Line 9 ]-------------------------------------------------------------------------------
struct sched_avg {
/* typedef u64 */ long long unsigned int last_update_time; /* 576 8 */
/* typedef u64 */ long long unsigned int load_sum; /* 584 8 */
/* typedef u64 */ long long unsigned int runnable_load_sum; /* 592 8 */
/* typedef u32 */ unsigned int util_sum; /* 600 4 */
/* typedef u32 */ unsigned int period_contrib; /* 604 4 */
long unsigned int load_avg; /* 608 8 */
long unsigned int runnable_load_avg; /* 616 8 */
long unsigned int util_avg; /* 624 8 */
} avg; /* 576 56 */
/* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */
struct util_est {
long unsigned int last; /* 632 8 */
---[ Line 10 ]------------------------------------------------------------------------------
long unsigned int ewma; /* 640 8 */
} util_est; /* 632 16 */
} se; /* 192 512 */
---[ Line 11 ]------------------------------------------------------------------------------
/* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */
struct sched_rt_entity {
struct list_head {
struct list_head * next; /* 704 8 */
struct list_head * prev; /* 712 8 */
} run_list; /* 704 16 */
As you can see we still end up with util_est spanning across two cache
lines and, even worse, with an almost empty Line 10. The point is that
sched_avg already uses 56B... which leaves just 8 bytes.
So, I can move util_est there and use unsigned int for "last" and "ewma"
storage. This should fix the cache alignment, but only until we add other
stuff to sched_avg.
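Something like this (sketch only):

struct util_est {
	u32	last;
	u32	ewma;
#define UTIL_EST_WEIGHT_SHIFT 2
};

i.e. 8 bytes in total, which would fit in the 8 bytes left on the cache
line after the 56 bytes used by sched_avg.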
BTW, shouldn't it be possible to use a similar slimming approach for
load_avg and runnable_load_avg? Given their range, a u32 should be good
enough, shouldn't it?
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 13-Dec 17:16, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > +static inline void util_est_dequeue(struct task_struct *p, int flags)
> > +{
> > + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> > + unsigned long util_last = task_util(p);
> > + bool sleep = flags & DEQUEUE_SLEEP;
> > + unsigned long ewma;
> > + long util_est;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /*
> > + * Update root cfs_rq's estimated utilization
> > + *
> > + * If *p is the last task then the root cfs_rq's estimated utilization
> > + * of a CPU is 0 by definition.
> > + *
> > + * Otherwise, in removing *p's util_est from its cfs_rq's
> > + * util_est_runnable we should account for cases where this last
> > + * activation of *p was longer then the previous ones.
> > + * Also in these cases we need to set 0 the estimated utilization for
> > + * the CPU.
> > + */
> > + if (cfs_rq->nr_running > 0) {
> > + util_est = cfs_rq->util_est_runnable;
> > + util_est -= task_util_est(p);
> > + if (util_est < 0)
> > + util_est = 0;
> > + cfs_rq->util_est_runnable = util_est;
> > + } else {
> > + cfs_rq->util_est_runnable = 0;
> > + }
> > +
> > + /*
> > + * Skip update of task's estimated utilization when the task has not
> > + * yet completed an activation, e.g. being migrated.
> > + */
> > + if (!sleep)
> > + return;
> > +
> > + /*
> > + * Skip update of task's estimated utilization when its EWMA is already
> > + * ~1% close to its last activation value.
> > + */
> > + util_est = p->util_est.ewma;
> > + if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > + return;
>
> Isn't that computation almost as expensive as the stuff you're trying to
> avoid?
Mmm... maybe slightly simpler. I'll profile it again, but I remember I
added it because it was slightly better on hackbench.
In the end this code is just a "sub" and a "compare to constant", and it's
likely to bail out early for all "almost regular" tasks.
Are you worried about the branch overhead?
> > + /*
> > + * Update Task's estimated utilization
> > + *
> > + * When *p completes an activation we can consolidate another sample
> > + * about the task size. This is done by storing the last PELT value
> > + * for this task and using this value to load another sample in the
> > + * exponential weighted moving average:
> > + *
> > + * ewma(t) = w * task_util(p) + (1 - w) ewma(t-1)
> > + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > + * = w * (task_util(p) + ewma(t-1) / w - ewma(t-1))
> > + *
> > + * Where 'w' is the weight of new samples, which is configured to be
> > + * 0.25, thus making w=1/4
> > + */
> > + p->util_est.last = util_last;
> > + ewma = p->util_est.ewma;
> > + if (likely(ewma != 0)) {
>
> Why special case 0? Yes it helps with the initial ramp-on, but would not
> an asymmetric IIR (with a consistent upward bias) be better?
Yes, maybe the fast ramp-up is not really necessary... I'll test without it
on some real use-cases and see if we really get any noticeable benefit,
otherwise I'll remove it.
Thanks for pointing this out.
> > + ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > + } else {
> > + ewma = util_last;
> > + }
> > + p->util_est.ewma = ewma;
> > +}
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Dec 15, 2017 at 12:14:17PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:16, Peter Zijlstra wrote:
> > > + /*
> > > + * Skip update of task's estimated utilization when its EWMA is already
> > > + * ~1% close to its last activation value.
> > > + */
> > > + util_est = p->util_est.ewma;
> > > + if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > > + return;
> >
> > Isn't that computation almost as expensive as the stuff you're trying to
> > avoid?
>
> Mmm... maybe slightly simpler. I'll profile it again but I remember
> I've added it because it was slightly better on backbench.
>
> This code at the end it's just a "sub" and a "compare to constant" and
> it's likely to bail early for all "almost regular" tasks.
>
> Are you worried about the branch overhead?
It's a subtract, a test for sign, a conditional branch on test, a negate,
a subtract of a constant and another conditional branch.
Branch overhead certainly matters too.
> > > + p->util_est.last = util_last;
> > > + ewma = p->util_est.ewma;
> > > + if (likely(ewma != 0)) {
> >
> > Why special case 0? Yes it helps with the initial ramp-on, but would not
> > an asymmetric IIR (with a consistent upward bias) be better?
>
> Yes, maybe the fast ramp-up is not really necessary... I'll test it
> without on some real use-cases and see if we really get any noticiable
> benefit, otheriwse I'll remove it.
>
> Thanks for pointing this out.
>
> > > + ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > > + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > > + } else {
> > > + ewma = util_last;
> > > + }
> > > + p->util_est.ewma = ewma;
And this, without the 0 case, is a shift, an add, a subtract and another
shift, followed by a store.
Which is fewer branches and roughly similar arith ops, some of which can be
done in parallel.
I suspect what you saw on the profile is the cacheline hit of the store,
but I'm not sure.
On Fri, Dec 15, 2017 at 12:03:31PM +0000, Patrick Bellasi wrote:
> So, by moving util_est right after sched_avg, here is what we get (with some
> lines to better highlight 64B boundaries):
>
> const struct sched_class * sched_class; /* 152 8 */
> struct sched_entity {
> [...]
> ---[ Line 9 ]-------------------------------------------------------------------------------
> struct sched_avg {
> /* typedef u64 */ long long unsigned int last_update_time; /* 576 8 */
> /* typedef u64 */ long long unsigned int load_sum; /* 584 8 */
> /* typedef u64 */ long long unsigned int runnable_load_sum; /* 592 8 */
> /* typedef u32 */ unsigned int util_sum; /* 600 4 */
> /* typedef u32 */ unsigned int period_contrib; /* 604 4 */
> long unsigned int load_avg; /* 608 8 */
> long unsigned int runnable_load_avg; /* 616 8 */
> long unsigned int util_avg; /* 624 8 */
> } avg; /* 576 56 */
> /* --- cacheline 6 boundary (384 bytes) was 24 bytes ago --- */
> struct util_est {
> long unsigned int last; /* 632 8 */
> ---[ Line 10 ]------------------------------------------------------------------------------
> long unsigned int ewma; /* 640 8 */
> } util_est; /* 632 16 */
> } se; /* 192 512 */
> ---[ Line 11 ]------------------------------------------------------------------------------
> /* --- cacheline 9 boundary (576 bytes) was 24 bytes ago --- */
> struct sched_rt_entity {
> struct list_head {
> struct list_head * next; /* 704 8 */
> struct list_head * prev; /* 712 8 */
> } run_list; /* 704 16 */
>
>
> As you can see we still end up with util_est spanning acrosss two cache and
> even worst with an almost empty Line 10. The point is that sched_avg already
> uses 56B... which leave just 8bytes left.
Yes, that's unfortunate.
> So, I can to move util_est there and use unsigned int for "last" and "ewma"
> storage. This should fix the cache alignment but only until we do not add
> other stuff to sched_avg.
>
> BTW, should not be possible to use a similar "fasting" approach for load_avg
> and runnable_load_avg? Given their range a u32 should be just good enough,
> isn't it?
Probably, I'd have to page all that stuff back in :/
Another issue is that for tasks load and runnable_load are the exact
same; I just never found a sensible way to collapse that.
On 13-Dec 17:05, Peter Zijlstra wrote:
> On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > + if (cfs_rq->nr_running > 0) {
> > + util_est = cfs_rq->util_est_runnable;
> > + util_est -= task_util_est(p);
> > + if (util_est < 0)
> > + util_est = 0;
> > + cfs_rq->util_est_runnable = util_est;
> > + } else {
>
> I'm thinking that's an explicit load-store to avoid intermediate values
> landing in cfs_rq->util_esp_runnable, right?
Was mainly to have an unsigned util_est for the following "sub"...
> That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> compiler is free to munge the lot together.
... do we still need the {READ,WRITE}_ONCE() in this case?
I guess adding them however does not hurt.
Looking back at that code, it can likely be optimized to avoid the else
branch... will update and add the barriers.
Cheers Patrick.
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:
> On 13-Dec 17:05, Peter Zijlstra wrote:
> > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > + if (cfs_rq->nr_running > 0) {
> > > + util_est = cfs_rq->util_est_runnable;
> > > + util_est -= task_util_est(p);
> > > + if (util_est < 0)
> > > + util_est = 0;
> > > + cfs_rq->util_est_runnable = util_est;
> > > + } else {
> >
> > I'm thinking that's an explicit load-store to avoid intermediate values
> > landing in cfs_rq->util_esp_runnable, right?
>
> Was mainly to have an unsigned util_est for the following "sub"...
>
>
> > That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> > compiler is free to munge the lot together.
>
> ... do we still need the {READ,WRITE}_ONCE() in this case?
> I guess adding them however does not hurts.
I think the compiler is free to munge it into something like:
cfs_rq->util_est_runnable -= task_util_est();
if (cfs_rq->util_est_runnable < 0)
cfs_rq->util_est_runnable = 0
and it's a fairly simple optimization at that; it just needs to observe
that util_est is an alias for cfs_rq->util_est_runnable.
Using the volatile load/store completely destroys that optimization
though, so yes, I'd say it's definitely needed.
On 15-Dec 15:07, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 02:02:18PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:05, Peter Zijlstra wrote:
> > > On Tue, Dec 05, 2017 at 05:10:16PM +0000, Patrick Bellasi wrote:
> > > > + if (cfs_rq->nr_running > 0) {
> > > > + util_est = cfs_rq->util_est_runnable;
> > > > + util_est -= task_util_est(p);
> > > > + if (util_est < 0)
> > > > + util_est = 0;
> > > > + cfs_rq->util_est_runnable = util_est;
> > > > + } else {
> > >
> > > I'm thinking that's an explicit load-store to avoid intermediate values
> > > landing in cfs_rq->util_esp_runnable, right?
> >
> > Was mainly to have an unsigned util_est for the following "sub"...
> >
> >
> > > That would need READ_ONCE() / WRITE_ONCE() I think, without that the
> > > compiler is free to munge the lot together.
> >
> > ... do we still need the {READ,WRITE}_ONCE() in this case?
> > I guess adding them however does not hurts.
>
This is just to better understand....
> I think the compiler is free to munge it into something like:
>
> cfs_rq->util_est_runnable -= task_util_est();
> if (cfs_rq->util_est_runnable < 0)
> cfs_rq->util_est_runnable = 0
>
I'm still confused, we have:
long util_est
unsigned long cfs_rq->util_est_runnable
The optimization above can potentially generate an overflow, can't it?
> and its a fairly simple optimization at that, it just needs to observe
> that util_est is an alias for cfs_rq->util_est_runnable.
Since the first is signed and the latter unsigned, can the compiler still
consider them an alias?
At least on ARM I would expect a load of an unsigned value, some
computations on "signed registers", and finally a store of an unsigned
value. This is what I get:
if (cfs_rq->nr_running > 0) {
51e4: 91020000 add x0, x0, #0x80
51e8: b9401802 ldr w2, [x0,#24]
51ec: 340004a2 cbz w2, 5280 <dequeue_task_fair+0xb20>
// skip branch for nr_running == 0
return max(p->util_est.ewma, p->util_est.last);
51f0: f9403ba2 ldr x2, [x29,#112]
51f4: f9418044 ldr x4, [x2,#768]
51f8: f9418443 ldr x3, [x2,#776]
// x3 := p->util_est.ewma
// x4 := p->util_est.last
util_est -= task_util_est(p);
51fc: f9405002 ldr x2, [x0,#160]
// x2 := cfs_rq->util_est_runnable
return max(p->util_est.ewma, p->util_est.last);
5200: eb04007f cmp x3, x4
5204: 9a842063 csel x3, x3, x4, cs
// x3 := task_util_est(p) := max(p->util_est.ewma, p->util_est.last)
cfs_rq->util_est_runnable = util_est;
5208: eb030042 subs x2, x2, x3
// x2 := util_est -= task_util_est(p);
520c: 9a9f5042 csel x2, x2, xzr, pl
// x2 := max(0, util_est)
5210: f9005002 str x2, [x0,#160]
// store back into cfs_rq->util_est_runnable
And by adding {READ,WRITE}_ONCE I still get the same code.
When compiling for x86, instead, I get two different versions; here is the
one without {READ,WRITE}_ONCE:
if (cfs_rq->nr_running > 0) {
3e3e: 8b 90 98 00 00 00 mov 0x98(%rax),%edx
3e44: 85 d2 test %edx,%edx
3e46: 0f 84 e0 00 00 00 je 3f2c <dequeue_task_fair+0xf9c>
util_est = cfs_rq->util_est_runnable;
util_est -= task_util_est(p);
3e4c: 48 8b 74 24 28 mov 0x28(%rsp),%rsi
3e51: 48 8b 96 80 02 00 00 mov 0x280(%rsi),%rdx
3e58: 48 39 96 88 02 00 00 cmp %rdx,0x288(%rsi)
3e5f: 48 0f 43 96 88 02 00 cmovae 0x288(%rsi),%rdx
3e66: 00
if (util_est < 0)
util_est = 0;
cfs_rq->util_est_runnable = util_est;
3e67: 48 8b b0 20 01 00 00 mov 0x120(%rax),%rsi
3e6e: 48 29 d6 sub %rdx,%rsi
3e71: 48 89 f2 mov %rsi,%rdx
3e74: be 00 00 00 00 mov $0x0,%esi
3e79: 48 0f 48 d6 cmovs %rsi,%rdx
3e7d: 48 89 90 20 01 00 00 mov %rdx,0x120(%rax)
but I'm not confident on "parsing it"...
> Using the volatile load/store completely destroys that optimization
> though, so yes, I'd say its definitely needed.
Ok, since it's definitely not harmful, I'll add it.
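I.e. something like the following (just a sketch on top of the hunk
quoted above, keeping the signed local; the else branch stays as it is):

	if (cfs_rq->nr_running > 0) {
		/* Load once, compute on a signed local, store once */
		long util_est = READ_ONCE(cfs_rq->util_est_runnable);

		util_est -= task_util_est(p);
		if (util_est < 0)
			util_est = 0;
		WRITE_ONCE(cfs_rq->util_est_runnable, util_est);
	}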
--
#include <best/regards.h>
Patrick Bellasi
On 15-Dec 13:53, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 12:14:17PM +0000, Patrick Bellasi wrote:
> > On 13-Dec 17:16, Peter Zijlstra wrote:
>
> > > > + /*
> > > > + * Skip update of task's estimated utilization when its EWMA is already
> > > > + * ~1% close to its last activation value.
> > > > + */
> > > > + util_est = p->util_est.ewma;
> > > > + if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > > > + return;
> > >
> > > Isn't that computation almost as expensive as the stuff you're trying to
> > > avoid?
> >
> > Mmm... maybe slightly simpler. I'll profile it again, but I remember
> > I added it because it was slightly better on hackbench.
> >
> > In the end this code is just a "sub" and a "compare to constant", and
> > it's likely to bail out early for all "almost regular" tasks.
> >
> > Are you worried about the branch overhead?
>
> It's a subtract, a test for sign, a conditional branch on that test, a negate,
> a subtract of a constant and another conditional branch.
Close enough, the actual code is:
util_est = p->util_est.ewma;
5218: f9403ba3 ldr x3, [x29,#112]
521c: f9418462 ldr x2, [x3,#776]
if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
5220: eb010040 subs x0, x2, x1
5224: da805400 cneg x0, x0, mi
5228: f100281f cmp x0, #0xa
522c: 54fff9cd b.le 5164 <dequeue_task_fair+0xa04>
>
> Branch overhead certainly matters too.
>
> > > > + p->util_est.last = util_last;
> > > > + ewma = p->util_est.ewma;
> > > > + if (likely(ewma != 0)) {
> > >
> > > Why special-case 0? Yes, it helps with the initial ramp-up, but would not
> > > an asymmetric IIR (with a consistent upward bias) be better?
> >
> > Yes, maybe the fast ramp-up is not really necessary... I'll test
> > without it on some real use-cases and see if we really get any noticeable
> > benefit, otherwise I'll remove it.
> >
> > Thanks for pointing this out.
> >
> > > > + ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
> > > > + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > > > + } else {
> > > > + ewma = util_last;
> > > > + }
> > > > + p->util_est.ewma = ewma;
>
> And this, without the 0 case, is a shift, an add, a subtract and another
> shift, followed by a store.
Actual code:
p->util_est.last = util_last;
5230: f9018061 str x1, [x3,#768]
if (likely(ewma != 0)) {
5234: b40000a2 cbz x2, 5248 <dequeue_task_fair+0xae8>
ewma = util_last + (ewma << UTIL_EST_WEIGHT_SHIFT) - ewma;
5238: d37ef440 lsl x0, x2, #2
523c: cb020002 sub x2, x0, x2
5240: 8b010041 add x1, x2, x1
ewma >>= UTIL_EST_WEIGHT_SHIFT;
5244: d342fc21 lsr x1, x1, #2
p->util_est.ewma = ewma;
5248: f9403ba0 ldr x0, [x29,#112]
524c: f9018401 str x1, [x0,#776]
5250: 17ffffc5 b 5164 <dequeue_task_fair+0xa04>
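For reference, with UTIL_EST_WEIGHT_SHIFT=2 (as per the shifts above),
the update is just a moving average with a 1/4 weight for the new
sample, i.e. the same arithmetic as the code above, only spelled out:

	ewma = (util_last + (ewma << 2) - ewma) >> 2
	     = (util_last + 3 * ewma) / 4
	     = ewma + (util_last - ewma) / 4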
>
> Which is less branches and roughly similar arith ops, some of which can
> be done in parallel.
>
> I suspect what you saw on the profile is the cacheline hit of the store,
> but I'm not sure.
Yes, likely. Looking at the two chunks above, and considering the
removal of the 0 case, it's probably worth removing the first check.
I'll give it another try to measure the hackbench overhead with the cache
alignment fixed.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
Hi Mike,
On 13-Dec 18:56, Mike Galbraith wrote:
> On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> > This is a respin of:
> > https://lkml.org/lkml/2017/11/9/546
> > which has been rebased on v4.15-rc2 to have util_est now working on top
> > of the recent PeterZ's:
> > [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> >
> > The aim of this series is to improve some PELT behaviors to make it a
> > better fit for the scheduling of tasks common in embedded mobile
> > use-cases, without affecting other classes of workloads.
>
> I thought perhaps this patch set would improve the below behavior, but
> alas it does not. That's 3 instances of firefox playing youtube clips
> being shoved into a corner by hogs sitting on 7 of 8 runqueues. PELT
> serializes the threaded desktop, making that threading kinda pointless,
> and CFS not all that fair.
Perhaps I don't completely get your use-case.
Are the cpuhog threads pinned to a CPU, or do they just happen to always
run on the same CPU?
I guess you would expect the three Firefox instances to be spread across
different CPUs. But whether this is possible also depends on the
specific task composition generated by Firefox, doesn't it?
Being a video playback pipeline, I would not be surprised to see that
most of the time we actually have only 1 or 2 tasks RUNNABLE, while
the others are sleeping... and if a HW decoder is involved, even if
you have three instances running you likely get only one pipeline
active at any one time...
If that's the case, why should CFS move Firefox tasks around?
> 6569 root 20 0 4048 704 628 R 100.0 0.004 5:10.48 7 cpuhog
> 6573 root 20 0 4048 712 636 R 100.0 0.004 5:07.47 5 cpuhog
> 6581 root 20 0 4048 696 620 R 100.0 0.004 5:07.36 1 cpuhog
> 6585 root 20 0 4048 812 736 R 100.0 0.005 5:08.14 4 cpuhog
> 6589 root 20 0 4048 712 636 R 100.0 0.004 5:06.42 6 cpuhog
> 6577 root 20 0 4048 720 644 R 99.80 0.005 5:06.52 3 cpuhog
> 6593 root 20 0 4048 728 652 R 99.60 0.005 5:04.25 0 cpuhog
> 6755 mikeg 20 0 2714788 885324 179196 S 19.96 5.544 2:14.36 2 Web Content
> 6620 mikeg 20 0 2318348 312336 145044 S 8.383 1.956 0:51.51 2 firefox
> 3190 root 20 0 323944 71704 42368 S 3.194 0.449 0:11.90 2 Xorg
> 3718 root 20 0 3009580 67112 49256 S 0.599 0.420 0:02.89 2 kwin_x11
> 3761 root 20 0 769760 90740 62048 S 0.399 0.568 0:03.46 2 konsole
> 3845 root 9 -11 791224 20132 14236 S 0.399 0.126 0:03.00 2 pulseaudio
> 3722 root 20 0 3722308 172568 88088 S 0.200 1.081 0:04.35 2 plasmashel
Is this always happening... or do the Firefox tasks sometimes get a chance
to run on CPUs other than CPU2?
Could it be that, looking at htop output, we just don't see these small opportunities?
> ------------------------------------------------------------------------------------------------------------------------------------
> Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Sum delay ms | Maximum delay at |
> ------------------------------------------------------------------------------------------------------------------------------------
> Web Content:6755 | 2864.862 ms | 7314 | avg: 0.299 ms | max: 40.374 ms | sum: 2189.472 ms | max at: 375.769240 |
> Compositor:6680 | 1889.847 ms | 4672 | avg: 0.531 ms | max: 29.092 ms | sum: 2478.559 ms | max at: 375.759405 |
> MediaPl~back #3:(13) | 3269.777 ms | 7853 | avg: 0.218 ms | max: 19.451 ms | sum: 1711.635 ms | max at: 391.123970 |
> MediaPl~back #4:(10) | 1472.986 ms | 8189 | avg: 0.236 ms | max: 18.653 ms | sum: 1933.886 ms | max at: 376.124211 |
> MediaPl~back #1:(9) | 601.788 ms | 6598 | avg: 0.247 ms | max: 17.823 ms | sum: 1627.852 ms | max at: 401.122567 |
> firefox:6620 | 303.181 ms | 6232 | avg: 0.111 ms | max: 15.602 ms | sum: 691.865 ms | max at: 385.078558 |
> Socket Thread:6639 | 667.537 ms | 4806 | avg: 0.069 ms | max: 12.638 ms | sum: 329.387 ms | max at: 380.827323 |
> MediaPD~oder #1:6835 | 154.737 ms | 1592 | avg: 0.700 ms | max: 10.139 ms | sum: 1113.688 ms | max at: 392.575370 |
> MediaTimer #1:6828 | 42.660 ms | 5250 | avg: 0.575 ms | max: 9.845 ms | sum: 3018.994 ms | max at: 380.823677 |
> MediaPD~oder #2:6840 | 150.822 ms | 1583 | avg: 0.703 ms | max: 9.639 ms | sum: 1112.962 ms | max at: 380.823741 |
How do you get these stats?
It's definitely an interesting use-case; however, I think it's out of
the scope of util_est.
Regarding the specific statement "CFS not all that fair", I would say
that the fairness of CFS is defined, and has to be evaluated, within a
single CPU and on a temporal (not clock-cycles) basis.
AFAIK, vruntime progresses based on elapsed time, thus you can have
two tasks which get the same slice of time but consume it at different
frequencies. In that case too we are not that fair, are we?
Thus, in the end it all boils down to some (as much as possible)
low-overhead heuristics, and a proper description of a
reproducible use-case can help in improving them.
Can we model your use-case using a simple rt-app configuration?
This would likely give us a simple and reproducible testing
scenario to better understand where the issue actually is...
maybe by looking at an execution trace.
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Fri, 2017-12-15 at 16:13 +0000, Patrick Bellasi wrote:
> Hi Mike,
>
> On 13-Dec 18:56, Mike Galbraith wrote:
> > On Tue, 2017-12-05 at 17:10 +0000, Patrick Bellasi wrote:
> > > This is a respin of:
> > > https://lkml.org/lkml/2017/11/9/546
> > > which has been rebased on v4.15-rc2 to have util_est now working on top
> > > of the recent PeterZ's:
> > > [PATCH -v2 00/18] sched/fair: A bit of a cgroup/PELT overhaul
> > >
> > > The aim of this series is to improve some PELT behaviors to make it a
> > > better fit for the scheduling of tasks common in embedded mobile
> > > use-cases, without affecting other classes of workloads.
> >
> > I thought perhaps this patch set would improve the below behavior, but
> > alas it does not. That's 3 instances of firefox playing youtube clips
> > being shoved into a corner by hogs sitting on 7 of 8 runqueues. PELT
> > serializes the threaded desktop, making that threading kinda pointless,
> > and CFS not all that fair.
>
> Perhaps I don't completely get your use-case.
> Are the cpuhog threads pinned to a CPU, or do they just happen to always
> run on the same CPU?
Nothing is pinned.
> I guess you would expect the three Firefox instances to be spread across
> different CPUs. But whether this is possible also depends on the
> specific task composition generated by Firefox, doesn't it?
It depends on load balancing. We're letting firefox threads stack up
to 6 deep while single hogs dominate the box.
> Being a video playback pipeline, I would not be surprised to see that
> most of the time we actually have only 1 or 2 tasks RUNNABLE, while
> the others are sleeping... and if a HW decoder is involved, even if
> you have three instances running you likely get only one pipeline
> active at any one time...
>
> If that's the case, why should CFS move Firefox tasks around?
No, while they are indeed ~fairly synchronous, there is overlap. If
there were not, there would be no wait time being accumulated. The load
wants to consume roughly one full core's worth, but to achieve that, it
needs access to more than one runqueue, which we are not facilitating.
> Is this always happening... or do the Firefox tasks sometimes get a chance
> to run on CPUs other than CPU2?
There is some escape going on, but not enough for the load to get its
fair share. I have it sort of fixed up locally, but while the patch keeps
changing, it's not getting any prettier, nor is it particularly
interested in letting me keep some performance gains I want, so...
> How do you get these stats?
perf sched record/perf sched lat. I twiddled it to output accumulated
wait times as well for convenience, stock only shows max. See below.
If you play with perf sched, you'll notice some... oddities about it.
> It's definitely an interesting use-case; however, I think it's out of
> the scope of util_est.
Yeah. If I had been less busy and read the whole thing, I wouldn't
have taken it out for a spin.
> Regarding the specific statement "CFS not all that fair", I would say
> that the fairness of CFS is defined, and has to be evaluated, within a
> single CPU and on a temporal (not clock-cycles) basis.
No, that doesn't really fly. In fact, in the group scheduling code, we
actively pursue box-wide fairness. PELT is going a bit too far ATM.
Point: if you think it's OK to serialize these firefox threads, would
you still think so if those were kernel threads instead? Serializing
your kernel is a clear fail, but unpinned kthreads can be stacked up
just as effectively as those browser threads are, eat needless wakeup
latency and pass it on.
> AFAIK, vruntime progresses based on elapsed time, thus you can have
> two tasks which get the same slice of time but consume it at different
> frequencies. In that case too we are not that fair, are we?
Time slices don't really exist as a concrete quantum in CFS. There's
vruntime equalization, and that's it.
> Thus, in the end it all boils down to some (as much as possible)
> low-overhead heuristics, and a proper description of a
> reproducible use-case can help in improving them.
Nah, heuristics are fickle beasts, they WILL knife you in the back,
it's just a question of how often, and how deep.
> Can we model your use-case using a simple rt-app configuration?
No idea.
> This would likely give us a simple and reproducible testing
> scenario to better understand where the issue actually is...
> maybe by looking at an execution trace.
It should be reproducible by anyone, just fire up NR_CPUS-1 pure hogs,
point firefox at youtube, open three clips in tabs, watch tasks stack.
Root cause IMHO is PELT having grown too aggressive. SIS was made more
aggressive to compensate, but when you slam that door you get the full
PELT impact, and it stings, as does too aggressive bouncing when you
leave the escape hatch open. Sticky wicket, that. Both of those want a
gentle rap upside the head, as they're both acting a bit nutty.
-Mike
---
tools/perf/builtin-sched.c | 34 ++++++++++++++++++++++++++--------
1 file changed, 26 insertions(+), 8 deletions(-)
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -212,6 +212,7 @@ struct perf_sched {
u64 run_avg;
u64 all_runtime;
u64 all_count;
+ u64 all_lat;
u64 cpu_last_switched[MAX_CPUS];
struct rb_root atom_root, sorted_atom_root, merged_atom_root;
struct list_head sort_list, cmp_pid;
@@ -1286,6 +1287,7 @@ static void output_lat_thread(struct per
sched->all_runtime += work_list->total_runtime;
sched->all_count += work_list->nb_atoms;
+ sched->all_lat += work_list->total_lat;
if (work_list->num_merged > 1)
ret = printf(" %s:(%d) ", thread__comm_str(work_list->thread), work_list->num_merged);
@@ -1298,10 +1300,11 @@ static void output_lat_thread(struct per
avg = work_list->total_lat / work_list->nb_atoms;
timestamp__scnprintf_usec(work_list->max_lat_at, max_lat_at, sizeof(max_lat_at));
- printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | max at: %13s s\n",
+ printf("|%11.3f ms |%9" PRIu64 " | avg:%9.3f ms | max:%9.3f ms | sum:%9.3f ms | max at: %13s s\n",
(double)work_list->total_runtime / NSEC_PER_MSEC,
work_list->nb_atoms, (double)avg / NSEC_PER_MSEC,
(double)work_list->max_lat / NSEC_PER_MSEC,
+ (double)work_list->total_lat / NSEC_PER_MSEC,
max_lat_at);
}
@@ -1347,6 +1350,16 @@ static int max_cmp(struct work_atoms *l,
return 0;
}
+static int sum_cmp(struct work_atoms *l, struct work_atoms *r)
+{
+ if (l->total_lat < r->total_lat)
+ return -1;
+ if (l->total_lat > r->total_lat)
+ return 1;
+
+ return 0;
+}
+
static int switch_cmp(struct work_atoms *l, struct work_atoms *r)
{
if (l->nb_atoms < r->nb_atoms)
@@ -1378,6 +1391,10 @@ static int sort_dimension__add(const cha
.name = "max",
.cmp = max_cmp,
};
+ static struct sort_dimension sum_sort_dimension = {
+ .name = "sum",
+ .cmp = sum_cmp,
+ };
static struct sort_dimension pid_sort_dimension = {
.name = "pid",
.cmp = pid_cmp,
@@ -1394,6 +1411,7 @@ static int sort_dimension__add(const cha
&pid_sort_dimension,
&avg_sort_dimension,
&max_sort_dimension,
+ &sum_sort_dimension,
&switch_sort_dimension,
&runtime_sort_dimension,
};
@@ -3090,9 +3108,9 @@ static int perf_sched__lat(struct perf_s
perf_sched__merge_lat(sched);
perf_sched__sort_lat(sched);
- printf("\n -----------------------------------------------------------------------------------------------------------------\n");
- printf(" Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Maximum delay at |\n");
- printf(" -----------------------------------------------------------------------------------------------------------------\n");
+ printf("\n ------------------------------------------------------------------------------------------------------------------------------------\n");
+ printf(" Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Sum delay ms | Maximum delay at |\n");
+ printf(" ------------------------------------------------------------------------------------------------------------------------------------\n");
next = rb_first(&sched->sorted_atom_root);
@@ -3105,11 +3123,11 @@ static int perf_sched__lat(struct perf_s
thread__zput(work_list->thread);
}
- printf(" -----------------------------------------------------------------------------------------------------------------\n");
- printf(" TOTAL: |%11.3f ms |%9" PRIu64 " |\n",
- (double)sched->all_runtime / NSEC_PER_MSEC, sched->all_count);
+ printf(" ------------------------------------------------------------------------------------------------------------\n");
+ printf(" TOTAL: |%11.3f ms |%9" PRIu64 " | |%14.3f ms |\n",
+ (double)sched->all_runtime / NSEC_PER_MSEC, sched->all_count, (double)sched->all_lat / NSEC_PER_MSEC);
- printf(" ---------------------------------------------------\n");
+ printf(" ------------------------------------------------------------------------------------------------------------\n");
print_bad_events(sched);
printf("\n");
On Tuesday, December 5, 2017 6:10:18 PM CET Patrick Bellasi wrote:
> When schedutil looks at the CPU utilization, the current PELT value for
> that CPU is returned straight away. In certain scenarios this can have
> undesired side effects and delays in frequency selection.
>
> For example, since the task utilization is decayed at wakeup time, a
> long-sleeping big task which is newly enqueued does not immediately add a
> significant contribution to the target CPU. This introduces some latency
> before schedutil will be able to detect the best frequency required by
> that task.
>
> Moreover, the PELT signal build-up time is a function of the current
> frequency, because of the scale-invariant load tracking support. Thus,
> starting from a lower frequency, the utilization build-up time will
> increase even more and further delay the selection of the actual
> frequency which best serves the task's requirements.
>
> In order to reduce this kind of latency, this patch integrates the
> use of the CPU's estimated utilization in the sugov_get_util function.
>
> The estimated utilization of a CPU is defined to be the maximum between
> its PELT utilization and the sum of the estimated utilizations of the
> currently RUNNABLE tasks on that CPU.
> This allows the expected utilization of a CPU to be properly represented
> when, for example, it has just got a big task running after a long sleep
> period, and it ultimately allows the best frequency to be selected to run
> a task right after it wakes up.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Reviewed-by: Brendan Jackman <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
> Changes v1->v2:
> - rebase on top of v4.15-rc2
> - tested that overhauled PELT code does not affect the util_est
> ---
> kernel/sched/cpufreq_schedutil.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 2f52ec0f1539..465430d99440 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
>
> cfs_max = arch_scale_cpu_capacity(NULL, cpu);
>
> - *util = min(rq->cfs.avg.util_avg, cfs_max);
> + *util = rq->cfs.avg.util_avg;
I would use a local variable here.
That *util everywhere looks a bit dirtyish.
> + if (sched_feat(UTIL_EST))
> + *util = max(*util, rq->cfs.util_est_runnable);
> + *util = min(*util, cfs_max);
> +
> *max = cfs_max;
> }
>
>
On Fri, 2017-12-15 at 21:23 +0100, Mike Galbraith wrote:
>
> Point: if you think it's OK to serialize these firefox threads, would
> you still think so if those were kernel threads instead? Serializing
> your kernel is a clear fail, but unpinned kthreads can be stacked up
> just as effectively as those browser threads are, eat needless wakeup
> latency and pass it on.
FWIW, somewhat cheezy example of that below.
(later, /me returns to [apparently endless] squabble w. PELT/SIS;)
bonnie in nfs mount of own box competing with 7 hogs:
------------------------------------------------------------------------------------------------------------------------------------
Task | Runtime ms | Switches | Average delay ms | Maximum delay ms | Sum delay ms | Maximum delay at |
------------------------------------------------------------------------------------------------------------------------------------
kworker/3:0:29 | 630.078 ms | 89669 | avg: 0.011 ms | max: 102.340 ms | sum: 962.919 ms | max at: 310.501277 |
kworker/3:1H:464 | 1179.868 ms | 101944 | avg: 0.005 ms | max: 102.232 ms | sum: 480.915 ms | max at: 310.501273 |
kswapd0:78 | 2662.230 ms | 1661 | avg: 0.128 ms | max: 93.935 ms | sum: 213.258 ms | max at: 310.503419 |
nfsd:2039 | 3257.143 ms | 78448 | avg: 0.112 ms | max: 86.039 ms | sum: 8795.767 ms | max at: 258.847140 |
nfsd:2038 | 3185.730 ms | 76253 | avg: 0.113 ms | max: 78.348 ms | sum: 8580.676 ms | max at: 258.831370 |
nfsd:2042 | 3256.554 ms | 81423 | avg: 0.110 ms | max: 74.941 ms | sum: 8929.015 ms | max at: 288.397203 |
nfsd:2040 | 3314.826 ms | 80396 | avg: 0.105 ms | max: 51.039 ms | sum: 8471.816 ms | max at: 363.870078 |
nfsd:2036 | 3058.867 ms | 70460 | avg: 0.115 ms | max: 44.629 ms | sum: 8092.319 ms | max at: 250.074253 |
nfsd:2037 | 3113.592 ms | 74276 | avg: 0.115 ms | max: 43.294 ms | sum: 8556.110 ms | max at: 310.443722 |
konsole:4013 | 402.509 ms | 894 | avg: 0.148 ms | max: 38.129 ms | sum: 132.050 ms | max at: 332.156495 |
haveged:497 | 11.831 ms | 1224 | avg: 0.104 ms | max: 37.575 ms | sum: 127.706 ms | max at: 350.669645 |
nfsd:2043 | 3316.033 ms | 78303 | avg: 0.115 ms | max: 36.511 ms | sum: 8995.138 ms | max at: 248.576108 |
nfsd:2035 | 3064.108 ms | 67413 | avg: 0.115 ms | max: 28.221 ms | sum: 7746.306 ms | max at: 313.785682 |
bash:7022 | 0.342 ms | 1 | avg: 22.959 ms | max: 22.959 ms | sum: 22.959 ms | max at: 262.258960 |
kworker/u16:4:354 | 2073.383 ms | 1550 | avg: 0.050 ms | max: 21.203 ms | sum: 77.185 ms | max at: 332.220678 |
kworker/4:3:6975 | 1189.868 ms | 115776 | avg: 0.018 ms | max: 20.856 ms | sum: 2071.894 ms | max at: 348.142757 |
kworker/2:4:6981 | 335.895 ms | 26617 | avg: 0.023 ms | max: 20.726 ms | sum: 625.102 ms | max at: 248.522083 |
bash:7021 | 0.517 ms | 2 | avg: 10.363 ms | max: 20.726 ms | sum: 20.727 ms | max at: 262.235708 |
ksoftirqd/2:22 | 65.718 ms | 998 | avg: 0.138 ms | max: 19.072 ms | sum: 137.827 ms | max at: 332.221676 |
kworker/7:3:6969 | 625.724 ms | 84153 | avg: 0.010 ms | max: 18.838 ms | sum: 876.603 ms | max at: 264.188983 |
bonnie:6965 | 79637.998 ms | 35434 | avg: 0.007 ms | max: 18.719 ms | sum: 256.748 ms | max at: 331.299867 |
Hi Rafael,
On 16-Dec 03:35, Rafael J. Wysocki wrote:
> On Tuesday, December 5, 2017 6:10:18 PM CET Patrick Bellasi wrote:
[...]
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index 2f52ec0f1539..465430d99440 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -183,7 +183,11 @@ static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
> >
> > cfs_max = arch_scale_cpu_capacity(NULL, cpu);
> >
> > - *util = min(rq->cfs.avg.util_avg, cfs_max);
> > + *util = rq->cfs.avg.util_avg;
>
> I would use a local variable here.
>
> That *util everywhere looks a bit dirtyish.
Yes, right... will update for the next respin.
>
> > + if (sched_feat(UTIL_EST))
> > + *util = max(*util, rq->cfs.util_est_runnable);
> > + *util = min(*util, cfs_max);
> > +
> > *max = cfs_max;
> > }
> >
> >
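Just to double check I got your point, the next respin would then look
something like this (only a sketch, with 'util_cfs' as the local
variable name):

	unsigned long util_cfs = rq->cfs.avg.util_avg;

	if (sched_feat(UTIL_EST))
		util_cfs = max(util_cfs, rq->cfs.util_est_runnable);

	*util = min(util_cfs, cfs_max);
	*max = cfs_max;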
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:
> Close enough, the actual code is:
>
> util_est = p->util_est.ewma;
> 5218: f9403ba3 ldr x3, [x29,#112]
> 521c: f9418462 ldr x2, [x3,#776]
> if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> 5220: eb010040 subs x0, x2, x1
> 5224: da805400 cneg x0, x0, mi
> 5228: f100281f cmp x0, #0xa
> 522c: 54fff9cd b.le 5164 <dequeue_task_fair+0xa04>
Ah, that cneg instruction is cute; on x86 we end up with something like:
bool abs_test(long s)
{
return abs(s) < 32;
}
cmpl $-31, %eax
jl .L107
movq -8(%rbp), %rax
cmpl $31, %eax
jg .L107
movl $1, %eax
jmp .L108
.L107:
movl $0, %eax
.L108:
But I figured you can actually do:
abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
Which, if y is a constant, should result in nicer code, and it does for
x86:
addq $31, %rax
cmpq $62, %rax
setbe %al
movzbl %al, %eax
Just not measurably faster, I suppose because of all the dependencies :/
On Wed, Dec 20, 2017 at 09:57:47AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 03:41:40PM +0000, Patrick Bellasi wrote:
> > Close enough, the actual code is:
> >
> > util_est = p->util_est.ewma;
> > 5218: f9403ba3 ldr x3, [x29,#112]
> > 521c: f9418462 ldr x2, [x3,#776]
> > if (abs(util_est - util_last) <= (SCHED_CAPACITY_SCALE / 100))
> > 5220: eb010040 subs x0, x2, x1
> > 5224: da805400 cneg x0, x0, mi
> > 5228: f100281f cmp x0, #0xa
> > 522c: 54fff9cd b.le 5164 <dequeue_task_fair+0xa04>
>
> Ah, that cneg instruction is cute; on x86 we end up with something like:
>
> bool abs_test(long s)
> {
> return abs(s) < 32;
> }
>
> cmpl $-31, %eax
> jl .L107
> movq -8(%rbp), %rax
> cmpl $31, %eax
> jg .L107
> movl $1, %eax
> jmp .L108
> .L107:
> movl $0, %eax
> .L108:
>
>
> But I figured you can actually do:
>
> abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
>
> Which, if y is a constant, should result in nicer code, and it does for
> x86:
>
> addq $31, %rax
> cmpq $62, %rax
> setbe %al
> movzbl %al, %eax
>
> Just not measurably faster, I suppose because of all the dependencies :/
Ah no, it actually is; I'm an idiot and used 'long' for the return value. If
I use bool we lose that last movzbl and we go from around 4.0 cycles
down to 3.4 cycles.
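FWIW, a tiny stand-alone version of that identity, for reference (the
function name is made up; it assumes the input stays well away from
overflowing a long):

#include <stdbool.h>

static bool abs_lt_32(long s)
{
	/* abs(s) < 32  <=>  (unsigned long)(s + 32 - 1) < 2 * 32 - 1 */
	return (unsigned long)(s + 31) < 63;
}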
Commit-ID: f01415fdbfe83380c2dfcf90b7b26042f88963aa
Gitweb: https://git.kernel.org/tip/f01415fdbfe83380c2dfcf90b7b26042f88963aa
Author: Patrick Bellasi <[email protected]>
AuthorDate: Tue, 5 Dec 2017 17:10:15 +0000
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 10 Jan 2018 11:30:28 +0100
sched/fair: Use 'unsigned long' for utilization, consistently
Utilization and capacity are tracked as 'unsigned long'; however, some
functions using them return an 'int', which is ultimately assigned back to
'unsigned long' variables.
Since there is no reason to use a different, signed type, consolidate the
signature of the functions returning utilization to always use the native
type.
This change improves code consistency, and it also benefits code paths
where utilization values should be clamped, by avoiding further type
conversions or ugly type casts.
Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Chris Redpath <[email protected]>
Reviewed-by: Brendan Jackman <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2915c0d..de43bd8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5765,8 +5765,8 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return affine;
}
-static inline int task_util(struct task_struct *p);
-static int cpu_util_wake(int cpu, struct task_struct *p);
+static inline unsigned long task_util(struct task_struct *p);
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
{
@@ -6247,7 +6247,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* capacity_orig) as it useful for predicting the capacity required after task
* migrations (scheduler-driven DVFS).
*/
-static int cpu_util(int cpu)
+static unsigned long cpu_util(int cpu)
{
unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);
@@ -6255,7 +6255,7 @@ static int cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}
-static inline int task_util(struct task_struct *p)
+static inline unsigned long task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
}
@@ -6264,7 +6264,7 @@ static inline int task_util(struct task_struct *p)
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
*/
-static int cpu_util_wake(int cpu, struct task_struct *p)
+static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
unsigned long util, capacity;