Hi,
This is a respin of [1], still based on today's tip/sched/core [2], which
includes Juri's series [3] to integrate SCHED_DEADLINE into schedutil.
Thanks to everyone who provided feedback; all comments have been addressed.
Testing on Intel and ARM (Android) devices confirms the negligible overheads
and the power/performance benefits reported in the previous posting [1].
Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- renamed util_est's "last" into "enqueued"
- using util_est's "enqueued" for both se and cfs_rqs (Joel)
- update margin check to use more ASM friendly code (Peter)
- optimize EWMA updates (Peter)
- ensure cpu_util_wake() is clamped to cpu_capacity_orig() (Pavan)
- simplify cpu_util_cfs() integration (Dietmar)
Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line
- schedutil integration using Juri's cpu_util_cfs()
- first patch dropped since it's already queued in tip/sched/core
Changes in v2:
- rebased on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
Cheers,
Patrick
.:: References
==============
[1] https://lkml.org/lkml/2018/1/23/645
[2] git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
sched/core (commit 460e8c3340a2)
[3] https://lkml.org/lkml/2017/12/4/173
Patrick Bellasi (3):
sched/fair: add util_est on top of PELT
sched/fair: use util_est in LB and WU paths
sched/cpufreq_schedutil: use util_est for OPP selection
include/linux/sched.h | 16 +++++
kernel/sched/debug.c | 4 ++
kernel/sched/fair.c | 179 ++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 5 ++
kernel/sched/sched.h | 7 +-
5 files changed, 204 insertions(+), 7 deletions(-)
--
2.15.1
The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil can pick the best frequency to run the task.
The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, that CPU can be temporarily represented as almost empty even though
it is running a big task. This opens a race window in which other tasks
can be placed on a CPU which has just started to run a big task that
slept for a relatively long period.
Moreover, the PELT utilization of a task can be updated every [ms], thus
making it a continuously changing value for certain longer-running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful for properly supporting scheduler decisions.
For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.
This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:
util_est(task) = max(task::util_avg, f(task::util_avg@dequeue_times))
This allows us to remember how big a task was reported to be by PELT in
its previous activations, via the function f(task::util_avg@dequeue_times).
If a task changes its behavior and runs even longer in a new activation,
after a certain time its util_est will just track the original PELT
signal (i.e. task::util_avg).
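For reference, the estimator can be sketched in user-space C. The 1/4 EWMA
weight and the signal names mirror the patch, but the standalone helpers
below are illustrative only, not the kernel implementation:

```c
#include <assert.h>

#define UTIL_EST_WEIGHT_SHIFT 2	/* new sample weight w = 1/4 */

/* EWMA update, one sample per activation (taken at dequeue time):
 * ewma(t) = w * sample + (1 - w) * ewma(t-1)
 */
unsigned int ewma_update(unsigned int ewma, unsigned int sample)
{
	int last_ewma_diff = (int)sample - (int)ewma;

	/* same shift-based form as the patch: w * diff + ewma(t-1) */
	return (last_ewma_diff + (ewma << UTIL_EST_WEIGHT_SHIFT))
			>> UTIL_EST_WEIGHT_SHIFT;
}

/* util_est(task) = max(util_avg, f(util_avg@dequeue_times)), where f()
 * is the max between the EWMA and the last dequeue-time sample.
 */
unsigned int task_util_est(unsigned int util_avg,
			   unsigned int ewma, unsigned int enqueued)
{
	unsigned int est = ewma > enqueued ? ewma : enqueued;

	return util_avg > est ? util_avg : est;
}
```

For a task that repeatedly runs at ~400 utilization units, the EWMA
converges towards 400, while a single short activation moves it only by
a quarter of the difference.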
The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil, when looking for the overall CPU utilization
due to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:
util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable)
where:
cfs_rq::util_est_runnable = sum(util_est(task))
for each RUNNABLE task on that root cfs_rq
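A minimal user-space sketch of this bookkeeping (illustrative names,
ignoring locking and the READ/WRITE_ONCE details of the actual patch):

```c
#include <assert.h>

/* per root cfs_rq: running sum of the util_est of RUNNABLE tasks */
struct rq_est {
	unsigned int util_avg;		/* PELT utilization of the cfs_rq */
	unsigned int est_runnable;	/* sum of tasks' util_est */
};

void est_enqueue(struct rq_est *rq, unsigned int task_est)
{
	rq->est_runnable += task_est;
}

void est_dequeue(struct rq_est *rq, unsigned int task_est, int nr_running)
{
	/* if the dequeued task was the last one, the estimated
	 * utilization of the CPU is 0 by definition */
	if (!nr_running) {
		rq->est_runnable = 0;
		return;
	}
	/* never underflow the running sum */
	rq->est_runnable -= task_est < rq->est_runnable ?
			    task_est : rq->est_runnable;
}

/* util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est_runnable) */
unsigned int rq_util_est(const struct rq_est *rq)
{
	return rq->util_avg > rq->est_runnable ?
	       rq->util_avg : rq->est_runnable;
}
```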
It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:
- Tasks: to better support task placement decisions
- root cfs_rqs: to better support both task placement decisions and
  frequency selection
Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- renamed util_est's "last" into "enqueued"
- using util_est's "enqueued" for both se and cfs_rqs (Joel)
- update margin check to use more ASM friendly code (Peter)
- optimize EWMA updates (Peter)
Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line
Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
Change-Id: If5690c05b54bc24e1bcbaad85212656f71ab68a3
---
include/linux/sched.h | 16 ++++++++
kernel/sched/debug.c | 4 ++
kernel/sched/fair.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 5 +++
4 files changed, 122 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 166144c04ef6..0e374d69e431 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -275,6 +275,21 @@ struct load_weight {
u32 inv_weight;
};
+/**
+ * Estimated utilization for FAIR tasks.
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. The sample weight is
+ * chosen so that the EWMA will be relatively insensitive to transient changes
+ * to the task's workload.
+ */
+struct util_est {
+ unsigned int enqueued;
+ unsigned int ewma;
+#define UTIL_EST_WEIGHT_SHIFT 2
+};
+
/*
* The load_avg/util_avg accumulates an infinite geometric series
* (see __update_load_avg() in kernel/sched/fair.c).
@@ -336,6 +351,7 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_load_avg;
unsigned long util_avg;
+ struct util_est util_est;
};
struct sched_statistics {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..d4eb5532ea6b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -567,6 +567,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
+ cfs_rq->avg.util_est.enqueued);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -1018,6 +1020,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_load_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
+ P(se.avg.util_est.ewma);
+ P(se.avg.util_est.enqueued);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b6535987500..118f49c39b60 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5193,6 +5193,20 @@ static inline void hrtick_update(struct rq *rq)
}
#endif
+static inline unsigned long task_util(struct task_struct *p);
+static inline unsigned long _task_util_est(struct task_struct *p);
+
+static inline void util_est_enqueue(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Update root cfs_rq's estimated utilization */
+ cfs_rq->avg.util_est.enqueued += _task_util_est(p);
+}
+
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
@@ -5245,9 +5259,85 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);
+ util_est_enqueue(p);
hrtick_update(rq);
}
+/*
+ * Check if the specified (signed) value is within a specified margin,
+ * based on the observation that:
+ * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
+ */
+static inline bool within_margin(long value, unsigned int margin)
+{
+ return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
+}
+
+static inline void util_est_dequeue(struct task_struct *p, int flags)
+{
+ struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
+ unsigned long util_last;
+ long last_ewma_diff;
+ unsigned long ewma;
+ long util_est = 0;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /*
+ * Update root cfs_rq's estimated utilization
+ *
+ * If *p is the last task then the root cfs_rq's estimated utilization
+ * of a CPU is 0 by definition.
+ */
+ if (cfs_rq->nr_running) {
+ util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ util_est -= min_t(long, util_est, _task_util_est(p));
+ }
+ WRITE_ONCE(cfs_rq->avg.util_est.enqueued, util_est);
+
+ /*
+ * Skip update of task's estimated utilization when the task has not
+ * yet completed an activation, e.g. being migrated.
+ */
+ if (!(flags & DEQUEUE_SLEEP))
+ return;
+
+ ewma = READ_ONCE(p->se.avg.util_est.ewma);
+ util_last = task_util(p);
+
+ /*
+ * Skip update of task's estimated utilization when its EWMA is
+ * already ~1% close to its last activation value.
+ */
+ last_ewma_diff = util_last - ewma;
+ if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
+ return;
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * about the task size. This is done by storing the last PELT value
+ * for this task and using this value to load another sample in the
+ * exponential weighted moving average:
+ *
+ * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
+ * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+ * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
+ * = w * ( last_ewma_diff ) + ewma(t-1)
+ * = w * (last_ewma_diff + ewma(t-1) / w)
+ *
+ * Where 'w' is the weight of new samples, which is configured to be
+ * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
+ */
+ ewma = last_ewma_diff + (ewma << UTIL_EST_WEIGHT_SHIFT);
+ ewma >>= UTIL_EST_WEIGHT_SHIFT;
+
+ WRITE_ONCE(p->se.avg.util_est.ewma, ewma);
+ WRITE_ONCE(p->se.avg.util_est.enqueued, util_last);
+}
+
static void set_next_buddy(struct sched_entity *se);
/*
@@ -5304,6 +5394,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);
+ util_est_dequeue(p, flags);
hrtick_update(rq);
}
@@ -5767,7 +5858,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return affine;
}
-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);
static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6262,6 +6352,12 @@ static inline unsigned long task_util(struct task_struct *p)
return p->se.avg.util_avg;
}
+
+static inline unsigned long _task_util_est(struct task_struct *p)
+{
+ return max(p->se.avg.util_est.ewma, p->se.avg.util_est.enqueued);
+}
+
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..c459a4b61544 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
--
2.15.1
When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.
For example, since the task utilization is decayed at wakeup time, a
long-sleeping big task that is newly enqueued does not immediately add a
significant contribution to the target CPU. This introduces some latency
before schedutil can detect the best frequency required by that task.
Moreover, the PELT signal build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time increases
even more, further delaying the selection of the frequency which best
serves the task's requirements.
To reduce these latencies, this patch integrates the use of the CPU's
estimated utilization in the sugov_get_util() function.
This allows us to properly consider the expected utilization of a CPU
which, for example, has just got a big task running after a long sleep
period. Ultimately, this allows schedutil to select the best frequency
to run a task right after its wake-up.
Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- use util_est.enqueued for cfs_rq's util_est (Joel)
- simplify cpu_util_cfs() integration (Dietmar)
Changes in v3:
- rebase on today's tip/sched/core (commit 07881166a892)
- moved into Juri's cpu_util_cfs(), which should also
address Rafael's suggestion to use a local variable.
Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
Change-Id: I62c01ed90d8ad45b06383be03d39fcf8c9041646
---
kernel/sched/sched.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..f3c7b6a83ef4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2127,7 +2127,12 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
static inline unsigned long cpu_util_cfs(struct rq *rq)
{
- return rq->cfs.avg.util_avg;
+ if (!sched_feat(UTIL_EST))
+ return rq->cfs.avg.util_avg;
+
+ return max_t(unsigned long,
+ rq->cfs.avg.util_avg,
+ rq->cfs.avg.util_est.enqueued);
}
#endif
--
2.15.1
When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.
For example, since the task utilization is decayed at wakeup time, when
a long-sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result, a race window opens where other tasks can be placed on the
same CPU while it is still considered relatively empty.
To reduce these kinds of races, this patch introduces the required
support to integrate the use of the CPU's estimated utilization in
cpu_util_wake() as well as in update_sg_lb_stats().
The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization of the
tasks currently RUNNABLE on that CPU.
This allows us to properly represent the spare capacity of a CPU which,
for example, has just got a big task running after a long sleep period.
Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- ensure cpu_util_wake() is clamped to cpu_capacity_orig() (Pavan)
Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
Change-Id: Id5a38d0e41aae7ca89f021f277851ee4e6ba5112
---
kernel/sched/fair.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 76 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 118f49c39b60..2a2e88bced87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6347,6 +6347,41 @@ static unsigned long cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}
+/**
+ * cpu_util_est: estimated utilization for the specified CPU
+ * @cpu: the CPU to get the estimated utilization for
+ *
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * PELT's utilization and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ *
+ * This allows us to properly represent the expected utilization of a CPU
+ * which has just got a big task running after a long sleep period. At the
+ * same time, however, it preserves the benefits of the "blocked utilization"
+ * in describing the potential for other tasks waking up on the same CPU.
+ *
+ * Return: the estimated utilization for the specified CPU
+ */
+static inline unsigned long cpu_util_est(int cpu)
+{
+ unsigned long util, util_est;
+ unsigned long capacity;
+ struct cfs_rq *cfs_rq;
+
+ if (!sched_feat(UTIL_EST))
+ return cpu_util(cpu);
+
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = cfs_rq->avg.util_avg;
+ util_est = cfs_rq->avg.util_est.enqueued;
+ util_est = max(util, util_est);
+
+ capacity = capacity_orig_of(cpu);
+ util_est = min(util_est, capacity);
+
+ return util_est;
+}
+
static inline unsigned long task_util(struct task_struct *p)
{
return p->se.avg.util_avg;
@@ -6364,16 +6399,52 @@ static inline unsigned long _task_util_est(struct task_struct *p)
*/
static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
- unsigned long util, capacity;
+ unsigned long capacity;
+ long util, util_est;
/* Task has no contribution or is new */
if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
- return cpu_util(cpu);
+ return cpu_util_est(cpu);
+ /* Discount task's blocked util from CPU's util */
+ util = cpu_util(cpu) - task_util(p);
+ util = max(util, 0L);
+
+ if (!sched_feat(UTIL_EST))
+ return util;
+
+ /*
+ * Covered cases:
+ * - if *p is the only task sleeping on this CPU, then:
+ * cpu_util (== task_util) > util_est (== 0)
+ * and thus we return:
+ * cpu_util_wake = (cpu_util - task_util) = 0
+ *
+ * - if other tasks are SLEEPING on the same CPU, which is just waking
+ * up, then:
+ * cpu_util >= task_util
+ * cpu_util > util_est (== 0)
+ * and thus we discount *p's blocked utilization to return:
+ * cpu_util_wake = (cpu_util - task_util) >= 0
+ *
+ * - if other tasks are RUNNABLE on that CPU and
+ * util_est > cpu_util
+ * then we use util_est since it returns a more restrictive
+ * estimation of the spare capacity on that CPU, by just considering
+ * the expected utilization of tasks already runnable on that CPU.
+ */
+ util_est = cpu_rq(cpu)->cfs.avg.util_est.enqueued;
+ util = max(util, util_est);
+
+ /*
+ * Estimated utilization can exceed the CPU capacity, thus let's clamp
+ * to the maximum CPU capacity to ensure consistency with other
+ * cpu_util[_est] calls.
+ */
capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ util = min_t(unsigned long, util, capacity);
- return (util >= capacity) ? capacity : util;
+ return util;
}
/*
@@ -7898,7 +7969,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
load = source_load(i, load_idx);
sgs->group_load += load;
- sgs->group_util += cpu_util(i);
+ sgs->group_util += cpu_util_est(i);
sgs->sum_nr_running += rq->cfs.h_nr_running;
nr_running = rq->nr_running;
--
2.15.1
Mostly nice, I almost applied, except too many nits below.
On Tue, Feb 06, 2018 at 02:41:29PM +0000, Patrick Bellasi wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7b6535987500..118f49c39b60 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5193,6 +5193,20 @@ static inline void hrtick_update(struct rq *rq)
> }
> #endif
>
> +static inline unsigned long task_util(struct task_struct *p);
> +static inline unsigned long _task_util_est(struct task_struct *p);
What's with the leading underscore? I don't see one without it.
> +
> +static inline void util_est_enqueue(struct task_struct *p)
Also pass @rq from enqueue_task_fair() ? I see no point in computing
task_rq(p) if we already have the value around.
> +{
> + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /* Update root cfs_rq's estimated utilization */
> + cfs_rq->avg.util_est.enqueued += _task_util_est(p);
> +}
> +/*
> + * Check if the specified (signed) value is within a specified margin,
> + * based on the observation that:
> + * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
* Note: this only works when x+y < INT_MAX.
> + */
> +static inline bool within_margin(long value, unsigned int margin)
This mixing of long and int is dodgy, do we want to consistently use int
here?
> +{
> + return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
> +}
> +
> +static inline void util_est_dequeue(struct task_struct *p, int flags)
> +{
> + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> + unsigned long util_last;
> + long last_ewma_diff;
> + unsigned long ewma;
> + long util_est = 0;
Why long?
> +
> + if (!sched_feat(UTIL_EST))
> + return;
> +
> + /*
> + * Update root cfs_rq's estimated utilization
> + *
> + * If *p is the last task then the root cfs_rq's estimated utilization
> + * of a CPU is 0 by definition.
> + */
> + if (cfs_rq->nr_running) {
> + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
Because util_est.enqueued is of type 'unsigned int'.
> + util_est -= min_t(long, util_est, _task_util_est(p));
> + }
> + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, util_est);
long to int truncate
> +
> + /*
> + * Skip update of task's estimated utilization when the task has not
> + * yet completed an activation, e.g. being migrated.
> + */
> + if (!(flags & DEQUEUE_SLEEP))
> + return;
> +
> + ewma = READ_ONCE(p->se.avg.util_est.ewma);
> + util_last = task_util(p);
Again, all kinds of long, while the ewma type itself is of 'unsigned
int'.
> +
> + /*
> + * Skip update of task's estimated utilization when its EWMA is
> + * already ~1% close to its last activation value.
> + */
> + last_ewma_diff = util_last - ewma;
> + if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> + return;
> +
> + /*
> + * Update Task's estimated utilization
> + *
> + * When *p completes an activation we can consolidate another sample
> + * about the task size. This is done by storing the last PELT value
> + * for this task and using this value to load another sample in the
> + * exponential weighted moving average:
> + *
> + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> + * = w * ( last_ewma_diff ) + ewma(t-1)
> + * = w * (last_ewma_diff + ewma(t-1) / w)
> + *
> + * Where 'w' is the weight of new samples, which is configured to be
> + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> + */
> + ewma = last_ewma_diff + (ewma << UTIL_EST_WEIGHT_SHIFT);
> + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> +
> + WRITE_ONCE(p->se.avg.util_est.ewma, ewma);
> + WRITE_ONCE(p->se.avg.util_est.enqueued, util_last);
Two stores to that word... can we fix that nicely?
> +}
> +static inline unsigned long _task_util_est(struct task_struct *p)
> +{
> + return max(p->se.avg.util_est.ewma, p->se.avg.util_est.enqueued);
> +}
Aside from the underscore thing I already noted, why is this here and
not where the fwd declaration is?
> +/*
> + * UtilEstimation. Use estimated CPU utilization.
> + */
> +SCHED_FEAT(UTIL_EST, false)
Since you couldn't measure it, do we want it true?
On 06-Feb 16:50, Peter Zijlstra wrote:
>
> Mostly nice, I almost applied, except too many nits below.
:)
Thanks for the really fast and still useful review!
> On Tue, Feb 06, 2018 at 02:41:29PM +0000, Patrick Bellasi wrote:
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7b6535987500..118f49c39b60 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5193,6 +5193,20 @@ static inline void hrtick_update(struct rq *rq)
> > }
> > #endif
> >
> > +static inline unsigned long task_util(struct task_struct *p);
> > +static inline unsigned long _task_util_est(struct task_struct *p);
>
> What's with the leading underscore? I don't see one without it.
Good point, I was actually expecting this question and I should have
added it to the cover letter, sorry.
The reasoning was: the task's estimated utilization is defined as the
max between PELT and the "estimation". Where "estimation" is
the max between EWMA and the last ENQUEUED utilization.
Thus I was envisioning these two calls:
_task_util_est := max(EWMA, ENQUEUED)
task_util_est := max(util_avg, _task_util_est)
but since now we have clients only for the first API, I've not added
the second one. Still I would prefer to keep the "_" to make it clear
that's and util_est's internal signal, not the actual task's estimated
utilization.
Does it make sense?
Do you prefer I just remove the "_" and we will refactor it once we
should add a customer for the proper task's util_est?
> > +
> > +static inline void util_est_enqueue(struct task_struct *p)
>
> Also pass @rq from enqueue_task_fair() ? I see no point in computing
> task_rq(p) if we already have the value around.
You're right, that seems to make sense.
I'll look into it and update if it's really sane.
>
> > +{
> > + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /* Update root cfs_rq's estimated utilization */
> > + cfs_rq->avg.util_est.enqueued += _task_util_est(p);
> > +}
>
>
> > +/*
> > + * Check if the specified (signed) value is within a specified margin,
> > + * based on the observation that:
> > + * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
>
> * Note: this only works when x+y < INT_MAX.
+1
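FWIW, the identity (and its overflow caveat) is easy to sanity check in
user space; a quick illustrative version with plain ints:

```c
#include <assert.h>

/* abs(x) < y  :=  (unsigned)(x + y - 1) < (2 * y - 1)
 * only valid while x + y stays below INT_MAX */
int within_margin(int value, int margin)
{
	return (unsigned int)(value + margin - 1) < (2 * margin - 1);
}
```

Negative values wrap to large unsigned numbers, which is exactly what
makes the single unsigned comparison cover both bounds.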
>
> > + */
> > +static inline bool within_margin(long value, unsigned int margin)
>
> This mixing of long and int is dodgy, do we want to consistently use int
> here?
Right, perhaps better "unsigned int" for both params, isn't it?
> > +{
> > + return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
> > +}
> > +
> > +static inline void util_est_dequeue(struct task_struct *p, int flags)
> > +{
> > + struct cfs_rq *cfs_rq = &task_rq(p)->cfs;
> > + unsigned long util_last;
> > + long last_ewma_diff;
> > + unsigned long ewma;
> > + long util_est = 0;
>
> Why long?
Right, that's because I did not spot the possibility of updating it when
I changed the util_est type... anyway, I'll check better, but likely
we don't need a long range.
> > +
> > + if (!sched_feat(UTIL_EST))
> > + return;
> > +
> > + /*
> > + * Update root cfs_rq's estimated utilization
> > + *
> > + * If *p is the last task then the root cfs_rq's estimated utilization
> > + * of a CPU is 0 by definition.
> > + */
> > + if (cfs_rq->nr_running) {
> > + util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
>
> Because util_est.enqueued is of type 'unsigned int'.
Indeed...
>
> > + util_est -= min_t(long, util_est, _task_util_est(p));
> > + }
> > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, util_est);
>
> long to int truncate
right!
We have util_avg related signals which are all long based, but in the
scope of "utilization" tracking, and specifically for the "util_est"
signals, an int should provide a sufficient range.
> > +
> > + /*
> > + * Skip update of task's estimated utilization when the task has not
> > + * yet completed an activation, e.g. being migrated.
> > + */
> > + if (!(flags & DEQUEUE_SLEEP))
> > + return;
> > +
> > + ewma = READ_ONCE(p->se.avg.util_est.ewma);
> > + util_last = task_util(p);
>
> Again, all kinds of long, while the ewma type itself is of 'unsigned
> int'.
Yes, for utilization should be enough...
>
> > +
> > + /*
> > + * Skip update of task's estimated utilization when its EWMA is
> > + * already ~1% close to its last activation value.
> > + */
> > + last_ewma_diff = util_last - ewma;
> > + if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> > + return;
> > +
> > + /*
> > + * Update Task's estimated utilization
> > + *
> > + * When *p completes an activation we can consolidate another sample
> > + * about the task size. This is done by storing the last PELT value
> > + * for this task and using this value to load another sample in the
> > + * exponential weighted moving average:
> > + *
> > + * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
> > + * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
> > + * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
> > + * = w * ( last_ewma_diff ) + ewma(t-1)
> > + * = w * (last_ewma_diff + ewma(t-1) / w)
> > + *
> > + * Where 'w' is the weight of new samples, which is configured to be
> > + * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
> > + */
> > + ewma = last_ewma_diff + (ewma << UTIL_EST_WEIGHT_SHIFT);
> > + ewma >>= UTIL_EST_WEIGHT_SHIFT;
> > +
> > + WRITE_ONCE(p->se.avg.util_est.ewma, ewma);
> > + WRITE_ONCE(p->se.avg.util_est.enqueued, util_last);
>
> Two stores to that word... can we fix that nicely?
Good point, the single word comes from the goal of fitting into the same
cache line as sched_avg.
I think we can fix it by having a struct util_est on stack and then it
should be possible to update the above code to do:
ue = READ_ONCE(p->se.avg.util_est)
... magic code on ue.{enqueued, ewma} ...
WRITE_ONCE(p->se.avg.util_est, ue);
That should be safe on 32bit builds too, right?
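Something along these lines (an illustrative user-space sketch; in the
kernel the load and store would be READ_ONCE()/WRITE_ONCE() on the whole
64-bit struct):

```c
#include <assert.h>
#include <stdint.h>

#define UTIL_EST_WEIGHT_SHIFT 2

struct util_est {
	uint32_t enqueued;
	uint32_t ewma;
};	/* 64 bits total: a single load/store on 64-bit machines */

void util_est_update(struct util_est *dst, uint32_t util_last)
{
	struct util_est ue = *dst;	/* one load of the whole struct  */

	/* ewma(t) = w * util_last + (1 - w) * ewma(t-1), w = 1/4 */
	ue.ewma = (util_last - ue.ewma + (ue.ewma << UTIL_EST_WEIGHT_SHIFT))
			>> UTIL_EST_WEIGHT_SHIFT;
	ue.enqueued = util_last;

	*dst = ue;			/* one store of the whole struct */
}
```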
> > +}
>
> > +static inline unsigned long _task_util_est(struct task_struct *p)
> > +{
> > + return max(p->se.avg.util_est.ewma, p->se.avg.util_est.enqueued);
> > +}
>
> Aside from the underscore thing I already noted, why is this here and
> not where the fwd declaration is?
Because this is where we already have the definitions of
cpu_util{_est}() and task_util()... the idea is to keep things
together. Does it make sense?
> > +/*
> > + * UtilEstimation. Use estimated CPU utilization.
> > + */
> > +SCHED_FEAT(UTIL_EST, false)
>
> Since you couldn't measure it, do we wants it true?
I'm just a single tester so far; I would be more confident once
someone volunteers to turn this on to give it better coverage.
Moreover, a small out-of-tree patch enabling it for mobile devices is
more than acceptable for the time being ;)
Finally, we are also considering posting a follow-up to enable it via
Kconfig along with a PELT half-life tunable, i.e. using a 16ms instead
of the default 32ms half-life. Do you think this is something that can
fly mainline?
Cheers,
Patrick
--
#include <best/regards.h>
Patrick Bellasi
On Tue, Feb 06, 2018 at 06:33:15PM +0000, Patrick Bellasi wrote:
> Good point, I was actually expecting this question and I should have
> added it to the cover letter, sorry.
>
> The reasoning was: the task's estimated utilization is defined as the
> max between PELT and the "estimation". Where "estimation" is
> the max between EWMA and the last ENQUEUED utilization.
>
> Thus I was envisioning these two calls:
>
> _task_util_est := max(EWMA, ENQUEUED)
> task_util_est := max(util_avg, _task_util_est)
>
> but since now we have clients only for the first API, I've not added
> the second one. Still I would prefer to keep the "_" to make it clear
> that it's a util_est internal signal, not the actual task's estimated
> utilization.
>
> Does it make sense?
>
> Do you prefer I just remove the "_" and we will refactor it once we
> should add a customer for the proper task's util_est?
Hurm... I was thinking:
task_util_est := max(util_avg, EWMA)
But the above mixes ENQUEUED into it.. *confused*.
Also, I think I'm confused by the 'enqueued' name... it makes sense for
the cfs use-case, where we track the sum of all 'enqueued' tasks, but it
doesn't make sense for the task use-case where it tracks task_util() at
dequeue time.
> > > +/*
> > > + * Check if the specified (signed) value is within a specified margin,
> > > + * based on the observation that:
> > > + * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
> > > + */
> > > +static inline bool within_margin(long value, unsigned int margin)
> >
> > This mixing of long and int is dodgy, do we want to consistently use int
> > here?
>
> Right, perhaps better "unsigned int" for both params, isn't?
Can't, must be signed, since @value is supposed to be able to be
negative remember? ;-)
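To make the trick concrete, here is a standalone sketch of the helper with a brute-force check against abs() (the kernel version relies on util values fitting comfortably in 'unsigned int' range, which the cast assumes):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Margin check as posted in the series: a single unsigned compare
 * instead of two signed ones, based on the observation that, for
 * margin > 0 and values within int range:
 *   abs(x) < y  :=  (unsigned)(x + y - 1) < (2 * y - 1)
 * x = -y maps to (unsigned)-1, i.e. a huge value, so both out-of-range
 * directions fail the single compare. */
static inline bool within_margin(long value, unsigned int margin)
{
	return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
}
```

A quick exhaustive check over a small range confirms it matches abs(x) < margin for both signs of x.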
> > > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, util_est);
> > > + WRITE_ONCE(p->se.avg.util_est.enqueued, util_last);
> >
> > Two stores to that word... can we fix that nicely?
>
> Good point, the single word comes from the goal to fit into the same
> cache line of sched_avg.
Its the above two stores I confuzzled to be the same. So n/m all that.
> > > +SCHED_FEAT(UTIL_EST, false)
> >
> > Since you couldn't measure it, do we wants it true?
>
> I'm just a single tester so far, I would be more confident once
> someone volunteer to turn this on to give a better coverage.
Lets just enable it by default, that way its far more likely someone
will complain about it :-), disabling it again is a trivial matter when
needed.
Also, your patch 2/3 have insufficient READ_ONCE()s.
On Tue, Feb 06, 2018 at 08:09:00PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 06:33:15PM +0000, Patrick Bellasi wrote:
>
> > Good point, I was actually expecting this question and I should have
> > added it to the cover letter, sorry.
> >
> > The reasoning was: the task's estimated utilization is defined as the
> > max between PELT and the "estimation". Where "estimation" is
> > the max between EWMA and the last ENQUEUED utilization.
> >
> > Thus I was envisioning these two calls:
> >
> > _task_util_est := max(EWMA, ENQUEUED)
> > task_util_est := max(util_avg, _task_util_est)
> >
> > but since now we have clients only for the first API, I've not added
> > the second one. Still I would prefer to keep the "_" to make it clear
> > that's and util_est's internal signal, not the actual task's estimated
> > utilization.
> >
> > Does it make sense?
> >
> > Do you prefer I just remove the "_" and we will refactor it once we
> > should add a customer for the proper task's util_est?
>
> Hurm... I was thinking:
>
> task_util_est := max(util_avg, EWMA)
>
> But the above mixes ENQUEUED into it.. *confused*.
So mixing in ENQUEUED seems to give it an upward BIAS if the very last
activation was 'high'. Thereby improving ramp-up.
That seems to be what we want.. might be nice to have that in a comment
;-)
I'm thinking we want a different name for max(EWMA, ENQUEUED) though,
but I really can't come up with a sensible suggestion, which I suppose,
is why you stuck an underscore on it and went on with things.
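A userspace sketch of the two helpers under discussion, with the upward-bias rationale as a comment (struct layout and the task_struct indirection are simplified; in the series the helpers take a struct task_struct *p):

```c
#include <assert.h>

#define max(a, b) ((a) > (b) ? (a) : (b))

struct util_est {
	unsigned int enqueued;	/* util_avg snapshot at dequeue time */
	unsigned int ewma;	/* moving average of 'enqueued' samples */
};

struct sched_avg {
	unsigned long	util_avg;
	struct util_est	util_est;
};

/* util_est internal signal: max(EWMA, ENQUEUED).
 * Mixing in ENQUEUED biases the estimate upwards when the very last
 * activation was 'high', thereby improving ramp-up. */
static unsigned long _task_util_est(const struct sched_avg *sa)
{
	return max(sa->util_est.ewma, sa->util_est.enqueued);
}

/* Task's estimated utilization: also track util_avg, so that a task
 * whose current activation runs longer than all previous ones is
 * followed by PELT as soon as util_avg exceeds both estimation terms. */
static unsigned long task_util_est(const struct sched_avg *sa)
{
	return max(sa->util_avg, (unsigned long)_task_util_est(sa));
}
```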
On Tue, Feb 6, 2018 at 3:41 PM, Patrick Bellasi <[email protected]> wrote:
> When schedutil looks at the CPU utilization, the current PELT value for
> that CPU is returned straight away. In certain scenarios this can have
> undesired side effects and delays on frequency selection.
>
> For example, since the task utilization is decayed at wakeup time, a
> long sleeping big task newly enqueued does not add immediately a
> significant contribution to the target CPU. This introduces some latency
> before schedutil will be able to detect the best frequency required by
> that task.
>
> Moreover, the PELT signal build-up time is a function of the current
> frequency, because of the scale invariant load tracking support. Thus,
> starting from a lower frequency, the utilization build-up time will
> increase even more and further delays the selection of the actual
> frequency which better serves the task requirements.
>
> In order to reduce this kind of latencies, we integrate the usage
> of the CPU's estimated utilization in the sugov_get_util function.
> This allows to properly consider the expected utilization of a CPU which,
> for example, has just got a big task running after a long sleep period.
> Ultimately this allows to select the best frequency to run a task
> right after its wake-up.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Paul Turner <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
Acked-by: Rafael J. Wysocki <[email protected]>
> ---
> Changes in v4:
> - rebased on today's tip/sched/core (commit 460e8c3340a2)
> - use util_est.enqueued for cfs_rq's util_est (Joel)
> - simplify cpu_util_cfs() integration (Dietmar)
>
> Changes in v3:
> - rebase on today's tip/sched/core (commit 07881166a892)
> - moved into Juri's cpu_util_cfs(), which should also
> address Rafael's suggestion to use a local variable.
>
> Changes in v2:
> - rebase on top of v4.15-rc2
> - tested that overhauled PELT code does not affect the util_est
>
> Change-Id: I62c01ed90d8ad45b06383be03d39fcf8c9041646
> ---
> kernel/sched/sched.h | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e95505e23c6..f3c7b6a83ef4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2127,7 +2127,12 @@ static inline unsigned long cpu_util_dl(struct rq *rq)
>
> static inline unsigned long cpu_util_cfs(struct rq *rq)
> {
> - return rq->cfs.avg.util_avg;
> + if (!sched_feat(UTIL_EST))
> + return rq->cfs.avg.util_avg;
> +
> + return max_t(unsigned long,
> + rq->cfs.avg.util_avg,
> + rq->cfs.avg.util_est.enqueued);
> }
>
> #endif
> --
> 2.15.1
>
On 06-Feb 20:09, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 06:33:15PM +0000, Patrick Bellasi wrote:
>
> > Good point, I was actually expecting this question and I should have
> > added it to the cover letter, sorry.
> >
> > The reasoning was: the task's estimated utilization is defined as the
> > max between PELT and the "estimation". Where "estimation" is
> > the max between EWMA and the last ENQUEUED utilization.
> >
> > Thus I was envisioning these two calls:
> >
> > _task_util_est := max(EWMA, ENQUEUED)
> > task_util_est := max(util_avg, _task_util_est)
> >
> > but since now we have clients only for the first API, I've not added
> > the second one. Still I would prefer to keep the "_" to make it clear
> > that's and util_est's internal signal, not the actual task's estimated
> > utilization.
> >
> > Does it make sense?
> >
> > Do you prefer I just remove the "_" and we will refactor it once we
> > should add a customer for the proper task's util_est?
>
> Hurm... I was thinking:
>
> task_util_est := max(util_avg, EWMA)
>
> But the above mixes ENQUEUED into it.. *confused*.
>
> Also, I think I'm confused by the 'enqueued' name... it makes sense for
> the cfs use-case, where we track the sum of all 'enqueued' tasks, but it
> doesn't make sense for the task use-case where it tracks task_util() at
> dequeue time.
It can be confusing at first; I've changed the name in this version
while trying to consolidate the concepts in a reasonable way.
Let's see if I can cast some light...
First of all:
a) task_util_est := max(util_avg, EWMA, enqueued)
b) enqueued := util_avg@dequeue time
a) why do we mix "enqueued" in?
I think we need all three components to properly describe these
scenarios:
- a task becoming smaller:
the "EWMA" amortizes the task's utilization reduction, usually
converging to the new lower utilization in ~6 activations, which
should be reasonable at least for some mobile use-cases.
- a task becoming bigger:
the "enqueued" value better represents the task utilization while
the "EWMA" catches up, but... that's true only from the second
longer-running activation onwards.
For the first bigger activation, when the task starts to run
longer than before, at a certain point the util_avg will be
bigger than both "enqueued" and "EWMA", thus task_util_est() should
just track the PELT "util_avg".
Here is a picture showing all these behaviors using a "ramp" task
generated with rt-app:
https://photos.app.goo.gl/kRYgvpS6PTgvu2AM2
b) why enqueued for tasks?
Since v3 it was:
task : util_avg { util_est { last, ewma } }
cpu : cfs_rq { util_est_enqueued }
and Joel noticed that, since util_est{} is now part of util_avg{},
which cfs_rq{} also includes, we can use the same
util_avg{} for cfs_rq{} too.
The problem was that using cfs_rq::util_est::last was not meaningful
to me, hence I tried to find a concept which works well for both
CPUs and tasks.
"Enqueued" IMO seems to work for both, provided you read a task's
enqueued util_est as:
"the util_est of a task which is currently enqueued in its cfs_rq"
which (only) happens to be the util_avg@dequeue time.
IMO, also for tasks, that's no worse than util_est.last...
Maybe it's just a matter of adding a bit of documentation to the
util_est struct definition?
Does that make a bit more sense now?
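For what it's worth, a sketch of the kind of comment I have in mind for the struct definition (the wording is just a draft, not what is in the posted series):

```c
#include <assert.h>

/*
 * struct util_est - estimated utilization of a task (or of a cfs_rq)
 *
 * @enqueued: instantaneous estimated utilization:
 *            - tasks:   util_avg at last dequeue time, i.e. the
 *                       util_est of "a task currently enqueued in
 *                       its cfs_rq"
 *            - cfs_rqs: sum of util_est.enqueued of the tasks
 *                       currently RUNNABLE on the cfs_rq
 * @ewma:     moving average of a task's 'enqueued' samples, which
 *            amortizes utilization reductions across ~6 activations
 *            (not used for cfs_rqs)
 *
 * Both fields are unsigned int so that the whole struct fits in a
 * single 64-bit word within sched_avg's cache line.
 */
struct util_est {
	unsigned int	enqueued;
	unsigned int	ewma;
};
```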
> > > > +/*
> > > > + * Check if the specified (signed) value is within a specified margin,
> > > > + * based on the observation that:
> > > > + * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
> > > > + */
> > > > +static inline bool within_margin(long value, unsigned int margin)
> > >
> > > This mixing of long and int is dodgy, do we want to consistently use int
> > > here?
> >
> > Right, perhaps better "unsigned int" for both params, isn't?
>
> Can't, must be signed, since @value is supposed to be able to be
> negative remember? ;-)
Right... :)
> > > > + WRITE_ONCE(cfs_rq->avg.util_est.enqueued, util_est);
>
> > > > + WRITE_ONCE(p->se.avg.util_est.enqueued, util_last);
> > >
> > > Two stores to that word... can we fix that nicely?
> >
> > Good point, the single word comes from the goal to fit into the same
> > cache line of sched_avg.
>
> Its the above two stores I confuzzled to be the same. So n/m all that.
Ok, fine... however, isn't a single WRITE_ONCE() of the whole "struct
util_est" still better? I haven't checked the generated code...
> > > > +SCHED_FEAT(UTIL_EST, false)
> > >
> > > Since you couldn't measure it, do we wants it true?
> >
> > I'm just a single tester so far, I would be more confident once
> > someone volunteer to turn this on to give a better coverage.
>
> Lets just enable it by default, that way its far more likely someone
> will complain about it :-), disabling it again is a trivial matter when
> needed.
Ok, in the end it's your call... but you're right, this could help to
get it tested better.
Moreover, on low-utilization (desktop-like) workloads we do see
measurable benefits for interactive tasks.
> Also, your patch 2/3 have insufficient READ_ONCE()s.
Damn, I'll look at it...
Cheers Patrick
--
#include <best/regards.h>
Patrick Bellasi
On 06-Feb 20:15, Peter Zijlstra wrote:
> On Tue, Feb 06, 2018 at 08:09:00PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 06, 2018 at 06:33:15PM +0000, Patrick Bellasi wrote:
> >
> > > Good point, I was actually expecting this question and I should have
> > > added it to the cover letter, sorry.
> > >
> > > The reasoning was: the task's estimated utilization is defined as the
> > > max between PELT and the "estimation". Where "estimation" is
> > > the max between EWMA and the last ENQUEUED utilization.
> > >
> > > Thus I was envisioning these two calls:
> > >
> > > _task_util_est := max(EWMA, ENQUEUED)
> > > task_util_est := max(util_avg, _task_util_est)
> > >
> > > but since now we have clients only for the first API, I've not added
> > > the second one. Still I would prefer to keep the "_" to make it clear
> > > that's and util_est's internal signal, not the actual task's estimated
> > > utilization.
> > >
> > > Does it make sense?
> > >
> > > Do you prefer I just remove the "_" and we will refactor it once we
> > > should add a customer for the proper task's util_est?
> >
> > Hurm... I was thinking:
> >
> > task_util_est := max(util_avg, EWMA)
> >
> > But the above mixes ENQUEUED into it.. *confused*.
>
> So mixing in ENQUEUED seems to give it an upward BIAS if the very last
> activation was 'high'. Thereby improving ramp-up.
>
> That seems to be what we want.. might be nice to have that in a comment
> ;-)
Ok, I should have read this one first... it would have spared you a
longer (boring) response to your previous email :/
> I'm thinking we want a different name for max(EWMA, ENQUEUED) though,
> but I really can't come up with a sensible suggestion, which I suppose,
> is why you stuck an underscore on it and went on with things.
The only sensible way I've come up with is to consider that max as a
util_est "internal"... while the actual task_util_est() (without "_")
should consider util_avg as well, to let PELT take over when a task
becomes bigger than in its previous activations.
Is that not reasonable enough?
Potentially I can also add the task_util_est() version... but right
now we have no clients for it... still, gcc should not be upset by an
unused static inline.
--
#include <best/regards.h>
Patrick Bellasi