2018-03-09 09:55:20

by Patrick Bellasi

Subject: [PATCH v6 0/4] Utilization estimation (util_est) for FAIR tasks

Hi, here is an update of [1], based on today's tip/sched/core [2], which mainly
adds some code cleanups suggested by Peter as well as fixes compilation for
!CONFIG_SMP systems.

Most notably:
a) The util_est update flag has been renamed to UTIL_AVG_UNCHANGED, which
better matches its usage.
b) The cpu_util_est() function has been removed to reduce clutter by folding
its code directly into cpu_util(). The latter now always returns the
estimated utilization of a CPU, unless this sched feature is disabled.
c) Unnecessary READ_ONCE() calls have been removed from rq-lock protected code
paths. For util_est variables, which we read/modify/write only from rq-lock
protected code, we keep just the WRITE_ONCE() barriers, which are still
required for synchronization with lockless readers.
The READ_ONCE() calls have instead been kept in all the getter functions,
like for example task_util() and cpu_util(), which can potentially be used
by lockless code, e.g. schedutil or the load balancer.
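
For illustration only, here is a minimal sketch of the access pattern
described in c); the function names are made up, the actual code lives in
util_est_{en,de}queue() and in the task_util()/cpu_util() getters:

static void example_writer(struct cfs_rq *cfs_rq, unsigned int delta)
{
	unsigned int enqueued;

	/* rq lock held: a plain read of util_est is fine here */
	enqueued = cfs_rq->avg.util_est.enqueued;
	enqueued += delta;

	/* ...but the store still has to pair with lockless readers */
	WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

static unsigned int example_reader(struct cfs_rq *cfs_rq)
{
	/* lockless reader (e.g. schedutil): READ_ONCE() is required */
	return READ_ONCE(cfs_rq->avg.util_est.enqueued);
}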

Results on both x86_64 and ARM (Android) targets, which have been collected and
reported in previous postings [1,3], show negligible overheads, especially
compared to the corresponding power/performance benefits on mobile platforms,
where this feature helps to reduce the performance gap between PELT and
other out-of-tree load tracking solutions.

Best,
Patrick


.:: Changelog
=============

Changes in v6:
- remove READ_ONCE from rq-lock protected code paths
- fold cpu_util_est code into cpu_util and update its documentation
- change util_est's update flag name to better match its usage
- slightly clean up cpu_util_wake code
- add other small code cleanups as suggested by Peter
- fix compilation for !CONFIG_SMP systems
- fix documentation to match sphinx syntax
- update changelogs to better match code concepts

Changes in v5:
- rebased on today's tip/sched/core (commit 083c6eeab2cc, based on v4.16-rc2)
- update util_est only on util_avg updates
- add documentation for "struct util_est"
- always use int instead of long whenever possible (Peter)
- pass cfs_rq to util_est_{en,de}queue (Peter)
- pass task_sleep to util_est_dequeue
- use a single WRITE_ONCE at dequeue time
- add some missing {READ,WRITE}_ONCE
- add task_util_est() for code consistency

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- renamed util_est's "last" into "enqueued"
- using util_est's "enqueued" for both se and cfs_rqs (Joel)
- update margin check to use more ASM friendly code (Peter)
- optimize EWMA updates (Peter)
- ensure cpu_util_wake() is clamped to cpu_capacity_orig() (Pavan)
- simplify cpu_util_cfs() integration (Dietmar)

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line
- schedutil integration using Juri's cpu_util_cfs()
- first patch dropped since it's already queued in tip/sched/core

Changes in v2:
- rebased on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est


.:: References
==============
[1] https://lkml.org/lkml/2018/2/22/639
[email protected]
[2] git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git
sched/core (commit 083c6eeab2cc)
[3] https://lkml.org/lkml/2018/1/23/645
[email protected]

Patrick Bellasi (4):
sched/fair: add util_est on top of PELT
sched/fair: use util_est in LB and WU paths
sched/cpufreq_schedutil: use util_est for OPP selection
sched/fair: update util_est only on util_avg updates

include/linux/sched.h | 29 ++++++
kernel/sched/debug.c | 4 +
kernel/sched/fair.c | 237 ++++++++++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 5 +
kernel/sched/sched.h | 9 +-
5 files changed, 264 insertions(+), 20 deletions(-)

--
2.15.1



2018-03-09 09:54:14

by Patrick Bellasi

Subject: [PATCH v6 3/4] sched/cpufreq_schedutil: use util_est for OPP selection

When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.

For example, since the task utilization is decayed at wakeup time, a
long sleeping big task, when newly enqueued, does not immediately add a
significant contribution to the target CPU. This introduces some latency
before schedutil will be able to detect the best frequency required by
that task.

Moreover, the PELT signal build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time will
increase even more, further delaying the selection of the frequency
which best serves the task's requirements.

In order to reduce this kind of latency, we integrate the use of the CPU's
estimated utilization in the sugov_get_util() function.
This allows us to properly consider the expected utilization of a CPU which,
for example, has just got a big task running after a long sleep period.
Ultimately, this allows selecting the best frequency to run a task right
after its wake-up.
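
As a rough, back-of-the-envelope illustration (the 25% margin mirrors
schedutil's get_next_freq(), all numbers are made up): on a 1024-capacity
CPU with fmax = 2GHz, a big task whose util_avg has decayed to ~100 during
a long sleep initially maps to about

	1.25 * 2GHz * 100 / 1024 ~= 244MHz

while its util_est.enqueued of ~700, being visible as soon as the task is
enqueued, maps to about

	1.25 * 2GHz * 700 / 1024 ~= 1.7GHz

i.e. schedutil can jump to a suitable OPP right away instead of waiting for
the PELT signal to build up again.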

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v6:
- small code cleanup suggested by Peter
- add acked-by Viresh tag

Changes in v5:
- add missing READ_ONCE() barrieres
- add acked-by Rafael tag

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- use util_est.enqueued for cfs_rq's util_est (Joel)
- simplify cpu_util_cfs() integration (Dietmar)

Changes in v3:
- rebase on today's tip/sched/core (commit 07881166a892)
- moved into Juri's cpu_util_cfs(), which should also
address Rafael's suggestion to use a local variable.

Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
kernel/sched/sched.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dc6c8b5a24ad..5c459479593b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2122,7 +2122,14 @@ static inline unsigned long cpu_util_dl(struct rq *rq)

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
- return rq->cfs.avg.util_avg;
+ unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);
+
+ if (sched_feat(UTIL_EST)) {
+ util = max_t(unsigned long, util,
+ READ_ONCE(rq->cfs.avg.util_est.enqueued));
+ }
+
+ return util;
}

#endif
--
2.15.1


2018-03-09 09:54:19

by Patrick Bellasi

Subject: [PATCH v6 4/4] sched/fair: update util_est only on util_avg updates

The estimated utilization of a task is currently updated every time the
task is dequeued. However, to keep overheads under control, PELT signals
are effectively updated at maximum once every 1ms.

Thus, for really short running tasks, it can happen that their util_avg
value has not been updated since their last enqueue. If such tasks are
also frequently running tasks (e.g. the kind of workload generated by
hackbench) it can also happen that their util_avg is updated only every
few activations.

This means that updating util_est at every dequeue potentially introduces
unnecessary overhead, and it is also conceptually wrong if the util_avg
signal has never been updated during a task activation.

Let's introduce a throttling mechanism on task's util_est updates
to sync them with util_avg updates. To make the solution memory
efficient, both in terms of space and load/store operations, we encode a
synchronization flag into the LSB of util_est.enqueued.
This makes util_est an even-values-only metric, which is still
considered good enough for its purpose.
The synchronization bit is (re)set by __update_load_avg_se() once the
PELT signal of a task has been updated during its last activation.

Such a throttling mechanism allows us to keep util_est overheads under
control in the wakeup hot path, thus making it suitable to be enabled also
on systems running high-intensity workloads.
The estimated utilization scheduler feature is therefore now switched on by
default.
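
A minimal sketch of the encoding described above (the helper names are made
up, the actual patch open-codes this in util_est_{en,de}queue() and
cfs_se_util_change()):

/* The LSB of util_est.enqueued is reserved for the synchronization flag */
#define UTIL_AVG_UNCHANGED 0x1

/* Dequeue time: store util_avg with the "unchanged" flag set... */
static inline unsigned int util_est_flag_set(unsigned int util)
{
	return util | UTIL_AVG_UNCHANGED;
}

/* ...__update_load_avg_se() clears it once util_avg is updated again */
static inline unsigned int util_est_flag_clear(unsigned int enqueued)
{
	return enqueued & ~UTIL_AVG_UNCHANGED;
}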

Suggested-by: Chris Redpath <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]

---
Changes in v6:
- remove READ_ONCE from rq-lock protected code paths
- change flag name to better match its meaning
- fix compilation for !CONFIG_SMP systems
- add missing CCs in the changelog

Changes in v5:
- set SCHED_FEAT(UTIL_EST, true) as default (Peter)
---
kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 2 +-
2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cf4aa39a6ca..a52868b07850 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3257,6 +3257,32 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
sa->util_avg = sa->util_sum / divider;
}

+/*
+ * When a task is dequeued, its estimated utilization should not be updated if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_AVG_UNCHANGED 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Avoid store if the flag has been already set */
+ enqueued = avg->util_est.enqueued;
+ if (!(enqueued & UTIL_AVG_UNCHANGED))
+ return;
+
+ /* Reset flag to report util_avg has been updated */
+ enqueued &= ~UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(avg->util_est.enqueued, enqueued);
+}
+
/*
* sched_entity:
*
@@ -3308,6 +3334,7 @@ __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entit
cfs_rq->curr == se)) {

___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ cfs_se_util_change(&se->avg);
return 1;
}

@@ -3908,7 +3935,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,

/* Update root cfs_rq's estimated utilization */
enqueued = cfs_rq->avg.util_est.enqueued;
- enqueued += _task_util_est(p);
+ enqueued += (_task_util_est(p) | UTIL_AVG_UNCHANGED);
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

@@ -3943,7 +3970,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
if (cfs_rq->nr_running) {
ue.enqueued = cfs_rq->avg.util_est.enqueued;
ue.enqueued -= min_t(unsigned int, ue.enqueued,
- _task_util_est(p));
+ (_task_util_est(p) | UTIL_AVG_UNCHANGED));
}
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

@@ -3954,12 +3981,19 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
if (!task_sleep)
return;

+ /*
+ * If the PELT values haven't changed since enqueue time,
+ * skip the util_est update.
+ */
+ ue = p->se.avg.util_est;
+ if (ue.enqueued & UTIL_AVG_UNCHANGED)
+ return;
+
/*
* Skip update of task's estimated utilization when its EWMA is
* already ~1% close to its last activation value.
*/
- ue = p->se.avg.util_est;
- ue.enqueued = task_util(p);
+ ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
last_ewma_diff = ue.enqueued - ue.ewma;
if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
return;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c459a4b61544..85ae8488039c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,4 +89,4 @@ SCHED_FEAT(WA_BIAS, true)
/*
* UtilEstimation. Use estimated CPU utilization.
*/
-SCHED_FEAT(UTIL_EST, false)
+SCHED_FEAT(UTIL_EST, true)
--
2.15.1


2018-03-09 09:54:53

by Patrick Bellasi

Subject: [PATCH v6 2/4] sched/fair: use util_est in LB and WU paths

When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed
on the same CPU while it is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to integrate the usage of the CPU's estimated utilization
in the wakeup path, via cpu_util_wake(), as well as in the load-balance
path, via cpu_util() which is used by update_sg_lb_stats().

The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization (at
previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
This allows us to properly represent the spare capacity of a CPU which, for
example, has just got a big task running after a long sleep period.
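
As a worked example (all numbers are illustrative): consider a 1024-capacity
CPU whose cfs_rq util_avg is only 150 because a big task has just woken up
from a long sleep. If the estimated utilization of the tasks RUNNABLE on
that CPU sums up to 700, the CPU utilization is reported as

	max(150, 700) = 700

so the wakeup and load-balance paths see little spare capacity, instead of
an almost empty CPU.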

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v6:
- folded cpu_util_est code into cpu_util
- updated cpu_util documentation
- slightly cleaned up cpu_util_wake code
- update changelog to better match code concepts

Changes in v5:
- always use int instead of long whenever possible (Peter)
- add missing READ_ONCE barriers (Peter)

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- ensure cpu_util_wake() is clamped to cpu_capacity_orig() (Pavan)

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)

Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
kernel/sched/fair.c | 84 ++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 70 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c54560829a52..5cf4aa39a6ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6431,11 +6431,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target;
}

-/*
- * cpu_util returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must be the one of capacity so we can
- * compare the utilization with the capacity of the CPU that is available for
- * CFS task (ie cpu_capacity).
+/**
+ * Amount of capacity of a CPU that is (estimated to be) used by CFS tasks
+ * @cpu: the CPU to get the utilization of
+ *
+ * The unit of the return value must be the one of capacity so we can compare
+ * the utilization with the capacity of the CPU that is available for CFS task
+ * (ie cpu_capacity).
*
* cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
* recent utilization of currently non-runnable tasks on a CPU. It represents
@@ -6446,6 +6448,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* current capacity (capacity_curr <= capacity_orig) of the CPU because it is
* the running time on this CPU scaled by capacity_curr.
*
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * cfs_rq.avg.util_avg and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ * This allows to properly represent the expected utilization of a CPU which
+ * has just got a big task running since a long sleep period. At the same time
+ * however it preserves the benefits of the "blocked utilization" in
+ * describing the potential for other tasks waking up on the same CPU.
+ *
* Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
* higher than capacity_orig because of unfortunate rounding in
* cfs.avg.util_avg or just after migrating tasks and new task wakeups until
@@ -6456,13 +6466,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* available capacity. We allow utilization to overshoot capacity_curr (but not
* capacity_orig) as it useful for predicting the capacity required after task
* migrations (scheduler-driven DVFS).
+ *
+ * Return: the (estimated) utilization for the specified CPU
*/
-static unsigned long cpu_util(int cpu)
+static inline unsigned long cpu_util(int cpu)
{
- unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
- unsigned long capacity = capacity_orig_of(cpu);
+ struct cfs_rq *cfs_rq;
+ unsigned int util;
+
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = READ_ONCE(cfs_rq->avg.util_avg);
+
+ if (sched_feat(UTIL_EST))
+ util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

- return (util >= capacity) ? capacity : util;
+ return min_t(unsigned long, util, capacity_orig_of(cpu));
}

/*
@@ -6471,16 +6489,54 @@ static unsigned long cpu_util(int cpu)
*/
static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
- unsigned long util, capacity;
+ struct cfs_rq *cfs_rq;
+ unsigned int util;

/* Task has no contribution or is new */
- if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+ if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
return cpu_util(cpu);

- capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = READ_ONCE(cfs_rq->avg.util_avg);
+
+ /* Discount task's blocked util from CPU's util */
+ util -= min_t(unsigned int, util, task_util(p));

- return (util >= capacity) ? capacity : util;
+ /*
+ * Covered cases:
+ *
+ * a) if *p is the only task sleeping on this CPU, then:
+ * cpu_util (== task_util) > util_est (== 0)
+ * and thus we return:
+ * cpu_util_wake = (cpu_util - task_util) = 0
+ *
+ * b) if other tasks are SLEEPING on this CPU, which is now exiting
+ * IDLE, then:
+ * cpu_util >= task_util
+ * cpu_util > util_est (== 0)
+ * and thus we discount *p's blocked utilization to return:
+ * cpu_util_wake = (cpu_util - task_util) >= 0
+ *
+ * c) if other tasks are RUNNABLE on that CPU and
+ * util_est > cpu_util
+ * then we use util_est since it returns a more restrictive
+ * estimation of the spare capacity on that CPU, by just
+ * considering the expected utilization of tasks already
+ * runnable on that CPU.
+ *
+ * Cases a) and b) are covered by the above code, while case c) is
+ * covered by the following code when estimated utilization is
+ * enabled.
+ */
+ if (sched_feat(UTIL_EST))
+ util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
+
+ /*
+ * Utilization (estimated) can exceed the CPU capacity, thus let's
+ * clamp to the maximum CPU capacity to ensure consistency with
+ * the cpu_util call.
+ */
+ return min_t(unsigned long, util, capacity_orig_of(cpu));
}

/*
--
2.15.1


2018-03-09 09:55:17

by Patrick Bellasi

Subject: [PATCH v6 1/4] sched/fair: add util_est on top of PELT

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, a CPU which is actually running a big task can be temporarily
represented as almost empty. This leads to a race condition where other
tasks can potentially be allocated on a CPU which has just started to run
a big task that slept for a relatively long period.

Moreover, the PELT utilization of a task can be updated every millisecond, thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))

This allows us to remember how big a task has been reported by PELT in its
previous activations via f(task::util_avg@dequeue), which is the new
_task_util_est(struct task_struct*) function added by this patch.

If a task should change its behavior and it runs longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).

The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil, when looking for the overall CPU utilization
due to FAIR tasks.
For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)

where:

cfs_rq::util_est::enqueued = sum(_task_util_est(task))
for each RUNNABLE task on that root cfs_rq
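
As a worked example (all numbers are illustrative, the 1/4 weight comes from
UTIL_EST_WEIGHT_SHIFT = 2 in the patch below): a task whose util_est.ewma is
400 and which is dequeued with util_avg = 600 updates its EWMA to

	400 + (600 - 400) / 4 = 450

while util_est.enqueued is set to 600; _task_util_est() then reports
max(450, 600) = 600, and that is the value added to the root cfs_rq's
util_est.enqueued at the task's next enqueue.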

It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:
- Tasks: to better support tasks placement decisions
- root cfs_rqs: to better support both tasks placement decisions as
well as frequencies selection

Signed-off-by: Patrick Bellasi <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: [email protected]
Cc: [email protected]

---
Changes in v6:
- remove READ_ONCE from rq-lock protected code paths
- fix compilation for !CONFIG_SMP systems
- fix documentation to match sphinx syntax
- update changelog to better match code concepts

Changes in v5:
- add documentation for "struct util_est"
- always use int instead of long whenever possible (Peter)
- pass cfs_rq to util_est_{en,de}queue (Peter)
- pass task_sleep to util_est_dequeue
- use singe WRITE_ONCE at dequeue time
- add some other missing {READ,WRITE}_ONCE
- add task_util_est() for code consistency

Changes in v4:
- rebased on today's tip/sched/core (commit 460e8c3340a2)
- renamed util_est's "last" into "enqueued"
- using util_est's "enqueued" for both se and cfs_rqs (Joel)
- update margin check to use more ASM friendly code (Peter)
- optimize EWMA updates (Peter)

Changes in v3:
- rebased on today's tip/sched/core (commit 07881166a892)
- moved util_est into sched_avg (Peter)
- use {READ,WRITE}_ONCE() for EWMA updates (Peter)
- using unsigned int to fit all sched_avg into a single 64B cache line

Changes in v2:
- rebase on top of v4.15-rc2
- tested that overhauled PELT code does not affect the util_est
---
include/linux/sched.h | 29 ++++++++++++
kernel/sched/debug.c | 4 ++
kernel/sched/fair.c | 121 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 5 ++
4 files changed, 153 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b161ef8a902e..82c42e0b5dd0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -275,6 +275,34 @@ struct load_weight {
u32 inv_weight;
};

+/**
+ * struct util_est - Estimated utilization of FAIR tasks
+ * @enqueued: instantaneous estimated utilization of a task/cpu
+ * @ewma: the Exponential Weighted Moving Average (EWMA)
+ * utilization of a task
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is chosen
+ * so that the EWMA will be relatively insensitive to transient changes to the
+ * task's workload.
+ *
+ * The enqueued attribute has a slightly different meaning for tasks and cpus:
+ * - task: the task's util_avg at last task dequeue time
+ * - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
+ * Thus, the util_est.enqueued of a task represents the contribution on the
+ * estimated utilization of the CPU where that task is currently enqueued.
+ *
+ * Only for tasks we track a moving average of the past instantaneous
+ * estimated utilization. This allows to absorb sporadic drops in utilization
+ * of an otherwise almost periodic task.
+ */
+struct util_est {
+ unsigned int enqueued;
+ unsigned int ewma;
+#define UTIL_EST_WEIGHT_SHIFT 2
+};
+
/*
* The load_avg/util_avg accumulates an infinite geometric series
* (see __update_load_avg() in kernel/sched/fair.c).
@@ -336,6 +364,7 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_load_avg;
unsigned long util_avg;
+ struct util_est util_est;
};

struct sched_statistics {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1ca0130ed4f9..d4eb5532ea6b 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -567,6 +567,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
+ cfs_rq->avg.util_est.enqueued);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -1018,6 +1020,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_load_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
+ P(se.avg.util_est.ewma);
+ P(se.avg.util_est.enqueued);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e1febd252a84..c54560829a52 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3881,6 +3881,112 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)

static int idle_balance(struct rq *this_rq, struct rq_flags *rf);

+static inline unsigned long task_util(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_avg);
+}
+
+static inline unsigned long _task_util_est(struct task_struct *p)
+{
+ struct util_est ue = READ_ONCE(p->se.avg.util_est);
+
+ return max(ue.ewma, ue.enqueued);
+}
+
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+ return max(READ_ONCE(p->se.avg.util_avg), _task_util_est(p));
+}
+
+static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
+ struct task_struct *p)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Update root cfs_rq's estimated utilization */
+ enqueued = cfs_rq->avg.util_est.enqueued;
+ enqueued += _task_util_est(p);
+ WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+}
+
+/*
+ * Check if a (signed) value is within a specified (unsigned) margin,
+ * based on the observation that:
+ * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
+ *
+ * NOTE: this only works when value + margin < INT_MAX.
+ */
+static inline bool within_margin(int value, int margin)
+{
+ return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
+}
+
+static void
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+{
+ long last_ewma_diff;
+ struct util_est ue;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /*
+ * Update root cfs_rq's estimated utilization
+ *
+ * If *p is the last task then the root cfs_rq's estimated utilization
+ * of a CPU is 0 by definition.
+ */
+ ue.enqueued = 0;
+ if (cfs_rq->nr_running) {
+ ue.enqueued = cfs_rq->avg.util_est.enqueued;
+ ue.enqueued -= min_t(unsigned int, ue.enqueued,
+ _task_util_est(p));
+ }
+ WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
+
+ /*
+ * Skip update of task's estimated utilization when the task has not
+ * yet completed an activation, e.g. being migrated.
+ */
+ if (!task_sleep)
+ return;
+
+ /*
+ * Skip update of task's estimated utilization when its EWMA is
+ * already ~1% close to its last activation value.
+ */
+ ue = p->se.avg.util_est;
+ ue.enqueued = task_util(p);
+ last_ewma_diff = ue.enqueued - ue.ewma;
+ if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
+ return;
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * of the task size. This is done by storing the current PELT value
+ * as ue.enqueued and by using this value to update the Exponential
+ * Weighted Moving Average (EWMA):
+ *
+ * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
+ * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+ * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
+ * = w * ( last_ewma_diff ) + ewma(t-1)
+ * = w * (last_ewma_diff + ewma(t-1) / w)
+ *
+ * Where 'w' is the weight of new samples, which is configured to be
+ * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
+ */
+ ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
+ ue.ewma += last_ewma_diff;
+ ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
+ WRITE_ONCE(p->se.avg.util_est, ue);
+}
+
#else /* CONFIG_SMP */

static inline int
@@ -3910,6 +4016,13 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

+static inline void
+util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
+static inline void
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
+ bool task_sleep) {}
+
#endif /* CONFIG_SMP */

static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -5257,6 +5370,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);

+ util_est_enqueue(&rq->cfs, p);
hrtick_update(rq);
}

@@ -5316,6 +5430,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);

+ util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}

@@ -5834,7 +5949,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return target;
}

-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);

static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6351,11 +6465,6 @@ static unsigned long cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}

-static inline unsigned long task_util(struct task_struct *p)
-{
- return p->se.avg.util_avg;
-}
-
/*
* cpu_util_wake: Compute cpu utilization with any contributions from
* the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..c459a4b61544 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)
--
2.15.1


2018-03-16 13:27:02

by Peter Zijlstra

Subject: Re: [PATCH v6 0/4] Utilization estimation (util_est) for FAIR tasks

On Fri, Mar 09, 2018 at 09:52:41AM +0000, Patrick Bellasi wrote:
> Hi, here is an update of [1], based on today's tip/sched/core [2], which mainly
> adds some code cleanups suggested by Peter as well as fixes compilation for
> !CONFIG_SMP systems.
>
> Most notably:
> a) The util_est update flag has been renamed to UTIL_AVG_UNCHANGED, which
> better matches its usage.
> b) The cpu_util_est() function has been removed to reduce clutter by folding
> its code directly into cpu_util(). The latter now always returns the
> estimated utilization of a CPU, unless this sched feature is disabled.
> c) Unnecessary READ_ONCE() calls have been removed from rq-lock protected code
> paths. For util_est variables, which we read/modify/write only from rq-lock
> protected code, we keep just the WRITE_ONCE() barriers, which are still
> required for synchronization with lockless readers.
> The READ_ONCE() calls have instead been kept in all the getter functions,
> like for example task_util() and cpu_util(), which can potentially be used
> by lockless code, e.g. schedutil or the load balancer.
>
> Results on both x86_64 and ARM (Android) targets, which have been collected and
> reported in previous postings [1,3], show negligible overheads, especially
> compared to the corresponding power/performance benefits on mobile platforms,
> where this feature helps to reduce the performance gap between PELT and
> other out-of-tree load tracking solutions.

Thanks Patrick!

Subject: [tip:sched/core] sched/fair: Add util_est on top of PELT

Commit-ID: 7f65ea42eb00bc902f1c37a71e984e4f4064cfa9
Gitweb: https://git.kernel.org/tip/7f65ea42eb00bc902f1c37a71e984e4f4064cfa9
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 9 Mar 2018 09:52:42 +0000
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 20 Mar 2018 08:11:06 +0100

sched/fair: Add util_est on top of PELT

The util_avg signal computed by PELT is too variable for some use-cases.
For example, a big task waking up after a long sleep period will have its
utilization almost completely decayed. This introduces some latency before
schedutil will be able to pick the best frequency to run a task.

The same issue can affect task placement. Indeed, since the task
utilization is already decayed at wakeup, when the task is enqueued on a
CPU, a CPU which is actually running a big task can be temporarily
represented as almost empty. This leads to a race condition where other
tasks can potentially be allocated on a CPU which has just started to run
a big task that slept for a relatively long period.

Moreover, the PELT utilization of a task can be updated every millisecond, thus
making it a continuously changing value for certain longer running
tasks. This means that the instantaneous PELT utilization of a RUNNING
task is not really meaningful to properly support scheduler decisions.

For all these reasons, a more stable signal can do a better job of
representing the expected/estimated utilization of a task/cfs_rq.
Such a signal can be easily created on top of PELT by still using it as
an estimator which produces values to be aggregated on meaningful
events.

This patch adds a simple implementation of util_est, a new signal built on
top of PELT's util_avg where:

util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))

This allows us to remember how big a task has been reported by PELT in its
previous activations via f(task::util_avg@dequeue), which is the new
_task_util_est(struct task_struct*) function added by this patch.

If a task should change its behavior and it runs longer in a new
activation, after a certain time its util_est will just track the
original PELT signal (i.e. task::util_avg).

The estimated utilization of a cfs_rq is defined only for root ones.
That's because the only sensible consumers of this signal are the
scheduler and schedutil, when looking for the overall CPU utilization
due to FAIR tasks.

For this reason, the estimated utilization of a root cfs_rq is simply
defined as:

util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)

where:

cfs_rq::util_est::enqueued = sum(_task_util_est(task))
for each RUNNABLE task on that root cfs_rq

It's worth noting that the estimated utilization is tracked only for
objects of interest, specifically:

- Tasks: to better support tasks placement decisions
- root cfs_rqs: to better support both tasks placement decisions as
well as frequencies selection

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 29 ++++++++++++
kernel/sched/debug.c | 4 ++
kernel/sched/fair.c | 122 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 5 ++
4 files changed, 154 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 21b1168da951..f228c6033832 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -274,6 +274,34 @@ struct load_weight {
u32 inv_weight;
};

+/**
+ * struct util_est - Estimated utilization of FAIR tasks
+ * @enqueued: instantaneous estimated utilization of a task/cpu
+ * @ewma: the Exponential Weighted Moving Average (EWMA)
+ * utilization of a task
+ *
+ * Support data structure to track an Exponential Weighted Moving Average
+ * (EWMA) of a FAIR task's utilization. New samples are added to the moving
+ * average each time a task completes an activation. Sample's weight is chosen
+ * so that the EWMA will be relatively insensitive to transient changes to the
+ * task's workload.
+ *
+ * The enqueued attribute has a slightly different meaning for tasks and cpus:
+ * - task: the task's util_avg at last task dequeue time
+ * - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
+ * Thus, the util_est.enqueued of a task represents the contribution on the
+ * estimated utilization of the CPU where that task is currently enqueued.
+ *
+ * Only for tasks we track a moving average of the past instantaneous
+ * estimated utilization. This allows to absorb sporadic drops in utilization
+ * of an otherwise almost periodic task.
+ */
+struct util_est {
+ unsigned int enqueued;
+ unsigned int ewma;
+#define UTIL_EST_WEIGHT_SHIFT 2
+};
+
/*
* The load_avg/util_avg accumulates an infinite geometric series
* (see __update_load_avg() in kernel/sched/fair.c).
@@ -335,6 +363,7 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_load_avg;
unsigned long util_avg;
+ struct util_est util_est;
};

struct sched_statistics {
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 644d9a464380..332303be4beb 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -541,6 +541,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
+ SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
+ cfs_rq->avg.util_est.enqueued);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -989,6 +991,8 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_load_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
+ P(se.avg.util_est.ewma);
+ P(se.avg.util_est.enqueued);
#endif
P(policy);
P(prio);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3582117e1580..22b59a7facd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3873,6 +3873,113 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)

static int idle_balance(struct rq *this_rq, struct rq_flags *rf);

+static inline unsigned long task_util(struct task_struct *p)
+{
+ return READ_ONCE(p->se.avg.util_avg);
+}
+
+static inline unsigned long _task_util_est(struct task_struct *p)
+{
+ struct util_est ue = READ_ONCE(p->se.avg.util_est);
+
+ return max(ue.ewma, ue.enqueued);
+}
+
+static inline unsigned long task_util_est(struct task_struct *p)
+{
+ return max(task_util(p), _task_util_est(p));
+}
+
+static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
+ struct task_struct *p)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Update root cfs_rq's estimated utilization */
+ enqueued = cfs_rq->avg.util_est.enqueued;
+ enqueued += _task_util_est(p);
+ WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+}
+
+/*
+ * Check if a (signed) value is within a specified (unsigned) margin,
+ * based on the observation that:
+ *
+ * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
+ *
+ * NOTE: this only works when value + margin < INT_MAX.
+ */
+static inline bool within_margin(int value, int margin)
+{
+ return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
+}
+
+static void
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
+{
+ long last_ewma_diff;
+ struct util_est ue;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /*
+ * Update root cfs_rq's estimated utilization
+ *
+ * If *p is the last task then the root cfs_rq's estimated utilization
+ * of a CPU is 0 by definition.
+ */
+ ue.enqueued = 0;
+ if (cfs_rq->nr_running) {
+ ue.enqueued = cfs_rq->avg.util_est.enqueued;
+ ue.enqueued -= min_t(unsigned int, ue.enqueued,
+ _task_util_est(p));
+ }
+ WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);
+
+ /*
+ * Skip update of task's estimated utilization when the task has not
+ * yet completed an activation, e.g. being migrated.
+ */
+ if (!task_sleep)
+ return;
+
+ /*
+ * Skip update of task's estimated utilization when its EWMA is
+ * already ~1% close to its last activation value.
+ */
+ ue = p->se.avg.util_est;
+ ue.enqueued = task_util(p);
+ last_ewma_diff = ue.enqueued - ue.ewma;
+ if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
+ return;
+
+ /*
+ * Update Task's estimated utilization
+ *
+ * When *p completes an activation we can consolidate another sample
+ * of the task size. This is done by storing the current PELT value
+ * as ue.enqueued and by using this value to update the Exponential
+ * Weighted Moving Average (EWMA):
+ *
+ * ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
+ * = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
+ * = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
+ * = w * ( last_ewma_diff ) + ewma(t-1)
+ * = w * (last_ewma_diff + ewma(t-1) / w)
+ *
+ * Where 'w' is the weight of new samples, which is configured to be
+ * 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
+ */
+ ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
+ ue.ewma += last_ewma_diff;
+ ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
+ WRITE_ONCE(p->se.avg.util_est, ue);
+}
+
#else /* CONFIG_SMP */

static inline int
@@ -3902,6 +4009,13 @@ static inline int idle_balance(struct rq *rq, struct rq_flags *rf)
return 0;
}

+static inline void
+util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
+
+static inline void
+util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
+ bool task_sleep) {}
+
#endif /* CONFIG_SMP */

static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
@@ -5249,6 +5363,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);

+ util_est_enqueue(&rq->cfs, p);
hrtick_update(rq);
}

@@ -5308,6 +5423,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
sub_nr_running(rq, 1);

+ util_est_dequeue(&rq->cfs, p, task_sleep);
hrtick_update(rq);
}

@@ -5835,7 +5951,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
return target;
}

-static inline unsigned long task_util(struct task_struct *p);
static unsigned long cpu_util_wake(int cpu, struct task_struct *p);

static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -6351,11 +6466,6 @@ static unsigned long cpu_util(int cpu)
return (util >= capacity) ? capacity : util;
}

-static inline unsigned long task_util(struct task_struct *p)
-{
- return p->se.avg.util_avg;
-}
-
/*
* cpu_util_wake: Compute CPU utilization with any contributions from
* the waking task p removed.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9552fd5854bf..c459a4b61544 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -85,3 +85,8 @@ SCHED_FEAT(ATTACH_AGE_LOAD, true)
SCHED_FEAT(WA_IDLE, true)
SCHED_FEAT(WA_WEIGHT, true)
SCHED_FEAT(WA_BIAS, true)
+
+/*
+ * UtilEstimation. Use estimated CPU utilization.
+ */
+SCHED_FEAT(UTIL_EST, false)

Subject: [tip:sched/core] sched/fair: Use util_est in LB and WU paths

Commit-ID: f9be3e5961c5554879a491961187472e923f5ee0
Gitweb: https://git.kernel.org/tip/f9be3e5961c5554879a491961187472e923f5ee0
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 9 Mar 2018 09:52:43 +0000
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 20 Mar 2018 08:11:07 +0100

sched/fair: Use util_est in LB and WU paths

When the scheduler looks at the CPU utilization, the current PELT value
for a CPU is returned straight away. In certain scenarios this can have
undesired side effects on task placement.

For example, since the task utilization is decayed at wakeup time, when
a long sleeping big task is enqueued it does not immediately add a
significant contribution to the target CPU.
As a result we generate a race condition where other tasks can be placed
on the same CPU while it is still considered relatively empty.

In order to reduce this kind of race condition, this patch introduces the
required support to integrate the usage of the CPU's estimated utilization
in the wakeup path, via cpu_util_wake(), as well as in the load-balance
path, via cpu_util() which is used by update_sg_lb_stats().

The estimated utilization of a CPU is defined to be the maximum between
its PELT's utilization and the sum of the estimated utilization (at
previous dequeue time) of all the tasks currently RUNNABLE on that CPU.
This allows us to properly represent the spare capacity of a CPU which, for
example, has just got a big task running after a long sleep period.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 84 ++++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 70 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22b59a7facd2..570b8d056282 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6432,11 +6432,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
return target;
}

-/*
- * cpu_util returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must be the one of capacity so we can
- * compare the utilization with the capacity of the CPU that is available for
- * CFS task (ie cpu_capacity).
+/**
+ * Amount of capacity of a CPU that is (estimated to be) used by CFS tasks
+ * @cpu: the CPU to get the utilization of
+ *
+ * The unit of the return value must be the one of capacity so we can compare
+ * the utilization with the capacity of the CPU that is available for CFS task
+ * (ie cpu_capacity).
*
* cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
* recent utilization of currently non-runnable tasks on a CPU. It represents
@@ -6447,6 +6449,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* current capacity (capacity_curr <= capacity_orig) of the CPU because it is
* the running time on this CPU scaled by capacity_curr.
*
+ * The estimated utilization of a CPU is defined to be the maximum between its
+ * cfs_rq.avg.util_avg and the sum of the estimated utilization of the tasks
+ * currently RUNNABLE on that CPU.
+ * This allows to properly represent the expected utilization of a CPU which
+ * has just got a big task running since a long sleep period. At the same time
+ * however it preserves the benefits of the "blocked utilization" in
+ * describing the potential for other tasks waking up on the same CPU.
+ *
* Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
* higher than capacity_orig because of unfortunate rounding in
* cfs.avg.util_avg or just after migrating tasks and new task wakeups until
@@ -6457,13 +6467,21 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* available capacity. We allow utilization to overshoot capacity_curr (but not
* capacity_orig) as it useful for predicting the capacity required after task
* migrations (scheduler-driven DVFS).
+ *
+ * Return: the (estimated) utilization for the specified CPU
*/
-static unsigned long cpu_util(int cpu)
+static inline unsigned long cpu_util(int cpu)
{
- unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
- unsigned long capacity = capacity_orig_of(cpu);
+ struct cfs_rq *cfs_rq;
+ unsigned int util;
+
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = READ_ONCE(cfs_rq->avg.util_avg);
+
+ if (sched_feat(UTIL_EST))
+ util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

- return (util >= capacity) ? capacity : util;
+ return min_t(unsigned long, util, capacity_orig_of(cpu));
}

/*
@@ -6472,16 +6490,54 @@ static unsigned long cpu_util(int cpu)
*/
static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
{
- unsigned long util, capacity;
+ struct cfs_rq *cfs_rq;
+ unsigned int util;

/* Task has no contribution or is new */
- if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+ if (cpu != task_cpu(p) || !READ_ONCE(p->se.avg.last_update_time))
return cpu_util(cpu);

- capacity = capacity_orig_of(cpu);
- util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+ cfs_rq = &cpu_rq(cpu)->cfs;
+ util = READ_ONCE(cfs_rq->avg.util_avg);
+
+ /* Discount task's blocked util from CPU's util */
+ util -= min_t(unsigned int, util, task_util(p));

- return (util >= capacity) ? capacity : util;
+ /*
+ * Covered cases:
+ *
+ * a) if *p is the only task sleeping on this CPU, then:
+ * cpu_util (== task_util) > util_est (== 0)
+ * and thus we return:
+ * cpu_util_wake = (cpu_util - task_util) = 0
+ *
+ * b) if other tasks are SLEEPING on this CPU, which is now exiting
+ * IDLE, then:
+ * cpu_util >= task_util
+ * cpu_util > util_est (== 0)
+ * and thus we discount *p's blocked utilization to return:
+ * cpu_util_wake = (cpu_util - task_util) >= 0
+ *
+ * c) if other tasks are RUNNABLE on that CPU and
+ * util_est > cpu_util
+ * then we use util_est since it returns a more restrictive
+ * estimation of the spare capacity on that CPU, by just
+ * considering the expected utilization of tasks already
+ * runnable on that CPU.
+ *
+ * Cases a) and b) are covered by the above code, while case c) is
+ * covered by the following code when estimated utilization is
+ * enabled.
+ */
+ if (sched_feat(UTIL_EST))
+ util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
+
+ /*
+ * Utilization (estimated) can exceed the CPU capacity, thus let's
+ * clamp to the maximum CPU capacity to ensure consistency with
+ * the cpu_util call.
+ */
+ return min_t(unsigned long, util, capacity_orig_of(cpu));
}

/*

Subject: [tip:sched/core] sched/cpufreq/schedutil: Use util_est for OPP selection

Commit-ID: a07630b8b2c16f82fd5b71d890079f4dd7599c1d
Gitweb: https://git.kernel.org/tip/a07630b8b2c16f82fd5b71d890079f4dd7599c1d
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 9 Mar 2018 09:52:44 +0000
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 20 Mar 2018 08:11:08 +0100

sched/cpufreq/schedutil: Use util_est for OPP selection

When schedutil looks at the CPU utilization, the current PELT value for
that CPU is returned straight away. In certain scenarios this can have
undesired side effects and delays on frequency selection.

For example, since the task utilization is decayed at wakeup time, a
long sleeping big task, when newly enqueued, does not immediately add a
significant contribution to the target CPU. This introduces some latency
before schedutil will be able to detect the best frequency required by
that task.

Moreover, the PELT signal build-up time is a function of the current
frequency, because of the scale-invariant load tracking support. Thus,
starting from a lower frequency, the utilization build-up time will
increase even more, further delaying the selection of the frequency
which best serves the task's requirements.

In order to reduce this kind of latency, we integrate the use
of the CPU's estimated utilization in the sugov_get_util() function.

This allows us to properly consider the expected utilization of a CPU which,
for example, has just got a big task running after a long sleep period.
Ultimately, this allows selecting the best frequency to run a task
right after its wake-up.

Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Acked-by: Viresh Kumar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/sched.h | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22909ffc04fb..c3deaee7a7a2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2163,6 +2163,13 @@ static inline unsigned long cpu_util_dl(struct rq *rq)

static inline unsigned long cpu_util_cfs(struct rq *rq)
{
- return rq->cfs.avg.util_avg;
+ unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);
+
+ if (sched_feat(UTIL_EST)) {
+ util = max_t(unsigned long, util,
+ READ_ONCE(rq->cfs.avg.util_est.enqueued));
+ }
+
+ return util;
}
#endif

Subject: [tip:sched/core] sched/fair: Update util_est only on util_avg updates

Commit-ID: d519329f72a6f36bc4f2b85452640cfe583b4f81
Gitweb: https://git.kernel.org/tip/d519329f72a6f36bc4f2b85452640cfe583b4f81
Author: Patrick Bellasi <[email protected]>
AuthorDate: Fri, 9 Mar 2018 09:52:45 +0000
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 20 Mar 2018 08:11:09 +0100

sched/fair: Update util_est only on util_avg updates

The estimated utilization of a task is currently updated every time the
task is dequeued. However, to keep overheads under control, PELT signals
are effectively updated at maximum once every 1ms.

Thus, for really short running tasks, it can happen that their util_avg
value has not been updated since their last enqueue. If such tasks are
also frequently running tasks (e.g. the kind of workload generated by
hackbench) it can also happen that their util_avg is updated only every
few activations.

This means that updating util_est at every dequeue potentially introduces
unnecessary overhead, and it is also conceptually wrong if the util_avg
signal has never been updated during a task activation.

Let's introduce a throttling mechanism on task's util_est updates
to sync them with util_avg updates. To make the solution memory
efficient, both in terms of space and load/store operations, we encode a
synchronization flag into the LSB of util_est.enqueued.
This makes util_est an even-values-only metric, which is still
considered good enough for its purpose.
The synchronization bit is (re)set by __update_load_avg_se() once the
PELT signal of a task has been updated during its last activation.

Such a throttling mechanism allows us to keep util_est overheads under
control in the wakeup hot path, thus making it suitable to be enabled also
on systems running high-intensity workloads.
The estimated utilization scheduler feature is therefore now switched on by
default.

Suggested-by: Chris Redpath <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J . Wysocki <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 42 ++++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 2 +-
2 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 570b8d056282..0951d1c58d2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3242,6 +3242,32 @@ ___update_load_avg(struct sched_avg *sa, unsigned long load, unsigned long runna
sa->util_avg = sa->util_sum / divider;
}

+/*
+ * When a task is dequeued, its estimated utilization should not be updated if
+ * its util_avg has not been updated at least once.
+ * This flag is used to synchronize util_avg updates with util_est updates.
+ * We map this information into the LSB bit of the utilization saved at
+ * dequeue time (i.e. util_est.dequeued).
+ */
+#define UTIL_AVG_UNCHANGED 0x1
+
+static inline void cfs_se_util_change(struct sched_avg *avg)
+{
+ unsigned int enqueued;
+
+ if (!sched_feat(UTIL_EST))
+ return;
+
+ /* Avoid store if the flag has been already set */
+ enqueued = avg->util_est.enqueued;
+ if (!(enqueued & UTIL_AVG_UNCHANGED))
+ return;
+
+ /* Reset flag to report util_avg has been updated */
+ enqueued &= ~UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(avg->util_est.enqueued, enqueued);
+}
+
/*
* sched_entity:
*
@@ -3293,6 +3319,7 @@ __update_load_avg_se(u64 now, int cpu, struct cfs_rq *cfs_rq, struct sched_entit
cfs_rq->curr == se)) {

___update_load_avg(&se->avg, se_weight(se), se_runnable(se));
+ cfs_se_util_change(&se->avg);
return 1;
}

@@ -3900,7 +3927,7 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,

/* Update root cfs_rq's estimated utilization */
enqueued = cfs_rq->avg.util_est.enqueued;
- enqueued += _task_util_est(p);
+ enqueued += (_task_util_est(p) | UTIL_AVG_UNCHANGED);
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

@@ -3936,7 +3963,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
if (cfs_rq->nr_running) {
ue.enqueued = cfs_rq->avg.util_est.enqueued;
ue.enqueued -= min_t(unsigned int, ue.enqueued,
- _task_util_est(p));
+ (_task_util_est(p) | UTIL_AVG_UNCHANGED));
}
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, ue.enqueued);

@@ -3947,12 +3974,19 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
if (!task_sleep)
return;

+ /*
+ * If the PELT values haven't changed since enqueue time,
+ * skip the util_est update.
+ */
+ ue = p->se.avg.util_est;
+ if (ue.enqueued & UTIL_AVG_UNCHANGED)
+ return;
+
/*
* Skip update of task's estimated utilization when its EWMA is
* already ~1% close to its last activation value.
*/
- ue = p->se.avg.util_est;
- ue.enqueued = task_util(p);
+ ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
last_ewma_diff = ue.enqueued - ue.ewma;
if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
return;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c459a4b61544..85ae8488039c 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -89,4 +89,4 @@ SCHED_FEAT(WA_BIAS, true)
/*
* UtilEstimation. Use estimated CPU utilization.
*/
-SCHED_FEAT(UTIL_EST, false)
+SCHED_FEAT(UTIL_EST, true)