2018-06-04 16:06:54

by Patrick Bellasi

Subject: [PATCH 0/2] Improve estimated utilization of preempted FAIR tasks

Here is a small series to improve the estimated utilization of FAIR
tasks which are preempted either by another FAIR task or by tasks of a
higher-priority scheduling class (i.e. RT and DL).

It can certainly be improved, but it has already been functionally
tested. Given the discussion going on on IRC and in this other
thread:

https://lkml.org/lkml/2018/5/25/410
[email protected]

I wanted to anticipate a possible simple alternative/additional approach
to improve CPU utilization tracking for CFS tasks. Thus, every comment
and suggestion is more than welcome!

The main reasons and the overall idea are explained by the second patch
of this series, while the first patch is just a first step in the
direction of a more memory- and run-time-efficient implementation.

The ultimate benefit of this series is to make the CFS class more robust
in terms of its integration with the schedutil governor. Indeed, by
better estimating the CPU bandwidth requirements of CFS tasks, we can
assess their bandwidth demands more accurately and be less sensitive to
side effects on PELT generated by higher-priority classes preempting CFS
tasks, such as the decay of CFS util_avg.

Cheers,
Patrick

Patrick Bellasi (2):
sched/fair: pelt: use u32 for util_avg
sched/fair: util_est: add running_sum tracking

include/linux/sched.h | 3 ++-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 33 ++++++++++++++++++++++++---------
3 files changed, 27 insertions(+), 11 deletions(-)

--
2.15.1



2018-06-04 16:07:10

by Patrick Bellasi

Subject: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

The estimated utilization of a task is affected by the task being
preempted, either by another FAIR task or by a task of a higher
priority class (i.e. RT or DL). Indeed, when a preemption happens, the
PELT utilization of the preempted task is going to be decayed a bit.
That's actually correct for utilization, whose goal is to measure the
actual CPU bandwidth consumed by a task.

However, the above behavior does not tell us exactly what utilization a
task "would have used" if it had been running without being preempted.
This reduces the effectiveness of util_est for a task, because it does
not always allow us to predict how much CPU time a task is likely to
require.

Let's improve the estimated utilization by adding a new "sort-of" PELT
signal, explicitly only for SEs, which has the following behavior:
a) at each enqueue time of a task, its value is the (already decayed)
util_avg of the task being enqueued
b) it's updated at each update_load_avg
c) it can only increase, while the task is actually RUNNING on a
CPU, and it's kept stable while the task is RUNNABLE but not
actively consuming CPU bandwidth

Such a signal is exactly equivalent to util_avg for a task running
alone on a CPU while, in case the task is preempted, it allows us to
know at dequeue time what the task's utilization would have been if it
had been running alone on that CPU.

This new signal is named "running_avg", since it tracks the actual
RUNNING time of a task, ignoring any form of preemption.

From an implementation standpoint, since the sched_avg should fit into a
single cache line, we save space by tracking only a new running sum:
p->se.avg.running_sum
while the conversion into a running_avg is done on demand whenever we
need it, which is at task dequeue time, when a new util_est sample has
to be collected.

The conversion from "running_sum" to "running_avg" is done by performing
a single division by LOAD_AVG_MAX, which introduces a small error since
the division does not consider the (sa->period_contrib - 1024)
compensation factor used in ___update_load_avg(). However:
a) this error is expected to be limited (~2-3%)
b) it can be safely ignored, since the estimated utilization is the
only consumer, and it is already subject to small estimation errors

The corresponding benefit is that, at run time, we pay only the cost of
an additional sum and multiply, while the more expensive division is
required only at dequeue time.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 16 ++++++++++++++--
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9d8732dab264..2bd5f1c68da9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -399,6 +399,7 @@ struct sched_avg {
u64 load_sum;
u64 runnable_load_sum;
u32 util_sum;
+ u32 running_sum;
u32 period_contrib;
unsigned long load_avg;
unsigned long runnable_load_avg;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f74441be3f44..5d54d6a4c31f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
sa->runnable_load_sum =
decay_load(sa->runnable_load_sum, periods);
sa->util_sum = decay_load((u64)(sa->util_sum), periods);
+ if (running)
+ sa->running_sum = decay_load(sa->running_sum, periods);

/*
* Step 2
@@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
sa->load_sum += load * contrib;
if (runnable)
sa->runnable_load_sum += runnable * contrib;
- if (running)
+ if (running) {
sa->util_sum += contrib * scale_cpu;
+ sa->running_sum += contrib * scale_cpu;
+ }

return periods;
}
@@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
}

+static inline void util_est_enqueue_running(struct task_struct *p)
+{
+ /* Initialize the (non-preempted) utilization */
+ p->se.avg.running_sum = p->se.avg.util_sum;
+}
+
/*
* Check if a (signed) value is within a specified (unsigned) margin,
* based on the observation that:
@@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
* Skip update of task's estimated utilization when its EWMA is
* already ~1% close to its last activation value.
*/
- ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
+ ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
last_ewma_diff = ue.enqueued - ue.ewma;
if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
return;
@@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);

+ util_est_enqueue_running(p);
+
hrtick_update(rq);
}

--
2.15.1


2018-06-04 16:08:12

by Patrick Bellasi

Subject: [PATCH 1/2] sched/fair: pelt: use u32 for util_avg

The util_avg signal is used to track the utilization (i.e. RUNNING time)
of SEs and RQs. Its values are computed according to the PELT algorithm
and thus, for SEs, they are bounded to an (internal) representation
which uses 20 bits. For RQs, instead, they are technically unbounded,
since when tasks are migrated across RQs we sum their utilization into
the destination RQ.

We currently use an unsigned long to track util_avg, which maps to
64-bit storage on 64-bit systems. However, 32 bits are good enough for
all practical usages. Indeed, even for RQs, the remaining 12 bits allow
tracking up to 4K 100% tasks concurrently RUNNABLE on a single CPU.

Since the sched_avg data structure already completely fits a 64B cache
line, let's get back 4B by using a u32 to track util_avg. The recovered
space could conveniently be used to fit other load-tracking-related
metrics into the same cache line.

Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Todd Kjos <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Steve Muckle <[email protected]>
Cc: Dietmar Eggemann <[email protected]>
Cc: Morten Rasmussen <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/sched.h | 2 +-
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 17 ++++++++++-------
3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 28ff3ca9f752..9d8732dab264 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -402,7 +402,7 @@ struct sched_avg {
u32 period_contrib;
unsigned long load_avg;
unsigned long runnable_load_avg;
- unsigned long util_avg;
+ u32 util_avg;
struct util_est util_est;
} ____cacheline_aligned;

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 15b10e210a6b..a985789eeb9c 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -541,7 +541,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.load_avg);
SEQ_printf(m, " .%-30s: %lu\n", "runnable_load_avg",
cfs_rq->avg.runnable_load_avg);
- SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
+ SEQ_printf(m, " .%-30s: %u\n", "util_avg",
cfs_rq->avg.util_avg);
SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
cfs_rq->avg.util_est.enqueued);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..f74441be3f44 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -750,19 +750,22 @@ static void attach_entity_cfs_rq(struct sched_entity *se);
void post_init_entity_util_avg(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- struct sched_avg *sa = &se->avg;
long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;

if (cap > 0) {
- if (cfs_rq->avg.util_avg != 0) {
- sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
- sa->util_avg /= (cfs_rq->avg.load_avg + 1);
+ struct sched_avg *sa = &se->avg;
+ u64 util_avg = READ_ONCE(sa->util_avg);

- if (sa->util_avg > cap)
- sa->util_avg = cap;
+ if (cfs_rq->avg.util_avg != 0) {
+ util_avg = cfs_rq->avg.util_avg * se->load.weight;
+ util_avg /= (cfs_rq->avg.load_avg + 1);
+ if (util_avg > cap)
+ util_avg = cap;
} else {
- sa->util_avg = cap;
+ util_avg = cap;
}
+
+ WRITE_ONCE(sa->util_avg, util_avg);
}

if (entity_is_task(se)) {
--
2.15.1


2018-06-04 17:47:05

by Joel Fernandes

Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

Hi Patrick,

On Mon, Jun 04, 2018 at 05:06:00PM +0100, Patrick Bellasi wrote:
> The estimated utilization of a task is affected by the task being
> preempted, either by another FAIR task or by a task of a higher
> priority class (i.e. RT or DL). Indeed, when a preemption happens, the
> PELT utilization of the preempted task is going to be decayed a bit.
> That's actually correct for utilization, whose goal is to measure the
> actual CPU bandwidth consumed by a task.
>
> However, the above behavior does not tell us exactly what utilization a
> task "would have used" if it had been running without being preempted.
> This reduces the effectiveness of util_est for a task, because it does
> not always allow us to predict how much CPU time a task is likely to
> require.
>
> Let's improve the estimated utilization by adding a new "sort-of" PELT
> signal, explicitly only for SEs, which has the following behavior:
> a) at each enqueue time of a task, its value is the (already decayed)
> util_avg of the task being enqueued
> b) it's updated at each update_load_avg
> c) it can only increase, while the task is actually RUNNING on a
> CPU, and it's kept stable while the task is RUNNABLE but not
> actively consuming CPU bandwidth
>
> Such a signal is exactly equivalent to util_avg for a task running
> alone on a CPU while, in case the task is preempted, it allows us to
> know at dequeue time what the task's utilization would have been if it
> had been running alone on that CPU.
>
> This new signal is named "running_avg", since it tracks the actual
> RUNNING time of a task, ignoring any form of preemption.
>
> From an implementation standpoint, since the sched_avg should fit into a
> single cache line, we save space by tracking only a new running sum:
> p->se.avg.running_sum
> while the conversion into a running_avg is done on demand whenever we
> need it, which is at task dequeue time, when a new util_est sample has
> to be collected.
>
> The conversion from "running_sum" to "running_avg" is done by performing
> a single division by LOAD_AVG_MAX, which introduces a small error since
> the division does not consider the (sa->period_contrib - 1024)
> compensation factor used in ___update_load_avg(). However:
> a) this error is expected to be limited (~2-3%)
> b) it can be safely ignored, since the estimated utilization is the
> only consumer, and it is already subject to small estimation errors
>
> The corresponding benefit is that, at run time, we pay only the cost of
> an additional sum and multiply, while the more expensive division is
> required only at dequeue time.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 16 ++++++++++++++--
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9d8732dab264..2bd5f1c68da9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -399,6 +399,7 @@ struct sched_avg {
> u64 load_sum;
> u64 runnable_load_sum;
> u32 util_sum;
> + u32 running_sum;
> u32 period_contrib;
> unsigned long load_avg;
> unsigned long runnable_load_avg;

Should we also update the documentation comments above the struct?

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f74441be3f44..5d54d6a4c31f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> sa->runnable_load_sum =
> decay_load(sa->runnable_load_sum, periods);
> sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> + if (running)
> + sa->running_sum = decay_load(sa->running_sum, periods);
>
> /*
> * Step 2
> @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> sa->load_sum += load * contrib;
> if (runnable)
> sa->runnable_load_sum += runnable * contrib;
> - if (running)
> + if (running) {
> sa->util_sum += contrib * scale_cpu;
> + sa->running_sum += contrib * scale_cpu;
> + }
>
> return periods;
> }
> @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> }

The PELT changes look nice and make sense :)

> +static inline void util_est_enqueue_running(struct task_struct *p)
> +{
> + /* Initialize the (non-preempted) utilization */
> + p->se.avg.running_sum = p->se.avg.util_sum;
> +}
> +
> /*
> * Check if a (signed) value is within a specified (unsigned) margin,
> * based on the observation that:
> @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> * Skip update of task's estimated utilization when its EWMA is
> * already ~1% close to its last activation value.
> */
> - ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> + ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;

I guess we are doing an extra division here, which adds some cost. Does
performance look OK with the change?

thanks,

- Joel


2018-06-05 01:30:57

by kernel test robot

Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20180604]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: i386-randconfig-a1-06041847 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:5450:2: error: implicit declaration of function 'util_est_enqueue_running' [-Werror=implicit-function-declaration]
util_est_enqueue_running(p);
^
cc1: some warnings being treated as errors

vim +/util_est_enqueue_running +5450 kernel/sched/fair.c

5389
5390 /*
5391 * The enqueue_task method is called before nr_running is
5392 * increased. Here we update the fair scheduling stats and
5393 * then put the task into the rbtree:
5394 */
5395 static void
5396 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
5397 {
5398 struct cfs_rq *cfs_rq;
5399 struct sched_entity *se = &p->se;
5400
5401 /*
5402 * The code below (indirectly) updates schedutil which looks at
5403 * the cfs_rq utilization to select a frequency.
5404 * Let's add the task's estimated utilization to the cfs_rq's
5405 * estimated utilization, before we update schedutil.
5406 */
5407 util_est_enqueue(&rq->cfs, p);
5408
5409 /*
5410 * If in_iowait is set, the code below may not trigger any cpufreq
5411 * utilization updates, so do it here explicitly with the IOWAIT flag
5412 * passed.
5413 */
5414 if (p->in_iowait)
5415 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
5416
5417 for_each_sched_entity(se) {
5418 if (se->on_rq)
5419 break;
5420 cfs_rq = cfs_rq_of(se);
5421 enqueue_entity(cfs_rq, se, flags);
5422
5423 /*
5424 * end evaluation on encountering a throttled cfs_rq
5425 *
5426 * note: in the case of encountering a throttled cfs_rq we will
5427 * post the final h_nr_running increment below.
5428 */
5429 if (cfs_rq_throttled(cfs_rq))
5430 break;
5431 cfs_rq->h_nr_running++;
5432
5433 flags = ENQUEUE_WAKEUP;
5434 }
5435
5436 for_each_sched_entity(se) {
5437 cfs_rq = cfs_rq_of(se);
5438 cfs_rq->h_nr_running++;
5439
5440 if (cfs_rq_throttled(cfs_rq))
5441 break;
5442
5443 update_load_avg(cfs_rq, se, UPDATE_TG);
5444 update_cfs_group(se);
5445 }
5446
5447 if (!se)
5448 add_nr_running(rq, 1);
5449
> 5450 util_est_enqueue_running(p);
5451
5452 hrtick_update(rq);
5453 }
5454

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-06-05 01:32:40

by kernel test robot

Subject: Re: [PATCH 1/2] sched/fair: pelt: use u32 for util_avg

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.17 next-20180604]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: i386-defconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

kernel/sched/fair.o: In function `post_init_entity_util_avg':
>> fair.c:(.text+0xa057): undefined reference to `__udivdi3'




2018-06-05 01:35:38

by kernel test robot

Subject: Re: [PATCH 1/2] sched/fair: pelt: use u32 for util_avg

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on v4.17 next-20180604]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: i386-randconfig-s0-201822 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

kernel/sched/fair.o: In function `post_init_entity_util_avg':
>> kernel/sched/fair.c:761: undefined reference to `__udivdi3'

vim +761 kernel/sched/fair.c

724
725 /*
726 * With new tasks being created, their initial util_avgs are extrapolated
727 * based on the cfs_rq's current util_avg:
728 *
729 * util_avg = cfs_rq->util_avg / (cfs_rq->load_avg + 1) * se.load.weight
730 *
731 * However, in many cases, the above util_avg does not give a desired
732 * value. Moreover, the sum of the util_avgs may be divergent, such
733 * as when the series is a harmonic series.
734 *
735 * To solve this problem, we also cap the util_avg of successive tasks to
736 * only 1/2 of the left utilization budget:
737 *
738 * util_avg_cap = (1024 - cfs_rq->avg.util_avg) / 2^n
739 *
740 * where n denotes the nth task.
741 *
742 * For example, a simplest series from the beginning would be like:
743 *
744 * task util_avg: 512, 256, 128, 64, 32, 16, 8, ...
745 * cfs_rq util_avg: 512, 768, 896, 960, 992, 1008, 1016, ...
746 *
747 * Finally, that extrapolated util_avg is clamped to the cap (util_avg_cap)
748 * if util_avg > util_avg_cap.
749 */
750 void post_init_entity_util_avg(struct sched_entity *se)
751 {
752 struct cfs_rq *cfs_rq = cfs_rq_of(se);
753 long cap = (long)(SCHED_CAPACITY_SCALE - cfs_rq->avg.util_avg) / 2;
754
755 if (cap > 0) {
756 struct sched_avg *sa = &se->avg;
757 u64 util_avg = READ_ONCE(sa->util_avg);
758
759 if (cfs_rq->avg.util_avg != 0) {
760 util_avg = cfs_rq->avg.util_avg * se->load.weight;
> 761 util_avg /= (cfs_rq->avg.load_avg + 1);
762 if (util_avg > cap)
763 util_avg = cap;
764 } else {
765 util_avg = cap;
766 }
767
768 WRITE_ONCE(sa->util_avg, util_avg);
769 }
770
771 if (entity_is_task(se)) {
772 struct task_struct *p = task_of(se);
773 if (p->sched_class != &fair_sched_class) {
774 /*
775 * For !fair tasks do:
776 *
777 update_cfs_rq_load_avg(now, cfs_rq);
778 attach_entity_load_avg(cfs_rq, se, 0);
779 switched_from_fair(rq, p);
780 *
781 * such that the next switched_to_fair() has the
782 * expected state.
783 */
784 se->avg.last_update_time = cfs_rq_clock_task(cfs_rq);
785 return;
786 }
787 }
788
789 attach_entity_cfs_rq(se);
790 }
791




2018-06-05 02:27:47

by kernel test robot

Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

Hi Patrick,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on tip/sched/core]
[also build test ERROR on next-20180604]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Patrick-Bellasi/sched-fair-pelt-use-u32-for-util_avg/20180605-082640
config: x86_64-randconfig-x011-201822 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All errors (new ones prefixed by >>):

kernel/sched/fair.c: In function 'enqueue_task_fair':
>> kernel/sched/fair.c:5450:2: error: implicit declaration of function 'util_est_enqueue_running'; did you mean 'util_est_enqueue'? [-Werror=implicit-function-declaration]
util_est_enqueue_running(p);
^~~~~~~~~~~~~~~~~~~~~~~~
util_est_enqueue
Cyclomatic Complexity 5 include/linux/compiler.h:__read_once_size
Cyclomatic Complexity 1 include/linux/kasan-checks.h:kasan_check_write
Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:constant_test_bit
Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:variable_test_bit
Cyclomatic Complexity 1 arch/x86/include/asm/bitops.h:fls
Cyclomatic Complexity 1 include/linux/log2.h:__ilog2_u32
Cyclomatic Complexity 1 arch/x86/include/asm/current.h:get_current
Cyclomatic Complexity 1 arch/x86/include/asm/atomic64_64.h:arch_atomic64_add
Cyclomatic Complexity 1 include/asm-generic/atomic-instrumented.h:atomic64_add
Cyclomatic Complexity 2 arch/x86/include/asm/jump_label.h:arch_static_branch
Cyclomatic Complexity 1 include/linux/jump_label.h:static_key_false
Cyclomatic Complexity 1 include/linux/math64.h:mul_u64_u32_shr
Cyclomatic Complexity 2 include/linux/thread_info.h:test_ti_thread_flag
Cyclomatic Complexity 5 arch/x86/include/asm/preempt.h:__preempt_count_add
Cyclomatic Complexity 1 arch/x86/include/asm/preempt.h:__preempt_count_dec_and_test
Cyclomatic Complexity 1 include/linux/lockdep.h:lock_is_held
Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_lock_acquire
Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_lock_release
Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_read_lock
Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_read_unlock
Cyclomatic Complexity 1 include/linux/rcupdate.h:rcu_read_lock_sched_notrace
Cyclomatic Complexity 2 include/linux/rcupdate.h:rcu_read_unlock_sched_notrace
Cyclomatic Complexity 1 include/linux/rbtree.h:rb_link_node
Cyclomatic Complexity 1 include/linux/sched.h:task_thread_info
Cyclomatic Complexity 1 include/linux/sched.h:test_tsk_thread_flag
Cyclomatic Complexity 1 include/linux/sched.h:test_tsk_need_resched
Cyclomatic Complexity 1 include/linux/sched.h:task_cpu
Cyclomatic Complexity 1 include/linux/sched/cputime.h:get_running_cputimer
Cyclomatic Complexity 2 include/linux/sched/cputime.h:account_group_exec_runtime
Cyclomatic Complexity 1 include/linux/cgroup.h:task_css_set
Cyclomatic Complexity 1 include/linux/cgroup.h:task_dfl_cgroup
Cyclomatic Complexity 3 include/linux/cgroup.h:cgroup_parent
Cyclomatic Complexity 1 include/linux/cgroup.h:cpuacct_charge
Cyclomatic Complexity 2 include/linux/cgroup.h:cgroup_account_cputime
Cyclomatic Complexity 1 kernel/sched/sched.h:cpu_of
Cyclomatic Complexity 1 kernel/sched/sched.h:assert_clock_updated
Cyclomatic Complexity 4 kernel/sched/sched.h:rq_clock
Cyclomatic Complexity 4 kernel/sched/sched.h:rq_clock_task
Cyclomatic Complexity 4 kernel/sched/sched.h:rq_clock_skip_update
Cyclomatic Complexity 1 kernel/sched/sched.h:rq_pin_lock
Cyclomatic Complexity 1 kernel/sched/sched.h:rq_unpin_lock
Cyclomatic Complexity 1 kernel/sched/sched.h:task_on_rq_queued
Cyclomatic Complexity 1 kernel/sched/sched.h:put_prev_task
Cyclomatic Complexity 1 kernel/sched/sched.h:sched_update_tick_dependency
Cyclomatic Complexity 2 kernel/sched/sched.h:add_nr_running
Cyclomatic Complexity 1 kernel/sched/sched.h:sub_nr_running
Cyclomatic Complexity 1 kernel/sched/sched.h:hrtick_enabled
Cyclomatic Complexity 1 kernel/sched/sched.h:rq_lock
Cyclomatic Complexity 1 kernel/sched/sched.h:rq_unlock
Cyclomatic Complexity 2 kernel/sched/sched.h:cpufreq_update_util
Cyclomatic Complexity 4 include/trace/events/sched.h:trace_sched_stat_runtime
Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_add
Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_sub
Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_set
Cyclomatic Complexity 2 kernel/sched/fair.c:task_of
Cyclomatic Complexity 2 kernel/sched/fair.c:rq_of
Cyclomatic Complexity 1 kernel/sched/fair.c:task_cfs_rq
Cyclomatic Complexity 1 kernel/sched/fair.c:cfs_rq_of
Cyclomatic Complexity 1 kernel/sched/fair.c:group_cfs_rq
Cyclomatic Complexity 1 kernel/sched/fair.c:list_add_leaf_cfs_rq
Cyclomatic Complexity 1 kernel/sched/fair.c:parent_entity
Cyclomatic Complexity 1 kernel/sched/fair.c:find_matching_se
Cyclomatic Complexity 2 kernel/sched/fair.c:max_vruntime
Cyclomatic Complexity 2 kernel/sched/fair.c:min_vruntime
Cyclomatic Complexity 1 kernel/sched/fair.c:entity_before
Cyclomatic Complexity 2 kernel/sched/fair.c:calc_delta_fair
Cyclomatic Complexity 1 kernel/sched/fair.c:update_tg_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_wait_start
Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_wait_end
Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_enqueue
Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_dequeue
Cyclomatic Complexity 1 kernel/sched/fair.c:update_stats_curr_start
Cyclomatic Complexity 1 kernel/sched/fair.c:task_tick_numa
Cyclomatic Complexity 2 kernel/sched/fair.c:account_entity_enqueue
Cyclomatic Complexity 2 kernel/sched/fair.c:account_entity_dequeue
Cyclomatic Complexity 1 kernel/sched/fair.c:enqueue_runnable_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:dequeue_runnable_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:enqueue_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:dequeue_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:update_cfs_group
Cyclomatic Complexity 3 kernel/sched/fair.c:cfs_rq_util_change
Cyclomatic Complexity 1 kernel/sched/fair.c:update_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:attach_entity_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:detach_entity_load_avg
Cyclomatic Complexity 1 kernel/sched/fair.c:idle_balance
Cyclomatic Complexity 1 kernel/sched/fair.c:util_est_enqueue
Cyclomatic Complexity 1 kernel/sched/fair.c:util_est_dequeue
Cyclomatic Complexity 1 kernel/sched/fair.c:check_spread
Cyclomatic Complexity 1 kernel/sched/fair.c:check_schedstat_required
Cyclomatic Complexity 3 kernel/sched/fair.c:__clear_buddies_last
Cyclomatic Complexity 3 kernel/sched/fair.c:__clear_buddies_next
Cyclomatic Complexity 3 kernel/sched/fair.c:__clear_buddies_skip
Cyclomatic Complexity 4 kernel/sched/fair.c:clear_buddies
Cyclomatic Complexity 1 kernel/sched/fair.c:account_cfs_rq_runtime
Cyclomatic Complexity 1 kernel/sched/fair.c:check_cfs_rq_runtime
Cyclomatic Complexity 1 kernel/sched/fair.c:check_enqueue_throttle
Cyclomatic Complexity 1 kernel/sched/fair.c:return_cfs_rq_runtime

vim +5450 kernel/sched/fair.c

5389
5390 /*
5391 * The enqueue_task method is called before nr_running is
5392 * increased. Here we update the fair scheduling stats and
5393 * then put the task into the rbtree:
5394 */
5395 static void
5396 enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
5397 {
5398 struct cfs_rq *cfs_rq;
5399 struct sched_entity *se = &p->se;
5400
5401 /*
5402 * The code below (indirectly) updates schedutil which looks at
5403 * the cfs_rq utilization to select a frequency.
5404 * Let's add the task's estimated utilization to the cfs_rq's
5405 * estimated utilization, before we update schedutil.
5406 */
5407 util_est_enqueue(&rq->cfs, p);
5408
5409 /*
5410 * If in_iowait is set, the code below may not trigger any cpufreq
5411 * utilization updates, so do it here explicitly with the IOWAIT flag
5412 * passed.
5413 */
5414 if (p->in_iowait)
5415 cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
5416
5417 for_each_sched_entity(se) {
5418 if (se->on_rq)
5419 break;
5420 cfs_rq = cfs_rq_of(se);
5421 enqueue_entity(cfs_rq, se, flags);
5422
5423 /*
5424 * end evaluation on encountering a throttled cfs_rq
5425 *
5426 * note: in the case of encountering a throttled cfs_rq we will
5427 * post the final h_nr_running increment below.
5428 */
5429 if (cfs_rq_throttled(cfs_rq))
5430 break;
5431 cfs_rq->h_nr_running++;
5432
5433 flags = ENQUEUE_WAKEUP;
5434 }
5435
5436 for_each_sched_entity(se) {
5437 cfs_rq = cfs_rq_of(se);
5438 cfs_rq->h_nr_running++;
5439
5440 if (cfs_rq_throttled(cfs_rq))
5441 break;
5442
5443 update_load_avg(cfs_rq, se, UPDATE_TG);
5444 update_cfs_group(se);
5445 }
5446
5447 if (!se)
5448 add_nr_running(rq, 1);
5449
> 5450 util_est_enqueue_running(p);
5451
5452 hrtick_update(rq);
5453 }
5454

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation



2018-06-05 06:58:10

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On 4 June 2018 at 18:06, Patrick Bellasi <[email protected]> wrote:
> The estimated utilization of a task is affected by the task being
> preempted, either by another FAIR task or by a task of a higher
> priority class (i.e. RT or DL). Indeed, when a preemption happens, the
> PELT utilization of the preempted task is going to be decayed a bit.
> That's actually correct for utilization, whose goal is to measure the
> actual CPU bandwidth consumed by a task.
>
> However, the above behavior does not allow us to know exactly what
> utilization a task "would have used" if it was running without
> being preempted. Thus, this reduces the effectiveness of util_est for a
> task, because it does not always allow us to predict how much CPU a
> task is likely to require.
>
> Let's improve the estimated utilization by adding a new "sort-of" PELT
> signal, tracked explicitly only for SEs, which has the following behavior:
> a) at each enqueue time of a task, its value is the (already decayed)
> util_avg of the task being enqueued
> b) it's updated at each update_load_avg
> c) it can only increase, whenever the task is actually RUNNING on a
> CPU, while it's kept stable while the task is RUNNABLE but not
> actively consuming CPU bandwidth
>
> Such a defined signal is exactly equivalent to the util_avg for a task
> running alone on a CPU while, in case the task is preempted, it allows
> us to know at dequeue time how much the task's utilization would have
> been if it was running alone on that CPU.

I don't agree with this statement above.
Let's say that you have two periodic tasks which want to run 4ms in
every period of 10ms and wake up at the same time.
One task will execute first and the other will wait, but in the end
they have the same period and running time, and as a result the same
util_avg, which is exactly what they need.

>
> This new signal is named "running_avg", since it tracks the actual
> RUNNING time of a task by ignoring any form of preemption.

In fact, you still take into account this preemption, as you remove
some time, whereas it's only a shift of the execution.

>
> From an implementation standpoint, since the sched_avg should fit into a
> single cache line, we save space by tracking only a new runnable sum:
> p->se.avg.running_sum
> while the conversion into a running_avg is done on demand whenever we
> need it, which is at task dequeue time when a new util_est sample has to
> be collected.
>
> The conversion from "running_sum" to "running_avg" is done by performing
> a single division by LOAD_AVG_MAX, which introduces a small error since
> in the division we do not consider the (sa->period_contrib - 1024)
> compensation factor used in ___update_load_avg(). However:
> a) this error is expected to be limited (~2-3%)
> b) it can be safely ignored since the estimated utilization is the only
> consumer which is already subject to small estimation errors
>
> The additional corresponding benefit is that, at run-time, we pay the
> cost of an additional sum and multiply, while the more expensive
> division is required only at dequeue time.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Todd Kjos <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Steve Muckle <[email protected]>
> Cc: Dietmar Eggemann <[email protected]>
> Cc: Morten Rasmussen <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 16 ++++++++++++++--
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9d8732dab264..2bd5f1c68da9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -399,6 +399,7 @@ struct sched_avg {
> u64 load_sum;
> u64 runnable_load_sum;
> u32 util_sum;
> + u32 running_sum;
> u32 period_contrib;
> unsigned long load_avg;
> unsigned long runnable_load_avg;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f74441be3f44..5d54d6a4c31f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> sa->runnable_load_sum =
> decay_load(sa->runnable_load_sum, periods);
> sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> + if (running)
> + sa->running_sum = decay_load(sa->running_sum, periods);

so you make some time disappear from the equation, as the signal
will not be decayed, and make the period look shorter than it really is.
With the example I mentioned above, the 2nd task will be seen as if
its period becomes 6ms instead of 10ms, which is not true, and the
utilization of the task is overestimated without any good reason.
Furthermore, this new signal doesn't have a clear meaning, because it's
not utilization, but it's also not the waiting time of the task.

>
> /*
> * Step 2
> @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> sa->load_sum += load * contrib;
> if (runnable)
> sa->runnable_load_sum += runnable * contrib;
> - if (running)
> + if (running) {
> sa->util_sum += contrib * scale_cpu;
> + sa->running_sum += contrib * scale_cpu;
> + }
>
> return periods;
> }
> @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> }
>
> +static inline void util_est_enqueue_running(struct task_struct *p)
> +{
> + /* Initialize the (non-preempted) utilization */
> + p->se.avg.running_sum = p->se.avg.util_sum;
> +}
> +
> /*
> * Check if a (signed) value is within a specified (unsigned) margin,
> * based on the observation that:
> @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> * Skip update of task's estimated utilization when its EWMA is
> * already ~1% close to its last activation value.
> */
> - ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> + ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> last_ewma_diff = ue.enqueued - ue.ewma;
> if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> return;
> @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se)
> add_nr_running(rq, 1);
>
> + util_est_enqueue_running(p);
> +
> hrtick_update(rq);
> }
>
> --
> 2.15.1
>

2018-06-05 15:12:42

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

Hi Vincent,

On 05-Jun 08:57, Vincent Guittot wrote:
> On 4 June 2018 at 18:06, Patrick Bellasi <[email protected]> wrote:

[...]

> > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > signal, tracked explicitly only for SEs, which has the following behavior:
> > a) at each enqueue time of a task, its value is the (already decayed)
> > util_avg of the task being enqueued
> > b) it's updated at each update_load_avg
> > c) it can only increase, whenever the task is actually RUNNING on a
> > CPU, while it's kept stable while the task is RUNNABLE but not
> > actively consuming CPU bandwidth
> >
> > Such a defined signal is exactly equivalent to the util_avg for a task
> > running alone on a CPU while, in case the task is preempted, it allows
> > us to know at dequeue time how much the task's utilization would have
> > been if it was running alone on that CPU.
>
> I don't agree with this statement above.
> Let's say that you have two periodic tasks which want to run 4ms in
> every period of 10ms and wake up at the same time.
> One task will execute first and the other will wait, but in the end
> they have the same period and running time, and as a result the same
> util_avg, which is exactly what they need.

In your example above you say that both tasks will end up with a 40%
util_avg, and that's exactly also the value reported by the new
"running_avg" metric.

Both tasks, if they were running alone, would have generated a
40% utilization, and above I'm saying:

it allows us to know at dequeue time
how much the task's utilization would have been
if it was running alone on that CPU

I don't see where this is incorrect; maybe I should explain it better?

> > This new signal is named "running_avg", since it tracks the actual
> > RUNNING time of a task by ignoring any form of preemption.
>
> In fact, you still take into account this preemption, as you remove
> some time, whereas it's only a shift of the execution.

When the task is enqueued we cache the (already decayed) util_avg, and
from this time on the running_avg can only increase. It increases only
for the portion of CPU time the task is running and it's never decayed
while the task is preempted.

In your example above, if we look at the second task running, the one
"delayed", you will have:

@t1 wakeup time: running_avg@t1 = util_avg@t1
since we initialize it with the
"sleep decayed" util_avg

NOTE: this initialization is the important part, more on that on your
next comment below related to the period being changed.

@t2 switch_in time: running_avg@t2 = running_avg@t1
since it's not decayed while RUNNABLE but !RUNNING

while, meanwhile:

util_avg@t2 < util_avg@t1
since it's decayed while RUNNABLE but !RUNNING

@t3 dequeue time: running_avg@t3 > running_avg@t2


When you say "it's only a shift of the execution" I agree, and indeed
the running_avg is not affected at all.

So, we can say that I'm "accounting for preemption time" but really,
the way I do it, is just by _not decaying_ PELT when the task is:

RUNNABLE but !RUNNING

and that's why I say that running_avg gives you the same behavior of
util_avg _if_ the task was running alone, because in that case you never
have the condition above, and thus util_avg is never decayed.

[...]

> > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > sa->runnable_load_sum =
> > decay_load(sa->runnable_load_sum, periods);
> > sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > + if (running)
> > + sa->running_sum = decay_load(sa->running_sum, periods);
>
> so you make some time disappear from the equation, as the signal
> will not be decayed, and make the period look shorter than it really is.

Since at enqueue time we always initialize running_avg to whatever
util_avg is, I don't think we are messing up the period at all.

The util_avg signal properly accounts for the period.
Thus, the combined effect of:

- initializing running_avg at enqueue time with the value of
util_avg, already decayed to properly account for the task period
- not decaying running_avg when the task is RUNNABLE but !RUNNING

should just result in "virtually" considering the two tasks of your
example "as if" they were running concurrently on two different CPUs.

Isn't it?

> With the example I mentioned above, the 2nd task will be seen as if
> its period becomes 6ms instead of 10ms which is not true and the
> utilization of the task is overestimated without any good reason

I don't see that overestimation... and I cannot even measure it.

If I run an experiment with your example above, while using the
performance governor to rule out any possible scale invariance
difference, here is what I measure:

Task1 (40ms delayed by the following Task2):
mean std max
running_avg 455.387449 22.940168 492.0
util_avg 433.233288 17.395477 458.0

Task2 (waking up at same time of Task1 and running before):
mean std max
running_avg 430.281834 22.405175 455.0
util_avg 421.745331 22.098873 456.0

and if I compare Task1 above with another experiment where Task1 is
running alone:

Task1 (running alone):
mean std min
running_avg 460.257895 22.103704 460.0
util_avg 435.119737 17.647556 461.0


Thus, there are some variations, OK, but I think they are mostly due
to the rounding errors we have.
Task1 is still seen as a 455 expected utilization, which is not the
4/6 ~= 660 you say above, as it would be if it were accounted as a
task running 4ms every 6ms.

> Furthermore, this new signal doesn't have clear meaning because it's
> not utilization but it's also not the waiting time of the task.

The meaning is: the utilization of the task _if_ it were running
alone on that CPU. The numbers above show just that, since
455 (Task1 delayed) ~= 460 (Task1 running alone)

If it's running alone it would be exactly util_avg (minus rounding
errors), but if it's preempted it will be a better representation of
the task's needed CPU bandwidth.

Unless we really need to track the "waiting time of a task", I would
say that the signal above should make util_est much more accurate in
predicting, at wakeup time, how much CPU a task will likely need...
independently of preemption sources.

Here are some other experiments / workloads I'm testing:

https://gist.github.com/derkling/fe01e5ddf639e18b5d85a3b1d2e84b7d

> >
> > /*
> > * Step 2
> > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > sa->load_sum += load * contrib;
> > if (runnable)
> > sa->runnable_load_sum += runnable * contrib;
> > - if (running)
> > + if (running) {
> > sa->util_sum += contrib * scale_cpu;
> > + sa->running_sum += contrib * scale_cpu;
> > + }
> >
> > return periods;
> > }
> > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > }
> >
> > +static inline void util_est_enqueue_running(struct task_struct *p)
> > +{
> > + /* Initialize the (non-preempted) utilization */
> > + p->se.avg.running_sum = p->se.avg.util_sum;

This line above is what should ensure we do not mess up with the task
period.

> > +}
> > +
> > /*
> > * Check if a (signed) value is within a specified (unsigned) margin,
> > * based on the observation that:
> > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > * Skip update of task's estimated utilization when its EWMA is
> > * already ~1% close to its last activation value.
> > */
> > - ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > + ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> > last_ewma_diff = ue.enqueued - ue.ewma;
> > if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> > return;
> > @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!se)
> > add_nr_running(rq, 1);
> >
> > + util_est_enqueue_running(p);
> > +
> > hrtick_update(rq);
> > }
> >
> > --
> > 2.15.1
> >

--
#include <best/regards.h>

Patrick Bellasi

2018-06-05 15:23:07

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On 04-Jun 10:46, Joel Fernandes wrote:
> Hi Patrick,
>
> On Mon, Jun 04, 2018 at 05:06:00PM +0100, Patrick Bellasi wrote:
> > The estimated utilization of a task is affected by the task being
> > preempted, either by another FAIR task or by a task of a higher
> > priority class (i.e. RT or DL). Indeed, when a preemption happens, the
> > PELT utilization of the preempted task is going to be decayed a bit.
> > That's actually correct for utilization, whose goal is to measure the
> > actual CPU bandwidth consumed by a task.
> >
> > However, the above behavior does not allow us to know exactly what
> > utilization a task "would have used" if it was running without
> > being preempted. Thus, this reduces the effectiveness of util_est for a
> > task, because it does not always allow us to predict how much CPU a
> > task is likely to require.
> >
> > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > signal, tracked explicitly only for SEs, which has the following behavior:
> > a) at each enqueue time of a task, its value is the (already decayed)
> > util_avg of the task being enqueued
> > b) it's updated at each update_load_avg
> > c) it can only increase, whenever the task is actually RUNNING on a
> > CPU, while it's kept stable while the task is RUNNABLE but not
> > actively consuming CPU bandwidth
> >
> > Such a defined signal is exactly equivalent to the util_avg for a task
> > running alone on a CPU while, in case the task is preempted, it allows
> > us to know at dequeue time how much the task's utilization would have
> > been if it was running alone on that CPU.
> >
> > This new signal is named "running_avg", since it tracks the actual
> > RUNNING time of a task by ignoring any form of preemption.
> >
> > From an implementation standpoint, since the sched_avg should fit into a
> > single cache line, we save space by tracking only a new runnable sum:
> > p->se.avg.running_sum
> > while the conversion into a running_avg is done on demand whenever we
> > need it, which is at task dequeue time when a new util_est sample has to
> > be collected.
> >
> > The conversion from "running_sum" to "running_avg" is done by performing
> > a single division by LOAD_AVG_MAX, which introduces a small error since
> > in the division we do not consider the (sa->period_contrib - 1024)
> > compensation factor used in ___update_load_avg(). However:
> > a) this error is expected to be limited (~2-3%)
> > b) it can be safely ignored since the estimated utilization is the only
> > consumer which is already subject to small estimation errors
> >
> > The additional corresponding benefit is that, at run-time, we pay the
> > cost of an additional sum and multiply, while the more expensive
> > division is required only at dequeue time.
> >
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Todd Kjos <[email protected]>
> > Cc: Joel Fernandes <[email protected]>
> > Cc: Steve Muckle <[email protected]>
> > Cc: Dietmar Eggemann <[email protected]>
> > Cc: Morten Rasmussen <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > include/linux/sched.h | 1 +
> > kernel/sched/fair.c | 16 ++++++++++++++--
> > 2 files changed, 15 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 9d8732dab264..2bd5f1c68da9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -399,6 +399,7 @@ struct sched_avg {
> > u64 load_sum;
> > u64 runnable_load_sum;
> > u32 util_sum;
> > + u32 running_sum;
> > u32 period_contrib;
> > unsigned long load_avg;
> > unsigned long runnable_load_avg;
>
> Should update the documentation comments above the struct too?
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f74441be3f44..5d54d6a4c31f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > sa->runnable_load_sum =
> > decay_load(sa->runnable_load_sum, periods);
> > sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > + if (running)
> > + sa->running_sum = decay_load(sa->running_sum, periods);
> >
> > /*
> > * Step 2
> > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > sa->load_sum += load * contrib;
> > if (runnable)
> > sa->runnable_load_sum += runnable * contrib;
> > - if (running)
> > + if (running) {
> > sa->util_sum += contrib * scale_cpu;
> > + sa->running_sum += contrib * scale_cpu;
> > + }
> >
> > return periods;
> > }
> > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > }
>
> PELT changes look nice and makes sense :)

That's not strictly speaking a PELT change... the idea is more to work
"on top of PELT" to make it more effective in measuring a task's
expected CPU bandwidth requirement.

> > +static inline void util_est_enqueue_running(struct task_struct *p)
> > +{
> > > + /* Initialize the (non-preempted) utilization */
> > + p->se.avg.running_sum = p->se.avg.util_sum;
> > +}
> > +
> > /*
> > * Check if a (signed) value is within a specified (unsigned) margin,
> > * based on the observation that:
> > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > * Skip update of task's estimated utilization when its EWMA is
> > * already ~1% close to its last activation value.
> > */
> > - ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > + ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
>
> I guess we are doing an extra division here, which adds some cost. Does
> performance look OK with the change?

This extra division is done only at dequeue time, instead of
at each update_load_avg.

To be more precise, at each ___update_load_avg we should really update
running_avg by:

u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
sa->running_avg = sa->running_sum / divider;

but, this would imply tracking an additional signal in sched_avg and
doing an additional division at ___update_load_avg() time.

Morten suggested that, if we accept the rounding errors due to
considering

divider ~= LOAD_AVG_MAX

thus discarding the (sa->period_contrib - 1024) correction, then we
can completely skip the tracking of running_avg (thus saving space in
sched_avg) and approximate it at dequeue time as per the code line,
just to compute the new util_est sample to accumulate.

Does that make sense now?

--
#include <best/regards.h>

Patrick Bellasi

2018-06-05 15:31:56

by Juri Lelli

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On 05/06/18 16:11, Patrick Bellasi wrote:

[...]

> If I run an experiment with your example above, while using the
> performance governor to rule out any possible scale invariance
> difference, here is what I measure:
>
> Task1 (40ms delayed by the following Task2):
> mean std max
> running_avg 455.387449 22.940168 492.0
> util_avg 433.233288 17.395477 458.0
>
> Task2 (waking up at same time of Task1 and running before):
> mean std max
> running_avg 430.281834 22.405175 455.0
> util_avg 421.745331 22.098873 456.0
>
> and if I compare Task1 above with another experiment where Task1 is
> running alone:
>
> Task1 (running alone):
> mean std min
> running_avg 460.257895 22.103704 460.0
> util_avg 435.119737 17.647556 461.0

Wait, why again in this last case running_avg != util_avg? :)


2018-06-05 16:55:17

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On 05-Jun 17:31, Juri Lelli wrote:
> On 05/06/18 16:11, Patrick Bellasi wrote:
>
> [...]
>
> > If I run an experiment with your example above, while using the
> > performance governor to rule out any possible scale invariance
> > difference, here is what I measure:
> >
> > Task1 (40ms delayed by the following Task2):
> > mean std max
> > running_avg 455.387449 22.940168 492.0
> > util_avg 433.233288 17.395477 458.0
> >
> > Task2 (waking up at same time of Task1 and running before):
> > mean std max
> > running_avg 430.281834 22.405175 455.0
> > util_avg 421.745331 22.098873 456.0
> >
> > and if I compare Task1 above with another experiment where Task1 is
> > running alone:
> >
> > Task1 (running alone):
> > mean std min
> > running_avg 460.257895 22.103704 460.0
> > util_avg 435.119737 17.647556 461.0
>
> Wait, why again in this last case running_avg != util_avg? :)

I _think_ it's mostly due to the rounding errors we have because of the
reasons I've explained in the reply to Joel:

https://lkml.org/lkml/2018/6/5/559
20180605152156.GD32302@e110439-lin

at the end, while commenting about the division overhead.

I should try the above examples while tracking the full signal at
___update_load_avg() time.

--
#include <best/regards.h>

Patrick Bellasi

2018-06-05 19:34:20

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On Tue, Jun 05, 2018 at 04:21:56PM +0100, Patrick Bellasi wrote:
[..]
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index f74441be3f44..5d54d6a4c31f 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > > sa->runnable_load_sum =
> > > decay_load(sa->runnable_load_sum, periods);
> > > sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > > + if (running)
> > > + sa->running_sum = decay_load(sa->running_sum, periods);
> > >
> > > /*
> > > * Step 2
> > > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > > sa->load_sum += load * contrib;
> > > if (runnable)
> > > sa->runnable_load_sum += runnable * contrib;
> > > - if (running)
> > > + if (running) {
> > > sa->util_sum += contrib * scale_cpu;
> > > + sa->running_sum += contrib * scale_cpu;
> > > + }
> > >
> > > return periods;
> > > }
> > > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > > WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > > }
> >
> > PELT changes look nice and makes sense :)
>
> That's not strictly speaking a PELT change... the idea is more to work
> "on top of PELT" to make it more effective in measuring a task's
> expected CPU bandwidth requirement.

I meant "PELT change" as in a change to the code that calculates PELT signals..

> > > +static inline void util_est_enqueue_running(struct task_struct *p)
> > > +{
> > > + /* Initialize the (non-preempted) utilization */
> > > + p->se.avg.running_sum = p->se.avg.util_sum;
> > > +}
> > > +
> > > /*
> > > * Check if a (signed) value is within a specified (unsigned) margin,
> > > * based on the observation that:
> > > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > > * Skip update of task's estimated utilization when its EWMA is
> > > * already ~1% close to its last activation value.
> > > */
> > > - ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > > + ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> >
> > I guess we are doing extra division here which adds some cost. Does
> > performance look Ok with the change?
>
> This extra division is there and done only at dequeue time instead of
> doing it at each update_load_avg.

I know. :)

> To be more precise, at each ___update_load_avg we should really update
> running_avg by:
>
> u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
> sa->running_avg = sa->running_sum / divider;
>
> but, this would imply tracking an additional signal in sched_avg and
> doing an additional division at ___update_load_avg() time.
>
> Morten suggested that, if we accept the rounding errors due to
> considering
>
> divider ~= LOAD_AVG_MAX
>
> thus discarding the (sa->period_contrib - 1024) correction, then we
> can completely skip the tracking of running_avg (thus saving space in
> sched_avg) and approximate it at dequeue time as per the code line,
> just to compute the new util_est sample to accumulate.
>
> Does that make sense now?

The patch always made sense to me.. I was just pointing out the extra
division this patch adds. I agree that since it's done only on dequeue,
it's probably OK to do..

thanks,

- Joel


2018-06-05 19:44:37

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On Tue, Jun 05, 2018 at 12:33:17PM -0700, Joel Fernandes wrote:
> On Tue, Jun 05, 2018 at 04:21:56PM +0100, Patrick Bellasi wrote:
[..]
> > To be more precise, at each ___update_load_avg we should really update
> > running_avg by:
> >
> > u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;
> > sa->running_avg = sa->running_sum / divider;
> >
> > but, this would imply tracking an additional signal in sched_avg and
> > doing an additional division at ___update_load_avg() time.
> >
> > Morten suggested that, if we accept the rounding errors due to
> > considering
> >
> > divider ~= LOAD_AVG_MAX
> >
> > thus discarding the (sa->period_contrib - 1024) correction, then we
> > can completely skip the tracking of running_avg (thus saving space in
> > sched_avg) and approximate it at dequeue time as per the code line,
> > just to compute the new util_est sample to accumulate.
> >
> > Does that make sense now?
>
> The patch always made sense to me.. I was just pointing out the extra
> division this patch adds. I agree that since it's done only on dequeue,
> it's probably OK to do..
>

One thing to note about this error is that, as I remember, not
compensating for it would make the utilization decrease... which is kind
of weird. But yeah, if we can find a way to compensate for the error and
also keep the overhead low, that's super.. I know this is probably low
priority considering the other concerns Vincent brought up, which are
being discussed in the other thread... but just saying. :)

- Joel


2018-06-05 20:47:31

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On Tue, Jun 05, 2018 at 05:54:31PM +0100, Patrick Bellasi wrote:
> On 05-Jun 17:31, Juri Lelli wrote:
> > On 05/06/18 16:11, Patrick Bellasi wrote:
> >
> > [...]
> >
> > > If I run an experiment with your example above, while using the
> > > performance governor to rule out any possible scale invariance
> > > difference, here is what I measure:
> > >
> > > Task1 (40ms delayed by the following Task2):
> > > mean std max
> > > running_avg 455.387449 22.940168 492.0
> > > util_avg 433.233288 17.395477 458.0
> > >
> > > Task2 (waking up at same time of Task1 and running before):
> > > mean std max
> > > running_avg 430.281834 22.405175 455.0
> > > util_avg 421.745331 22.098873 456.0
> > >
> > > and if I compare Task1 above with another experiment where Task1 is
> > > running alone:
> > >
> > > Task1 (running alone):
> > > mean std min
> > > running_avg 460.257895 22.103704 460.0
> > > util_avg 435.119737 17.647556 461.0
> >
> > Wait, why again in this last case running_avg != util_avg? :)
>
> I _think_ it's mostly due to the rounding errors we have because of the
> reasons I've explained in the reply to Joel:
>
> https://lkml.org/lkml/2018/6/5/559
> 20180605152156.GD32302@e110439-lin
>
> at the end, while commenting about the division overhead.
>
> I should try the above examples while tracking the full signal at
> ___update_load_avg() time.

Is that the only issue? I think if a CFS task is blocked by another CFS task
due to preemption, then with your patch we would account that blocked time
in the blocked task's running utilization as well, which seems incorrect.
Or did I miss something?

thanks,

- Joel


2018-06-05 23:16:50

by Saravana Kannan

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

On 06/05/2018 01:46 PM, Joel Fernandes wrote:
> On Tue, Jun 05, 2018 at 05:54:31PM +0100, Patrick Bellasi wrote:
>> On 05-Jun 17:31, Juri Lelli wrote:
>>> On 05/06/18 16:11, Patrick Bellasi wrote:
>>>
>>> [...]
>>>
>>>> If I run an experiment with your example above, while using the
>>>> performance governor to rule out any possible scale invariance
>>>> difference, here is what I measure:
>>>>
>>>> Task1 (40ms delayed by the following Task2):
>>>> mean std max
>>>> running_avg 455.387449 22.940168 492.0
>>>> util_avg 433.233288 17.395477 458.0
>>>>
>>>> Task2 (waking up at same time of Task1 and running before):
>>>> mean std max
>>>> running_avg 430.281834 22.405175 455.0
>>>> util_avg 421.745331 22.098873 456.0
>>>>
>>>> and if I compare Task1 above with another experiment where Task1 is
>>>> running alone:
>>>>
>>>> Task1 (running alone):
>>>> mean std min
>>>> running_avg 460.257895 22.103704 460.0
>>>> util_avg 435.119737 17.647556 461.0
>>> Wait, why again in this last case running_avg != util_avg? :)
>> I _think_ it's mostly due to the rounding errors we have because of the
>> reasons I've explained in the reply to Joel:
>>
>> https://lkml.org/lkml/2018/6/5/559
>> 20180605152156.GD32302@e110439-lin
>>
>> at the end, while commenting about the division overhead.
>>
>> I should try the above examples while tracking the full signal at
>> ___update_load_avg() time.
> Is that the only issue? I think if a CFS task is blocked by another CFS task
> due to preemption, then with your patch we would account the CFS blocked time
> as well into the blocked task's running utilization, which seems incorrect.
> Or did I miss something?
This is my concern too. This will negatively affect any task packing
because more tasks are going to be runnable but not running, and that's
going to increase the overall frequency (I'm assuming you want to use
this for frequency guidance eventually?).

-Saravana

--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


2018-06-06 08:29:06

by Vincent Guittot

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

Le Tuesday 05 Jun 2018 à 16:11:29 (+0100), Patrick Bellasi a écrit :
> Hi Vincent,
>
> On 05-Jun 08:57, Vincent Guittot wrote:
> > On 4 June 2018 at 18:06, Patrick Bellasi <[email protected]> wrote:
>
> [...]
>
> > > Let's improve the estimated utilization by adding a new "sort-of" PELT
> > > signal, explicitly only for SE which has the following behavior:
> > > a) at each enqueue time of a task, its value is the (already decayed)
> > > util_avg of the task being enqueued
> > > b) it's updated at each update_load_avg
> > > c) it can just increase, whenever the task is actually RUNNING on a
> > > CPU, while it's kept stable while the task is RUNNANBLE but not
> > > actively consuming CPU bandwidth
> > >
> > > Such a defined signal is exactly equivalent to the util_avg for a task
> > > running alone on a CPU while, in case the task is preempted, it allows
> > > to know at dequeue time how much would have been the task utilization if
> > > it was running alone on that CPU.
> >
> > I don't agree with this statement above
> > Let's say that you have 2 periodic tasks which want to run 4ms in every
> > period of 10ms and wakes up at the same time.
> > One task will execute 1st and the other will wait but at the end they
> > have the same period and running time and as a result the same
> > util_avg which is exactly what they need.
>
> In your example above you say that both tasks will end up with a 40%
> util_avg, and that's exactly also the value reported by the new
> "running_avg" metric.

It's not really possible, because the util_avg of the 2 tasks already
gives the exact same, correct value. So running_avg, which removes some
decay for the 2nd task, can't give the same result.

I add more details below

>
>
> Both tasks, if they where running alone, they would have generated a
> 40% utilization, and above I'm saying:
>
> it allows to know at dequeue time
> how much would have been the task utilization
> if it was running alone on that CPU
>
> Don't see where is not correct, maybe I should explain it better?
>
> > > This new signal is named "running_avg", since it tracks the actual
> > > RUNNING time of a task by ignoring any form of preemption.
> >
> > In fact, you still take into account this preemption as you remove
> > some time whereas it's only a shift of the execution
>
> When the task is enqueued we cache the (already decayed) util_avg, and
> from this time on the running_avg can only increase. It increases only
> for the portion of CPU time the task is running and it's never decayed
> while the task is preempted.
>
> In your example above, if we look at the second task running, the one
> "delayed", you will have:
>
> @t1 wakeup time: running_avg@t1 = util_avg@t1
> since we initialize it with the
> "sleep decayed" util_avg
>
> NOTE: this initialization is the important part, more on that on your
> next comment below related to the period being changed.
>
> @t2 switch_in time: running_avg@t2 = running_avg@t1
> since it's not decayed while RUNNABLE but !RUNNING
>
> while, meanwhile:
>
> util_avg@t2 < util_avg@t1
> since it's decayed while RUNNABLE but !RUNNING
>
> @t3 dequeue time: running_avg@t3 > running_avg@t2
>
>
> When you say "it's only a shift of the execution" I agree, and indeed
> the running_avg is not affected at all.
>
> So, we can say that I'm "accounting for preemption time" but really,
> the way I do it, is just by _not decaying_ PELT when the task is:
>
> RUNNABLE but !RUNNING
>
> and that's why I say that running_avg gives you the same behavior of
> util_avg _if_ the task was running alone, because in that case you never
> have the condition above, and thus util_avg is never decayed.
>

For the 2 tasks of the example above we have the pattern

Task 1
state RRRRSSSSSSERRRRSSSSSSERRRRSSSSSS
util_avg AAAADDDDDD AAAADDDDDD AAAADDDDDD

Task 2
state WWWWRRRRSSEWWWWRRRRSSEWWWWRRRRSS
util_avg DDDDAAAADD DDDDAAAADD DDDDAAAADD
running_avg AAAADDC AAAADDC AAAADD

R: Running 1ms, S: Sleep 1ms, W: Wait 1ms, E: Enqueue event
A: Accumulate 1ms, D: Decay 1ms, C: Copy util_avg value

the util_avg of T1 and T2 have the same pattern which is:
AAAADDDDDDAAAADDDDDDAAAADDDDDD
and as a result, the same value which represents their utilization

For the running_avg of T2, there are only 2 decays after the last running
phase and before the next enqueue,
so the pattern will be AAAADDAAAA
instead of the AAAADDDDDDAAAA

the running_avg will report a higher value than reality

>
>
> [...]
>
> > > @@ -3161,6 +3161,8 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > > sa->runnable_load_sum =
> > > decay_load(sa->runnable_load_sum, periods);
> > > sa->util_sum = decay_load((u64)(sa->util_sum), periods);
> > > + if (running)
> > > + sa->running_sum = decay_load(sa->running_sum, periods);
> >
> > so you make some time disappearing from the equation as the signal
> > will not be decayed and make the period looking shorter than reality.
>
> Since at enqueue time we always initialize running_avg to whatever is
> util_avg, I don't think we are messing up with the period at all.
>
> The util_avg signal properly accounts for the period.
> Thus, the combined effect of:
>
> - initializing running_avg at enqueue time with the value of
> util_avg, already decayed to properly account for the task period
> - not decaying running_avg when the task is RUNNABLE but !RUNNING
>
> should just result into "virtually" considering the two tasks of your
> example "as if" they was running concurrently on two different CPUs.
>
> Isn't it?

No, because part of the "sleep" phase is done during the waiting time,
and you don't take this sleep time into account

see my explanation above

>
> > With the example I mentioned above, the 2nd task will be seen as if
> > its period becomes 6ms instead of 10ms which is not true and the
> > utilization of the task is overestimated without any good reason
>
> I don't see that overestimation... and I cannot even measure it.
>
> If I run an experiment with your example above, while using the
> performance governor to rule out any possible scale invariance
> difference, here is what I measure:
>
> Task1 (40ms delayed by the following Task2):

the 40ms above is a typo? you mean 4ms, isn't it?

> mean std max
> running_avg 455.387449 22.940168 492.0

the 492 is the overestimation generated by running_avg and not a rounding
effect. With 4ms and 10ms the number of missing decay steps is small but if
you use 40ms/100ms instead you will see a greater difference in the max number

> util_avg 433.233288 17.395477 458.0
>
> Task2 (waking up at same time of Task1 and running before):
> mean std max
> running_avg 430.281834 22.405175 455.0
> util_avg 421.745331 22.098873 456.0
>
> and if I compare Task1 above with another experiment where Task1 is
> running alone:
>
> Task1 (running alone):
> mean std min
> running_avg 460.257895 22.103704 460.0
> util_avg 435.119737 17.647556 461.0
>
>
> Thus, there are some variations ok, but I think they are more due to
> the rounding errors we have.
> Task1 is still seen as a 455 expected utilization, which is not the
> 4/6 ~= 660 you say above if it should be accounted as a task running
> 4ms every 6ms.
>
> > Furthermore, this new signal doesn't have clear meaning because it's
> > not utilization but it's also not the waiting time of the task.
>
> The meaning is: the utilization of the task _if_ it should be running
> alone on that CPU. The numbers above show that, since
> 455 (Task1 delayed) ~= 460 (Task1 running alone)

you should look at the max and not the mean: that's what util_est is
interested in, IIUC

Vincent

>
> If it's running alone it would be exactly util_avg (minus rounding
> errors), but if it's preempted it will be a better representation of
> the task's needed CPU bandwidth.
>
> Unless we really need to track the "waiting time of a task" I would
> say that the signal above should make util_est much more accurate on
> knowing, at wakeup time, how much CPU a task will likely need...
> independently from preemption sources.
>
> Here are some other experiments / workloads I'm testing:
>
> https://gist.github.com/derkling/fe01e5ddf639e18b5d85a3b1d2e84b7d
>
> > >
> > > /*
> > > * Step 2
> > > @@ -3176,8 +3178,10 @@ accumulate_sum(u64 delta, int cpu, struct sched_avg *sa,
> > > sa->load_sum += load * contrib;
> > > if (runnable)
> > > sa->runnable_load_sum += runnable * contrib;
> > > - if (running)
> > > + if (running) {
> > > sa->util_sum += contrib * scale_cpu;
> > > + sa->running_sum += contrib * scale_cpu;
> > > + }
> > >
> > > return periods;
> > > }
> > > @@ -3963,6 +3967,12 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
> > > WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
> > > }
> > >
> > > +static inline void util_est_enqueue_running(struct task_struct *p)
> > > +{
> > > + /* Initialize the (non-preempted) utilization */
> > > + p->se.avg.running_sum = p->se.avg.util_sum;
>
> This line above is what should ensure we do not mess up with the task
> period.
>
> > > +}
> > > +
> > > /*
> > > * Check if a (signed) value is within a specified (unsigned) margin,
> > > * based on the observation that:
> > > @@ -4018,7 +4028,7 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
> > > * Skip update of task's estimated utilization when its EWMA is
> > > * already ~1% close to its last activation value.
> > > */
> > > - ue.enqueued = (task_util(p) | UTIL_AVG_UNCHANGED);
> > > + ue.enqueued = p->se.avg.running_sum / LOAD_AVG_MAX;
> > > last_ewma_diff = ue.enqueued - ue.ewma;
> > > if (within_margin(last_ewma_diff, (SCHED_CAPACITY_SCALE / 100)))
> > > return;
> > > @@ -5437,6 +5447,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > if (!se)
> > > add_nr_running(rq, 1);
> > >
> > > + util_est_enqueue_running(p);
> > > +
> > > hrtick_update(rq);
> > > }
> > >
> > > --
> > > 2.15.1
> > >
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi

2018-06-06 10:39:34

by Patrick Bellasi

[permalink] [raw]
Subject: Re: [PATCH 2/2] sched/fair: util_est: add running_sum tracking

Hi Vincent,

On 06-Jun 10:26, Vincent Guittot wrote:

[...]

> For the 2 tasks of the example above we have the pattern
>
> Task 1
> state RRRRSSSSSSERRRRSSSSSSERRRRSSSSSS
> util_avg AAAADDDDDD AAAADDDDDD AAAADDDDDD
>
> Task 2
> state WWWWRRRRSSEWWWWRRRRSSEWWWWRRRRSS
> util_avg DDDDAAAADD DDDDAAAADD DDDDAAAADD
> running_avg AAAADDC AAAADDC AAAADD
>
> R: Running 1ms, S: Sleep 1ms, W: Wait 1ms, E: Enqueue event
> A: Accumulate 1ms, D: Decay 1ms, C: Copy util_avg value
>
> the util_avg of T1 and T2 have the same pattern which is:
> AAAADDDDDDAAAADDDDDDAAAADDDDDD
> and as a result, the same value which represents their utilization
>
> For the running_avg of T2, there are only 2 decays after the last running
> phase and before the next enqueue,
> so the pattern will be AAAADDAAAA
> instead of the AAAADDDDDDAAAA
>
> the running_avg will report a higher value than reality

Right!... Your example above is really useful, thanks.

Reasoning along the same lines, we can easily see that a 50% CFS task
co-scheduled with a 50% RT task, which delays the CFS one and has the
same period, will make the CFS task appear as a 100% task.

Which is definitely not what we want to sample as estimated
utilization.

The good news, if we like, is instead that util_avg is already
correctly reporting 50% at the end of each activation and thus, when
we collect util_est samples we already have the best utilization value
we can collect for a task.

The only time we collect "wrong" estimation samples is when there
is no idle time; thus, eventually, util_est should be improved by
discarding samples in those cases... but I'm not entirely sure if and
how we can detect them.

Or just to ensure we have idle time... as you are proposing in
the other thread.

Thanks again for pointing out the issue above.

--
#include <best/regards.h>

Patrick Bellasi