2023-12-01 16:17:27

by Vincent Guittot

Subject: [PATCH v2 0/2] Simplify Util_est

Following the comment in [1], I have prepared a patch to remove UTIL_EST_FASTUP.
This enables us to simplify the util_est behavior, as proposed in patch 2.

Changes since v1:
- Also update the Chinese translation
- Add review tags
- Remove remaining references to ue.enqueued and move some defines

[1] https://lore.kernel.org/lkml/CAKfTPtCAZWp7tRgTpwJmyEAkyN65acmYrfu9naEUpBZVWNTcQA@mail.gmail.com/

Vincent Guittot (2):
sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
sched/fair: Simplify util_est

Documentation/scheduler/schedutil.rst | 7 +-
.../zh_CN/scheduler/schedutil.rst | 7 +-
include/linux/sched.h | 49 +++--------
kernel/sched/debug.c | 7 +-
kernel/sched/fair.c | 86 +++++++------------
kernel/sched/features.h | 1 -
kernel/sched/pelt.h | 4 +-
7 files changed, 55 insertions(+), 106 deletions(-)

--
2.34.1


2023-12-01 16:17:30

by Vincent Guittot

Subject: [PATCH v2 1/2] sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)

sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
in order to check for possibly related regressions. After 3 years, it has
never been used and no regression has been reported. Let's remove it
and make fast increase a permanent behavior.
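
For illustration only (a standalone sketch, not kernel code; WEIGHT_SHIFT and
ewma_update are made-up names for the sketch), the asymmetric filter that
becomes unconditional with this patch behaves as follows: the estimate jumps
straight to the dequeue-time utilization when it increases, and is only
smoothed with a 1/4 weight when it decreases.

#include <stdio.h>

#define WEIGHT_SHIFT	2	/* w = 1/4, mirrors UTIL_EST_WEIGHT_SHIFT */

static unsigned int ewma_update(unsigned int ewma, unsigned int util)
{
	unsigned int diff;

	if (ewma < util)	/* fast up: jump directly to the new value */
		return util;

	/* slow down: new = 3/4 * ewma + 1/4 * util, in fixed point */
	diff = ewma - util;
	ewma <<= WEIGHT_SHIFT;
	ewma -= diff;
	ewma >>= WEIGHT_SHIFT;

	return ewma;
}

int main(void)
{
	/* the task ramps up: the estimate follows immediately */
	printf("%u\n", ewma_update(300, 800));	/* prints 800 */
	/* a transient drop: the estimate only loses 1/4 of the gap */
	printf("%u\n", ewma_update(800, 200));	/* prints 650 */
	return 0;
}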

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-and-tested-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Hongyan Xia <[email protected]>
Reviewed-by: Tang Yizhou <[email protected]>
---
Documentation/scheduler/schedutil.rst | 7 +++----
Documentation/translations/zh_CN/scheduler/schedutil.rst | 7 +++----
kernel/sched/fair.c | 8 +++-----
kernel/sched/features.h | 1 -
4 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/Documentation/scheduler/schedutil.rst b/Documentation/scheduler/schedutil.rst
index 32c7d69fc86c..803fba8fc714 100644
--- a/Documentation/scheduler/schedutil.rst
+++ b/Documentation/scheduler/schedutil.rst
@@ -90,8 +90,8 @@ For more detail see:
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


-UTIL_EST / UTIL_EST_FASTUP
-==========================
+UTIL_EST
+========

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
@@ -99,8 +99,7 @@ though when running their expected utilization will be the same, they suffer a

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
-highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
-filter to instantly increase and only decay on decrease.
+highest. UTIL_EST filters to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

diff --git a/Documentation/translations/zh_CN/scheduler/schedutil.rst b/Documentation/translations/zh_CN/scheduler/schedutil.rst
index d1ea68007520..7c8d87f21c42 100644
--- a/Documentation/translations/zh_CN/scheduler/schedutil.rst
+++ b/Documentation/translations/zh_CN/scheduler/schedutil.rst
@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最
- Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


-UTIL_EST / UTIL_EST_FASTUP
-==========================
+UTIL_EST
+========

由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
因此它们在再次运行后会面临(DVFS)的上涨。

为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应
(Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。
-另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加,
-仅在利用率下降时衰减。
+UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。

进一步,运行队列的(可运行任务的)利用率之和由下式计算:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..e94d65da8d66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4870,11 +4870,9 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* to smooth utilization decreases.
*/
ue.enqueued = task_util(p);
- if (sched_feat(UTIL_EST_FASTUP)) {
- if (ue.ewma < ue.enqueued) {
- ue.ewma = ue.enqueued;
- goto done;
- }
+ if (ue.ewma < ue.enqueued) {
+ ue.ewma = ue.enqueued;
+ goto done;
}

/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ddf84de430..143f55df890b 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,7 +83,6 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
-SCHED_FEAT(UTIL_EST_FASTUP, true)

SCHED_FEAT(LATENCY_WARN, false)

--
2.34.1

2023-12-01 16:17:49

by Vincent Guittot

Subject: [PATCH v2 2/2] sched/fair: Simplify util_est

With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and remove the enqueued field.
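
As a quick standalone check (not kernel code; the helper below is made up for
the sketch and ignores the margin, capacity and runnable checks), the invariant
this relies on can be exercised directly: because the ewma jumps up to the
dequeue-time utilization and, on a decrease, only moves part of the way down
towards it, it never ends up below the value that used to be kept in the
enqueued field, so max(ewma, enqueued) always returned the ewma and the field
is redundant.

#include <assert.h>
#include <stdio.h>

/* fast-up / slow-down filter in closed form: 3/4 old + 1/4 new */
static unsigned int ewma_update(unsigned int ewma, unsigned int util)
{
	if (ewma <= util)
		return util;

	return (3 * ewma + util) >> 2;
}

int main(void)
{
	unsigned int ewma = 0, util, step;

	/* feed an arbitrary utilization pattern in [0..1024] */
	for (step = 0; step < 100000; step++) {
		util = (step * 37 + 11) % 1025;
		ewma = ewma_update(ewma, util);
		/* what used to be max(ewma, enqueued) is always just ewma */
		assert(ewma >= util);
	}

	printf("the ewma never drops below the last dequeue-time utilization\n");
	return 0;
}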

Signed-off-by: Vincent Guittot <[email protected]>
Reviewed-and-tested-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Hongyan Xia <[email protected]>
---
include/linux/sched.h | 49 +++++++-------------------
kernel/sched/debug.c | 7 ++--
kernel/sched/fair.c | 82 ++++++++++++++++---------------------------
kernel/sched/pelt.h | 4 +--
4 files changed, 48 insertions(+), 94 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d258162deb0..03bfe9ab2951 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -415,42 +415,6 @@ struct load_weight {
u32 inv_weight;
};

-/**
- * struct util_est - Estimation utilization of FAIR tasks
- * @enqueued: instantaneous estimated utilization of a task/cpu
- * @ewma: the Exponential Weighted Moving Average (EWMA)
- * utilization of a task
- *
- * Support data structure to track an Exponential Weighted Moving Average
- * (EWMA) of a FAIR task's utilization. New samples are added to the moving
- * average each time a task completes an activation. Sample's weight is chosen
- * so that the EWMA will be relatively insensitive to transient changes to the
- * task's workload.
- *
- * The enqueued attribute has a slightly different meaning for tasks and cpus:
- * - task: the task's util_avg at last task dequeue time
- * - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
- * Thus, the util_est.enqueued of a task represents the contribution on the
- * estimated utilization of the CPU where that task is currently enqueued.
- *
- * Only for tasks we track a moving average of the past instantaneous
- * estimated utilization. This allows to absorb sporadic drops in utilization
- * of an otherwise almost periodic task.
- *
- * The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
- * updates. When a task is dequeued, its util_est should not be updated if its
- * util_avg has not been updated in the meantime.
- * This information is mapped into the MSB bit of util_est.enqueued at dequeue
- * time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
- * for a task) it is safe to use MSB.
- */
-struct util_est {
- unsigned int enqueued;
- unsigned int ewma;
-#define UTIL_EST_WEIGHT_SHIFT 2
-#define UTIL_AVG_UNCHANGED 0x80000000
-} __attribute__((__aligned__(sizeof(u64))));
-
/*
* The load/runnable/util_avg accumulates an infinite geometric series
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
@@ -505,9 +469,20 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_avg;
unsigned long util_avg;
- struct util_est util_est;
+ unsigned int util_est;
} ____cacheline_aligned;

+/*
+ * The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
+ * updates. When a task is dequeued, its util_est should not be updated if its
+ * util_avg has not been updated in the meantime.
+ * This information is mapped into the MSB bit of util_est at dequeue time.
+ * Since max value of util_est for a task is 1024 (PELT util_avg for a task)
+ * it is safe to use MSB.
+ */
+#define UTIL_EST_WEIGHT_SHIFT 2
+#define UTIL_AVG_UNCHANGED 0x80000000
+
struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 168eecc209b4..8d5d98a5834d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -684,8 +684,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
- SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
- cfs_rq->avg.util_est.enqueued);
+ SEQ_printf(m, " .%-30s: %u\n", "util_est",
+ cfs_rq->avg.util_est);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -1075,8 +1075,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
- P(se.avg.util_est.ewma);
- PM(se.avg.util_est.enqueued, ~UTIL_AVG_UNCHANGED);
+ PM(se.avg.util_est, ~UTIL_AVG_UNCHANGED);
#endif
#ifdef CONFIG_UCLAMP_TASK
__PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e94d65da8d66..823dd76d0546 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4781,9 +4781,7 @@ static inline unsigned long task_runnable(struct task_struct *p)

static inline unsigned long _task_util_est(struct task_struct *p)
{
- struct util_est ue = READ_ONCE(p->se.avg.util_est);
-
- return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
+ return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
}

static inline unsigned long task_util_est(struct task_struct *p)
@@ -4800,9 +4798,9 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
return;

/* Update root cfs_rq's estimated utilization */
- enqueued = cfs_rq->avg.util_est.enqueued;
+ enqueued = cfs_rq->avg.util_est;
enqueued += _task_util_est(p);
- WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+ WRITE_ONCE(cfs_rq->avg.util_est, enqueued);

trace_sched_util_est_cfs_tp(cfs_rq);
}
@@ -4816,34 +4814,20 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
return;

/* Update root cfs_rq's estimated utilization */
- enqueued = cfs_rq->avg.util_est.enqueued;
+ enqueued = cfs_rq->avg.util_est;
enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
- WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+ WRITE_ONCE(cfs_rq->avg.util_est, enqueued);

trace_sched_util_est_cfs_tp(cfs_rq);
}

#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)

-/*
- * Check if a (signed) value is within a specified (unsigned) margin,
- * based on the observation that:
- *
- * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
- *
- * NOTE: this only works when value + margin < INT_MAX.
- */
-static inline bool within_margin(int value, int margin)
-{
- return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
-}
-
static inline void util_est_update(struct cfs_rq *cfs_rq,
struct task_struct *p,
bool task_sleep)
{
- long last_ewma_diff, last_enqueued_diff;
- struct util_est ue;
+ unsigned int ewma, dequeued, last_ewma_diff;

if (!sched_feat(UTIL_EST))
return;
@@ -4855,23 +4839,25 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
if (!task_sleep)
return;

+ /* Get current estimate of utilization */
+ ewma = READ_ONCE(p->se.avg.util_est);
+
/*
* If the PELT values haven't changed since enqueue time,
* skip the util_est update.
*/
- ue = p->se.avg.util_est;
- if (ue.enqueued & UTIL_AVG_UNCHANGED)
+ if (ewma & UTIL_AVG_UNCHANGED)
return;

- last_enqueued_diff = ue.enqueued;
+ /* Get utilization at dequeue */
+ dequeued = task_util(p);

/*
* Reset EWMA on utilization increases, the moving average is used only
* to smooth utilization decreases.
*/
- ue.enqueued = task_util(p);
- if (ue.ewma < ue.enqueued) {
- ue.ewma = ue.enqueued;
+ if (ewma <= dequeued) {
+ ewma = dequeued;
goto done;
}

@@ -4879,27 +4865,22 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* Skip update of task's estimated utilization when its members are
* already ~1% close to its last activation value.
*/
- last_ewma_diff = ue.enqueued - ue.ewma;
- last_enqueued_diff -= ue.enqueued;
- if (within_margin(last_ewma_diff, UTIL_EST_MARGIN)) {
- if (!within_margin(last_enqueued_diff, UTIL_EST_MARGIN))
- goto done;
-
- return;
- }
+ last_ewma_diff = ewma - dequeued;
+ if (last_ewma_diff < UTIL_EST_MARGIN)
+ goto done;

/*
* To avoid overestimation of actual task utilization, skip updates if
* we cannot grant there is idle time in this CPU.
*/
- if (task_util(p) > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
+ if (dequeued > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
return;

/*
* To avoid underestimate of task utilization, skip updates of EWMA if
* we cannot grant that thread got all CPU time it wanted.
*/
- if ((ue.enqueued + UTIL_EST_MARGIN) < task_runnable(p))
+ if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
goto done;


@@ -4907,25 +4888,24 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* Update Task's estimated utilization
*
* When *p completes an activation we can consolidate another sample
- * of the task size. This is done by storing the current PELT value
- * as ue.enqueued and by using this value to update the Exponential
- * Weighted Moving Average (EWMA):
+ * of the task size. This is done by using this value to update the
+ * Exponential Weighted Moving Average (EWMA):
*
* ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
* = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
* = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
- * = w * ( last_ewma_diff ) + ewma(t-1)
- * = w * (last_ewma_diff + ewma(t-1) / w)
+ * = w * ( -last_ewma_diff ) + ewma(t-1)
+ * = w * (-last_ewma_diff + ewma(t-1) / w)
*
* Where 'w' is the weight of new samples, which is configured to be
* 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
*/
- ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
- ue.ewma += last_ewma_diff;
- ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
+ ewma <<= UTIL_EST_WEIGHT_SHIFT;
+ ewma -= last_ewma_diff;
+ ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
- ue.enqueued |= UTIL_AVG_UNCHANGED;
- WRITE_ONCE(p->se.avg.util_est, ue);
+ ewma |= UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(p->se.avg.util_est, ewma);

trace_sched_util_est_se_tp(&p->se);
}
@@ -7653,16 +7633,16 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
if (sched_feat(UTIL_EST)) {
unsigned long util_est;

- util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ util_est = READ_ONCE(cfs_rq->avg.util_est);

/*
* During wake-up @p isn't enqueued yet and doesn't contribute
- * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
+ * to any cpu_rq(cpu)->cfs.avg.util_est.
* If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
* has been enqueued.
*
* During exec (@dst_cpu = -1) @p is enqueued and does
- * contribute to cpu_rq(cpu)->cfs.util_est.enqueued.
+ * contribute to cpu_rq(cpu)->cfs.util_est.
* Remove it to "simulate" cpu_util without @p's contribution.
*
* Despite the task_on_rq_queued(@p) check there is still a
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 3a0e0dc28721..9e1083465fbc 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -52,13 +52,13 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
return;

/* Avoid store if the flag has been already reset */
- enqueued = avg->util_est.enqueued;
+ enqueued = avg->util_est;
if (!(enqueued & UTIL_AVG_UNCHANGED))
return;

/* Reset flag to report util_avg has been updated */
enqueued &= ~UTIL_AVG_UNCHANGED;
- WRITE_ONCE(avg->util_est.enqueued, enqueued);
+ WRITE_ONCE(avg->util_est, enqueued);
}

static inline u64 rq_clock_pelt(struct rq *rq)
--
2.34.1

2023-12-02 02:42:29

by Yanteng Si

Subject: Re: [PATCH v2 1/2] sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)


On 2023/12/2 00:16, Vincent Guittot wrote:
> sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
> in order to check for possibly related regressions. After 3 years, it has
> never been used and no regression has been reported. Let's remove it
> and make fast increase a permanent behavior.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Reviewed-and-tested-by: Lukasz Luba <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Reviewed-by: Hongyan Xia <[email protected]>
> Reviewed-by: Tang Yizhou <[email protected]>

For the Chinese translation,


Reviewed-by: Yanteng Si <[email protected]>


Thanks,

Yanteng


2023-12-02 23:46:18

by Qais Yousef

Subject: Re: [PATCH v2 2/2] sched/fair: Simplify util_est

On 12/01/23 17:16, Vincent Guittot wrote:

> /*
> * The load/runnable/util_avg accumulates an infinite geometric series
> * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
> @@ -505,9 +469,20 @@ struct sched_avg {
> unsigned long load_avg;
> unsigned long runnable_avg;
> unsigned long util_avg;
> - struct util_est util_est;
> + unsigned int util_est;
> } ____cacheline_aligned;

unsigned long would be better?

2023-12-04 09:54:49

by Vincent Guittot

Subject: Re: [PATCH v2 2/2] sched/fair: Simplify util_est

On Sun, 3 Dec 2023 at 00:38, Qais Yousef <[email protected]> wrote:
>
> On 12/01/23 17:16, Vincent Guittot wrote:
>
> > /*
> > * The load/runnable/util_avg accumulates an infinite geometric series
> > * (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
> > @@ -505,9 +469,20 @@ struct sched_avg {
> > unsigned long load_avg;
> > unsigned long runnable_avg;
> > unsigned long util_avg;
> > - struct util_est util_est;
> > + unsigned int util_est;
> > } ____cacheline_aligned;
>
> unsigned long would be better?

I thought about changing it to unsigned long, but I preferred to keep
using the same type as before for the ewma, since we don't need to
extend it.
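
As a minimal standalone sketch of that point (not kernel code): a task's
PELT utilization never exceeds SCHED_CAPACITY_SCALE (1024), so the value
and the UTIL_AVG_UNCHANGED flag in the MSB comfortably share one 32-bit
unsigned int, and widening the field to unsigned long would not buy
anything.

#include <assert.h>
#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024U
#define UTIL_AVG_UNCHANGED	0x80000000U

int main(void)
{
	unsigned int util_est = 742;		/* some task utilization <= 1024 */

	util_est |= UTIL_AVG_UNCHANGED;		/* flag set at dequeue time */
	assert((util_est & ~UTIL_AVG_UNCHANGED) == 742);

	util_est &= ~UTIL_AVG_UNCHANGED;	/* cleared once util_avg changes */
	assert(util_est <= SCHED_CAPACITY_SCALE);

	printf("value %u and the MSB flag fit in 32 bits\n", util_est);
	return 0;
}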

2023-12-07 03:46:01

by Alex Shi

Subject: Re: [PATCH v2 1/2] sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)

Nice cleanup.

Reviewed-by: Alex Shi <[email protected]>

On Sat, Dec 2, 2023 at 12:17 AM Vincent Guittot
<[email protected]> wrote:
>
> sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
> in order to check for possibly related regressions. After 3 years, it has
> never been used and no regression has been reported. Let's remove it
> and make fast increase a permanent behavior.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Reviewed-and-tested-by: Lukasz Luba <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Reviewed-by: Hongyan Xia <[email protected]>
> Reviewed-by: Tang Yizhou <[email protected]>

2023-12-07 03:47:29

by Alex Shi

Subject: Re: [PATCH v2 2/2] sched/fair: Simplify util_est

Looks good to me.

Reviewed-by: Alex Shi <[email protected]>

On Sat, Dec 2, 2023 at 12:17 AM Vincent Guittot
<[email protected]> wrote:
>
> With UTIL_EST_FASTUP now being permanent, we can take advantage of the
> fact that the ewma jumps directly to a higher utilization at dequeue to
> simplify util_est and remove the enqueued field.
>
> Signed-off-by: Vincent Guittot <[email protected]>
> Reviewed-and-tested-by: Lukasz Luba <[email protected]>
> Reviewed-by: Dietmar Eggemann <[email protected]>
> Reviewed-by: Hongyan Xia <[email protected]>

Subject: [tip: sched/core] sched/fair: Simplify util_est

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 11137d384996bb05cf33c8163db271e1bac3f4bf
Gitweb: https://git.kernel.org/tip/11137d384996bb05cf33c8163db271e1bac3f4bf
Author: Vincent Guittot <[email protected]>
AuthorDate: Fri, 01 Dec 2023 17:16:52 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sat, 23 Dec 2023 15:59:58 +01:00

sched/fair: Simplify util_est

With UTIL_EST_FASTUP now being permanent, we can take advantage of the
fact that the ewma jumps directly to a higher utilization at dequeue to
simplify util_est and remove the enqueued field.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Lukasz Luba <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Hongyan Xia <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
include/linux/sched.h | 49 ++++++-------------------
kernel/sched/debug.c | 7 +---
kernel/sched/fair.c | 82 +++++++++++++++---------------------------
kernel/sched/pelt.h | 4 +-
4 files changed, 48 insertions(+), 94 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8d25816..03bfe9a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -415,42 +415,6 @@ struct load_weight {
u32 inv_weight;
};

-/**
- * struct util_est - Estimation utilization of FAIR tasks
- * @enqueued: instantaneous estimated utilization of a task/cpu
- * @ewma: the Exponential Weighted Moving Average (EWMA)
- * utilization of a task
- *
- * Support data structure to track an Exponential Weighted Moving Average
- * (EWMA) of a FAIR task's utilization. New samples are added to the moving
- * average each time a task completes an activation. Sample's weight is chosen
- * so that the EWMA will be relatively insensitive to transient changes to the
- * task's workload.
- *
- * The enqueued attribute has a slightly different meaning for tasks and cpus:
- * - task: the task's util_avg at last task dequeue time
- * - cfs_rq: the sum of util_est.enqueued for each RUNNABLE task on that CPU
- * Thus, the util_est.enqueued of a task represents the contribution on the
- * estimated utilization of the CPU where that task is currently enqueued.
- *
- * Only for tasks we track a moving average of the past instantaneous
- * estimated utilization. This allows to absorb sporadic drops in utilization
- * of an otherwise almost periodic task.
- *
- * The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
- * updates. When a task is dequeued, its util_est should not be updated if its
- * util_avg has not been updated in the meantime.
- * This information is mapped into the MSB bit of util_est.enqueued at dequeue
- * time. Since max value of util_est.enqueued for a task is 1024 (PELT util_avg
- * for a task) it is safe to use MSB.
- */
-struct util_est {
- unsigned int enqueued;
- unsigned int ewma;
-#define UTIL_EST_WEIGHT_SHIFT 2
-#define UTIL_AVG_UNCHANGED 0x80000000
-} __attribute__((__aligned__(sizeof(u64))));
-
/*
* The load/runnable/util_avg accumulates an infinite geometric series
* (see __update_load_avg_cfs_rq() in kernel/sched/pelt.c).
@@ -505,9 +469,20 @@ struct sched_avg {
unsigned long load_avg;
unsigned long runnable_avg;
unsigned long util_avg;
- struct util_est util_est;
+ unsigned int util_est;
} ____cacheline_aligned;

+/*
+ * The UTIL_AVG_UNCHANGED flag is used to synchronize util_est with util_avg
+ * updates. When a task is dequeued, its util_est should not be updated if its
+ * util_avg has not been updated in the meantime.
+ * This information is mapped into the MSB bit of util_est at dequeue time.
+ * Since max value of util_est for a task is 1024 (PELT util_avg for a task)
+ * it is safe to use MSB.
+ */
+#define UTIL_EST_WEIGHT_SHIFT 2
+#define UTIL_AVG_UNCHANGED 0x80000000
+
struct sched_statistics {
#ifdef CONFIG_SCHEDSTATS
u64 wait_start;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 168eecc..8d5d98a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -684,8 +684,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
cfs_rq->avg.runnable_avg);
SEQ_printf(m, " .%-30s: %lu\n", "util_avg",
cfs_rq->avg.util_avg);
- SEQ_printf(m, " .%-30s: %u\n", "util_est_enqueued",
- cfs_rq->avg.util_est.enqueued);
+ SEQ_printf(m, " .%-30s: %u\n", "util_est",
+ cfs_rq->avg.util_est);
SEQ_printf(m, " .%-30s: %ld\n", "removed.load_avg",
cfs_rq->removed.load_avg);
SEQ_printf(m, " .%-30s: %ld\n", "removed.util_avg",
@@ -1075,8 +1075,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P(se.avg.runnable_avg);
P(se.avg.util_avg);
P(se.avg.last_update_time);
- P(se.avg.util_est.ewma);
- PM(se.avg.util_est.enqueued, ~UTIL_AVG_UNCHANGED);
+ PM(se.avg.util_est, ~UTIL_AVG_UNCHANGED);
#endif
#ifdef CONFIG_UCLAMP_TASK
__PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e94d65d..823dd76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4781,9 +4781,7 @@ static inline unsigned long task_runnable(struct task_struct *p)

static inline unsigned long _task_util_est(struct task_struct *p)
{
- struct util_est ue = READ_ONCE(p->se.avg.util_est);
-
- return max(ue.ewma, (ue.enqueued & ~UTIL_AVG_UNCHANGED));
+ return READ_ONCE(p->se.avg.util_est) & ~UTIL_AVG_UNCHANGED;
}

static inline unsigned long task_util_est(struct task_struct *p)
@@ -4800,9 +4798,9 @@ static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
return;

/* Update root cfs_rq's estimated utilization */
- enqueued = cfs_rq->avg.util_est.enqueued;
+ enqueued = cfs_rq->avg.util_est;
enqueued += _task_util_est(p);
- WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+ WRITE_ONCE(cfs_rq->avg.util_est, enqueued);

trace_sched_util_est_cfs_tp(cfs_rq);
}
@@ -4816,34 +4814,20 @@ static inline void util_est_dequeue(struct cfs_rq *cfs_rq,
return;

/* Update root cfs_rq's estimated utilization */
- enqueued = cfs_rq->avg.util_est.enqueued;
+ enqueued = cfs_rq->avg.util_est;
enqueued -= min_t(unsigned int, enqueued, _task_util_est(p));
- WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
+ WRITE_ONCE(cfs_rq->avg.util_est, enqueued);

trace_sched_util_est_cfs_tp(cfs_rq);
}

#define UTIL_EST_MARGIN (SCHED_CAPACITY_SCALE / 100)

-/*
- * Check if a (signed) value is within a specified (unsigned) margin,
- * based on the observation that:
- *
- * abs(x) < y := (unsigned)(x + y - 1) < (2 * y - 1)
- *
- * NOTE: this only works when value + margin < INT_MAX.
- */
-static inline bool within_margin(int value, int margin)
-{
- return ((unsigned int)(value + margin - 1) < (2 * margin - 1));
-}
-
static inline void util_est_update(struct cfs_rq *cfs_rq,
struct task_struct *p,
bool task_sleep)
{
- long last_ewma_diff, last_enqueued_diff;
- struct util_est ue;
+ unsigned int ewma, dequeued, last_ewma_diff;

if (!sched_feat(UTIL_EST))
return;
@@ -4855,23 +4839,25 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
if (!task_sleep)
return;

+ /* Get current estimate of utilization */
+ ewma = READ_ONCE(p->se.avg.util_est);
+
/*
* If the PELT values haven't changed since enqueue time,
* skip the util_est update.
*/
- ue = p->se.avg.util_est;
- if (ue.enqueued & UTIL_AVG_UNCHANGED)
+ if (ewma & UTIL_AVG_UNCHANGED)
return;

- last_enqueued_diff = ue.enqueued;
+ /* Get utilization at dequeue */
+ dequeued = task_util(p);

/*
* Reset EWMA on utilization increases, the moving average is used only
* to smooth utilization decreases.
*/
- ue.enqueued = task_util(p);
- if (ue.ewma < ue.enqueued) {
- ue.ewma = ue.enqueued;
+ if (ewma <= dequeued) {
+ ewma = dequeued;
goto done;
}

@@ -4879,27 +4865,22 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* Skip update of task's estimated utilization when its members are
* already ~1% close to its last activation value.
*/
- last_ewma_diff = ue.enqueued - ue.ewma;
- last_enqueued_diff -= ue.enqueued;
- if (within_margin(last_ewma_diff, UTIL_EST_MARGIN)) {
- if (!within_margin(last_enqueued_diff, UTIL_EST_MARGIN))
- goto done;
-
- return;
- }
+ last_ewma_diff = ewma - dequeued;
+ if (last_ewma_diff < UTIL_EST_MARGIN)
+ goto done;

/*
* To avoid overestimation of actual task utilization, skip updates if
* we cannot grant there is idle time in this CPU.
*/
- if (task_util(p) > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
+ if (dequeued > arch_scale_cpu_capacity(cpu_of(rq_of(cfs_rq))))
return;

/*
* To avoid underestimate of task utilization, skip updates of EWMA if
* we cannot grant that thread got all CPU time it wanted.
*/
- if ((ue.enqueued + UTIL_EST_MARGIN) < task_runnable(p))
+ if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
goto done;


@@ -4907,25 +4888,24 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* Update Task's estimated utilization
*
* When *p completes an activation we can consolidate another sample
- * of the task size. This is done by storing the current PELT value
- * as ue.enqueued and by using this value to update the Exponential
- * Weighted Moving Average (EWMA):
+ * of the task size. This is done by using this value to update the
+ * Exponential Weighted Moving Average (EWMA):
*
* ewma(t) = w * task_util(p) + (1-w) * ewma(t-1)
* = w * task_util(p) + ewma(t-1) - w * ewma(t-1)
* = w * (task_util(p) - ewma(t-1)) + ewma(t-1)
- * = w * ( last_ewma_diff ) + ewma(t-1)
- * = w * (last_ewma_diff + ewma(t-1) / w)
+ * = w * ( -last_ewma_diff ) + ewma(t-1)
+ * = w * (-last_ewma_diff + ewma(t-1) / w)
*
* Where 'w' is the weight of new samples, which is configured to be
* 0.25, thus making w=1/4 ( >>= UTIL_EST_WEIGHT_SHIFT)
*/
- ue.ewma <<= UTIL_EST_WEIGHT_SHIFT;
- ue.ewma += last_ewma_diff;
- ue.ewma >>= UTIL_EST_WEIGHT_SHIFT;
+ ewma <<= UTIL_EST_WEIGHT_SHIFT;
+ ewma -= last_ewma_diff;
+ ewma >>= UTIL_EST_WEIGHT_SHIFT;
done:
- ue.enqueued |= UTIL_AVG_UNCHANGED;
- WRITE_ONCE(p->se.avg.util_est, ue);
+ ewma |= UTIL_AVG_UNCHANGED;
+ WRITE_ONCE(p->se.avg.util_est, ewma);

trace_sched_util_est_se_tp(&p->se);
}
@@ -7653,16 +7633,16 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
if (sched_feat(UTIL_EST)) {
unsigned long util_est;

- util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+ util_est = READ_ONCE(cfs_rq->avg.util_est);

/*
* During wake-up @p isn't enqueued yet and doesn't contribute
- * to any cpu_rq(cpu)->cfs.avg.util_est.enqueued.
+ * to any cpu_rq(cpu)->cfs.avg.util_est.
* If @dst_cpu == @cpu add it to "simulate" cpu_util after @p
* has been enqueued.
*
* During exec (@dst_cpu = -1) @p is enqueued and does
- * contribute to cpu_rq(cpu)->cfs.util_est.enqueued.
+ * contribute to cpu_rq(cpu)->cfs.util_est.
* Remove it to "simulate" cpu_util without @p's contribution.
*
* Despite the task_on_rq_queued(@p) check there is still a
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 3a0e0dc..9e10834 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -52,13 +52,13 @@ static inline void cfs_se_util_change(struct sched_avg *avg)
return;

/* Avoid store if the flag has been already reset */
- enqueued = avg->util_est.enqueued;
+ enqueued = avg->util_est;
if (!(enqueued & UTIL_AVG_UNCHANGED))
return;

/* Reset flag to report util_avg has been updated */
enqueued &= ~UTIL_AVG_UNCHANGED;
- WRITE_ONCE(avg->util_est.enqueued, enqueued);
+ WRITE_ONCE(avg->util_est, enqueued);
}

static inline u64 rq_clock_pelt(struct rq *rq)

Subject: [tip: sched/core] sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 7736ae5572eb344c090fbef9621a228e7e3d6276
Gitweb: https://git.kernel.org/tip/7736ae5572eb344c090fbef9621a228e7e3d6276
Author: Vincent Guittot <[email protected]>
AuthorDate: Fri, 01 Dec 2023 17:16:51 +01:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Sat, 23 Dec 2023 15:59:56 +01:00

sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)

sched_feat(UTIL_EST_FASTUP) has been added to easily disable the feature
in order to check for possibly related regressions. After 3 years, it has
never been used and no regression has been reported. Let's remove it
and make fast increase a permanent behavior.

Signed-off-by: Vincent Guittot <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Tested-by: Lukasz Luba <[email protected]>
Reviewed-by: Lukasz Luba <[email protected]>
Reviewed-by: Dietmar Eggemann <[email protected]>
Reviewed-by: Hongyan Xia <[email protected]>
Reviewed-by: Tang Yizhou <[email protected]>
Reviewed-by: Yanteng Si <[email protected]> [for the Chinese translation]
Reviewed-by: Alex Shi <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
Documentation/scheduler/schedutil.rst | 7 +++----
Documentation/translations/zh_CN/scheduler/schedutil.rst | 7 +++----
kernel/sched/fair.c | 8 +++-----
kernel/sched/features.h | 1 -
4 files changed, 9 insertions(+), 14 deletions(-)

diff --git a/Documentation/scheduler/schedutil.rst b/Documentation/scheduler/schedutil.rst
index 32c7d69..803fba8 100644
--- a/Documentation/scheduler/schedutil.rst
+++ b/Documentation/scheduler/schedutil.rst
@@ -90,8 +90,8 @@ For more detail see:
- Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


-UTIL_EST / UTIL_EST_FASTUP
-==========================
+UTIL_EST
+========

Because periodic tasks have their averages decayed while they sleep, even
though when running their expected utilization will be the same, they suffer a
@@ -99,8 +99,7 @@ though when running their expected utilization will be the same, they suffer a

To alleviate this (a default enabled option) UTIL_EST drives an Infinite
Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
-highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR
-filter to instantly increase and only decay on decrease.
+highest. UTIL_EST filters to instantly increase and only decay on decrease.

A further runqueue wide sum (of runnable tasks) is maintained of:

diff --git a/Documentation/translations/zh_CN/scheduler/schedutil.rst b/Documentation/translations/zh_CN/scheduler/schedutil.rst
index d1ea680..7c8d87f 100644
--- a/Documentation/translations/zh_CN/scheduler/schedutil.rst
+++ b/Documentation/translations/zh_CN/scheduler/schedutil.rst
@@ -89,16 +89,15 @@ r_cpu被定义为当前CPU的最高性能水平与系统中任何其它CPU的最
- Documentation/translations/zh_CN/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"


-UTIL_EST / UTIL_EST_FASTUP
-==========================
+UTIL_EST
+========

由于周期性任务的平均数在睡眠时会衰减,而在运行时其预期利用率会和睡眠前相同,
因此它们在再次运行后会面临(DVFS)的上涨。

为了缓解这个问题,(一个默认使能的编译选项)UTIL_EST驱动一个无限脉冲响应
(Infinite Impulse Response,IIR)的EWMA,“运行”值在出队时是最高的。
-另一个默认使能的编译选项UTIL_EST_FASTUP修改了IIR滤波器,使其允许立即增加,
-仅在利用率下降时衰减。
+UTIL_EST滤波使其在遇到更高值时立刻增加,而遇到低值时会缓慢衰减。

进一步,运行队列的(可运行任务的)利用率之和由下式计算:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d5..e94d65d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4870,11 +4870,9 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* to smooth utilization decreases.
*/
ue.enqueued = task_util(p);
- if (sched_feat(UTIL_EST_FASTUP)) {
- if (ue.ewma < ue.enqueued) {
- ue.ewma = ue.enqueued;
- goto done;
- }
+ if (ue.ewma < ue.enqueued) {
+ ue.ewma = ue.enqueued;
+ goto done;
}

/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a3ddf84..143f55d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,7 +83,6 @@ SCHED_FEAT(WA_BIAS, true)
* UtilEstimation. Use estimated CPU utilization.
*/
SCHED_FEAT(UTIL_EST, true)
-SCHED_FEAT(UTIL_EST_FASTUP, true)

SCHED_FEAT(LATENCY_WARN, false)