This is a follow-up to:
https://lkml.org/lkml/2018/4/6/935
where the original patch has been split into three to better address the
different issues discussed in the previous posting.
The first two patches of this series are fixes for:
- FAIR utilization aggregation
based on blocked utilization decay instead of cfs.h_nr_running
- Estimated utilization updates
which need to happen before schedutil is updated at enqueue time
The last patch provides the remaining bits of the original one in a
self-contained refactoring of how we update schedutil for FAIR tasks.
Cheers,
Patrick
Patrick Bellasi (3):
sched/cpufreq: always consider blocked FAIR utilization
sched/fair: util_est: update before schedutil
sched/fair: schedutil: explicit update only when required
kernel/sched/cpufreq_schedutil.c | 17 ++++----
kernel/sched/fair.c | 83 ++++++++++++++++++----------------------
2 files changed, 46 insertions(+), 54 deletions(-)
--
2.15.1
When a task is enqueued, the estimated utilization of a CPU is updated
to better support the selection of the required frequency.
However, schedutil is (implicitly) updated by update_load_avg() which
always happens before util_est_{en,de}queue(), thus potentially
introducing a latency between estimated utilization updates and
frequency selections.
Let's update util_est at the beginning of enqueue_task_fair(), which
ensures that all schedutil updates see the most up-to-date estimated
utilization value for a CPU.
Reported-by: Vincent Guittot <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/fair.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f6a23a5b451..01dfc47541e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5356,6 +5356,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+ /* Estimated utilization must be updated before schedutil */
+ util_est_enqueue(&rq->cfs, p);
+
/*
* If in_iowait is set, the code below may not trigger any cpufreq
* utilization updates, so do it here explicitly with the IOWAIT flag
@@ -5397,7 +5400,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!se)
add_nr_running(rq, 1);
- util_est_enqueue(&rq->cfs, p);
hrtick_update(rq);
}
--
2.15.1
Schedutil updates for FAIR tasks are triggered implicitly each time a
cfs_rq's utilization is updated via cfs_rq_util_change(), currently
called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
changed, and {attach,detach}_entity_load_avg().
This design is based on the idea that "we should callback schedutil
frequently enough" to properly update the CPU frequency at every
utilization change. However, such an integration strategy also has
some downsides:
- schedutil updates are triggered by RQ load updates, which makes
sense in general but does not allow us to know exactly which other
RQ-related information has been updated.
Recently, for example, we had issues due to schedutil dependencies on
cfs_rq->h_nr_running and estimated utilization updates.
- cfs_rq_util_change() is mainly a wrapper function for an already
existing "public API", cpufreq_update_util(), which is required
just to ensure we actually update schedutil only when we are updating
a root cfs_rq.
Thus, especially when task groups are in use, most of the calls to
this wrapper function are not required.
- the usage of a wrapper function is not completely consistent across
fair.c, since we could still need additional explicit calls to
cpufreq_update_util().
For example, this already happens to report the IOWAIT boost flag in
the wakeup path.
- it makes it hard to integrate new features, since it can require
changing other function prototypes just to pass in an additional
flag, as happened for example in commit:
ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
All the above considered, let's make schedutil updates more explicit in
fair.c by removing the cfs_rq_util_change() wrapper function in favour
of the existing cpufreq_update_util() public API.
This can be done by calling cpufreq_update_util() explicitly in the few
call sites where it really makes sense and when all the (potentially)
required cfs_rq's information have been updated.
This patch mainly removes code and adds explicit schedutil updates
only when we:
- {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
- (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
- task_tick_fair() to update the utilization of the root cfs_rq
All the other code paths, currently _indirectly_ covered by a call to
update_load_avg(), are still covered. Indeed, some paths already imply
enqueue/dequeue calls:
- switch_{to,from}_fair()
- sched_move_task()
while others are followed by enqueue/dequeue calls:
- cpu_cgroup_fork() and
post_init_entity_util_avg():
are used at wakeup_new_task() time and thus already followed by an
enqueue_task_fair()
- migrate_task_rq_fair():
updates the removed utilization but not the actual cfs_rq
utilization, which is updated by a following sched event
This new proposal also allows better aggregation of schedutil-related
flags, which are required only at enqueue_task_fair() time.
IOWAIT and MIGRATION flags are now requested only when a task is
actually visible at the root cfs_rq level.
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
NOTE: this patch changes the behavior of the IOWAIT flag: when a task
wakes up on a throttled RQ we no longer assert the flag to schedutil.
However, this seems to make sense since the task will not be running
anyway.
---
kernel/sched/fair.c | 81 ++++++++++++++++++++++++-----------------------------
1 file changed, 36 insertions(+), 45 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01dfc47541e6..87f092151a6e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
* For !fair tasks do:
*
update_cfs_rq_load_avg(now, cfs_rq);
- attach_entity_load_avg(cfs_rq, se, 0);
+ attach_entity_load_avg(cfs_rq, se);
switched_from_fair(rq, p);
*
* such that the next switched_to_fair() has the
@@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)
}
#endif /* CONFIG_FAIR_GROUP_SCHED */
-static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
-{
- struct rq *rq = rq_of(cfs_rq);
-
- if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
- /*
- * There are a few boundary cases this might miss but it should
- * get called often enough that that should (hopefully) not be
- * a real problem.
- *
- * It will not get called when we go idle, because the idle
- * thread is a different class (!fair), nor will the utilization
- * number include things like RT tasks.
- *
- * As is, the util number is not freq-invariant (we'd have to
- * implement arch_scale_freq_capacity() for that).
- *
- * See cpu_util().
- */
- cpufreq_update_util(rq, flags);
- }
-}
-
#ifdef CONFIG_SMP
/*
* Approximate:
@@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
cfs_rq->load_last_update_time_copy = sa->last_update_time;
#endif
- if (decayed)
- cfs_rq_util_change(cfs_rq, 0);
-
return decayed;
}
@@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
* Must call update_cfs_rq_load_avg() before this, since we rely on
* cfs_rq->avg.last_update_time being current.
*/
-static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
@@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
- cfs_rq_util_change(cfs_rq, flags);
}
/**
@@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
- cfs_rq_util_change(cfs_rq, 0);
}
/*
@@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
*
* IOW we're enqueueing a task on a new CPU.
*/
- attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
+ attach_entity_load_avg(cfs_rq, se);
update_tg_load_avg(cfs_rq, 0);
} else if (decayed && (flags & UPDATE_TG))
@@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
{
- cfs_rq_util_change(cfs_rq, 0);
}
static inline void remove_entity_load_avg(struct sched_entity *se) {}
static inline void
-attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
+attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
static inline void
detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
@@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
dequeue = 0;
}
- if (!se)
+ /* The tasks are no longer visible from the root cfs_rq */
+ if (!se) {
sub_nr_running(rq, task_delta);
+ cpufreq_update_util(rq, 0);
+ }
cfs_rq->throttled = 1;
cfs_rq->throttled_clock = rq_clock(rq);
@@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
break;
}
- if (!se)
+ /* The tasks are now visible from the root cfs_rq */
+ if (!se) {
add_nr_running(rq, task_delta);
+ cpufreq_update_util(rq, 0);
+ }
/* Determine whether we need to wake up potentially idle CPU: */
if (rq->curr == rq->idle && rq->cfs.nr_running)
@@ -5359,14 +5336,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
/* Estimated utilization must be updated before schedutil */
util_est_enqueue(&rq->cfs, p);
- /*
- * If in_iowait is set, the code below may not trigger any cpufreq
- * utilization updates, so do it here explicitly with the IOWAIT flag
- * passed.
- */
- if (p->in_iowait)
- cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
-
for_each_sched_entity(se) {
if (se->on_rq)
break;
@@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}
- if (!se)
+ /* The task is visible from the root cfs_rq */
+ if (!se) {
+ unsigned int flags = 0;
+
add_nr_running(rq, 1);
+ if (p->in_iowait)
+ flags |= SCHED_CPUFREQ_IOWAIT;
+
+ /*
+ * !last_update_time means we've passed through
+ * migrate_task_rq_fair() indicating we migrated.
+ *
+ * IOW we're enqueueing a task on a new CPU.
+ */
+ if (!p->se.avg.last_update_time)
+ flags |= SCHED_CPUFREQ_MIGRATION;
+
+ cpufreq_update_util(rq, flags);
+ }
+
hrtick_update(rq);
}
@@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
update_cfs_group(se);
}
+ /* The task is no longer visible from the root cfs_rq */
if (!se)
sub_nr_running(rq, 1);
util_est_dequeue(&rq->cfs, p, task_sleep);
+ cpufreq_update_util(rq, 0);
hrtick_update(rq);
}
@@ -9945,6 +9934,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
+
+ cpufreq_update_util(rq, 0);
}
/*
@@ -10082,7 +10073,7 @@ static void attach_entity_cfs_rq(struct sched_entity *se)
/* Synchronize entity with its cfs_rq */
update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
- attach_entity_load_avg(cfs_rq, se, 0);
+ attach_entity_load_avg(cfs_rq, se);
update_tg_load_avg(cfs_rq, false);
propagate_entity_cfs_rq(se);
}
--
2.15.1
Since the refactoring introduced by:
commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
we aggregate FAIR utilization only if this class has runnable tasks.
This was mainly to avoid the risk of staying at a high frequency just
because the blocked utilization of a CPU was not being properly
decayed while the CPU was idle.
However, since:
commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
the FAIR blocked utilization is properly decayed also for IDLE CPUs.
This allows us to use the FAIR blocked utilization as a safe mechanism
to gracefully reduce the frequency only if no FAIR tasks show up on a
CPU for a reasonable period of time.
Moreover, we also reduce the frequency drops of CPUs running periodic
tasks which, depending on the task periodicity and the time required
for a frequency switch, were increasing the chances of introducing
undesirable performance variations.
Reported-by: Vincent Guittot <[email protected]>
Signed-off-by: Patrick Bellasi <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Viresh Kumar <[email protected]>
Cc: Joel Fernandes <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
kernel/sched/cpufreq_schedutil.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index d2c6083304b4..a74d05160e66 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -183,22 +183,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
{
struct rq *rq = cpu_rq(sg_cpu->cpu);
- unsigned long util;
- if (rq->rt.rt_nr_running) {
- util = sg_cpu->max;
- } else {
- util = sg_cpu->util_dl;
- if (rq->cfs.h_nr_running)
- util += sg_cpu->util_cfs;
- }
+ if (rq->rt.rt_nr_running)
+ return sg_cpu->max;
/*
+ * Utilization required by DEADLINE must always be granted while, for
+ * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
+ * gracefully reduce the frequency when no tasks show up for longer
+ * periods of time.
+ *
* Ideally we would like to set util_dl as min/guaranteed freq and
* util_cfs + util_dl as requested freq. However, cpufreq is not yet
* ready for such an interface. So, we only do the latter for now.
*/
- return min(util, sg_cpu->max);
+ return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
}
static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
--
2.15.1
On Thu, May 10, 2018 at 04:05:52PM +0100, Patrick Bellasi wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1f6a23a5b451..01dfc47541e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5356,6 +5356,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
>
> + /* Estimated utilization must be updated before schedutil */
> + util_est_enqueue(&rq->cfs, p);
Please clarify the comment with a reason; because I'm sure that the next
time I see that comment, I'll go 'but why again!?'.
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> * utilization updates, so do it here explicitly with the IOWAIT flag
On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> All the above considered, let's make schedutil updates more explicit in
> fair.c by removing the cfs_rq_util_change() wrapper function in favour
> of the existing cpufreq_update_util() public API.
> This can be done by calling cpufreq_update_util() explicitly in the few
> call sites where it really makes sense and when all the (potentially)
> required cfs_rq's information have been updated.
Aside from having to redraw my ever stale diagrams _again_ I don't think
I object too much here. As you write tracking the exact point where we
did do the update was fairly tedious.
> @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_cfs_group(se);
> }
>
> + /* The task is visible from the root cfs_rq */
> + if (!se) {
> + unsigned int flags = 0;
That one shadows the @flags argument. Some checker is bound to complain
about it.
> +
> add_nr_running(rq, 1);
>
> + if (p->in_iowait)
> + flags |= SCHED_CPUFREQ_IOWAIT;
> +
> + /*
> + * !last_update_time means we've passed through
> + * migrate_task_rq_fair() indicating we migrated.
> + *
> + * IOW we're enqueueing a task on a new CPU.
> + */
> + if (!p->se.avg.last_update_time)
> + flags |= SCHED_CPUFREQ_MIGRATION;
> +
> + cpufreq_update_util(rq, flags);
> + }
> +
> hrtick_update(rq);
> }
On 10-May 18:15, Peter Zijlstra wrote:
> On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> > All the above considered, let's make schedutil updates more explicit in
> > fair.c by removing the cfs_rq_util_change() wrapper function in favour
> > of the existing cpufreq_update_util() public API.
> > This can be done by calling cpufreq_update_util() explicitly in the few
> > call sites where it really makes sense and when all the (potentially)
> > required cfs_rq's information have been updated.
>
> Aside from having to redraw my ever stale diagrams _again_ I don't think
> I object too much here. As you write tracking the exact point where we
> did do the update was fairly tedious.
Maybe we can simplify these diagrams even further if we get to the
point where schedutil updates are triggered just from core.c...
This patch could be a first step in that direction; I'm not entirely
sure we can get there... but having some well-defined callbacks may
make it easier to factor the updates up into the core scheduler.
> > @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > update_cfs_group(se);
> > }
> >
> > + /* The task is visible from the root cfs_rq */
> > + if (!se) {
> > + unsigned int flags = 0;
>
> That one shadows the @flags argument. Some checker is bound to complain
> about it.
Oops... right... I'll s/flags/sc_flags/, with "sc_" standing for SCHED_CPUFREQ_.
> > +
> > add_nr_running(rq, 1);
> >
> > + if (p->in_iowait)
> > + flags |= SCHED_CPUFREQ_IOWAIT;
> > +
> > + /*
> > + * !last_update_time means we've passed through
> > + * migrate_task_rq_fair() indicating we migrated.
> > + *
> > + * IOW we're enqueueing a task on a new CPU.
> > + */
> > + if (!p->se.avg.last_update_time)
> > + flags |= SCHED_CPUFREQ_MIGRATION;
> > +
> > + cpufreq_update_util(rq, flags);
> > + }
> > +
> > hrtick_update(rq);
> > }
--
#include <best/regards.h>
Patrick Bellasi
On 10-05-18, 16:05, Patrick Bellasi wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
>
> @@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, flags);
Was leaving a blank line at the end of the routine intentional?
> }
>
> /**
> @@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, 0);
Here too.
> }
--
viresh
On 10-05-18, 16:05, Patrick Bellasi wrote:
> When a task is enqueue the estimated utilization of a CPU is updated
> to better support the selection of the required frequency.
> However, schedutil is (implicitly) updated by update_load_avg() which
> always happens before util_est_{en,de}queue(), thus potentially
> introducing a latency between estimated utilization updates and
> frequency selections.
>
> Let's update util_est at the beginning of enqueue_task_fair(),
> which will ensure that all schedutil updates will see the most
> updated estimated utilization value for a CPU.
>
> Reported-by: Vincent Guittot <[email protected]>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
Fixes and Stable ?
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1f6a23a5b451..01dfc47541e6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5356,6 +5356,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> struct cfs_rq *cfs_rq;
> struct sched_entity *se = &p->se;
>
> + /* Estimated utilization must be updated before schedutil */
> + util_est_enqueue(&rq->cfs, p);
> +
> /*
> * If in_iowait is set, the code below may not trigger any cpufreq
> * utilization updates, so do it here explicitly with the IOWAIT flag
> @@ -5397,7 +5400,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (!se)
> add_nr_running(rq, 1);
>
> - util_est_enqueue(&rq->cfs, p);
> hrtick_update(rq);
> }
Acked-by: Viresh Kumar <[email protected]>
--
viresh
On 10-05-18, 16:05, Patrick Bellasi wrote:
> Since the refactoring introduced by:
>
> commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
>
> we aggregate FAIR utilization only if this class has runnable tasks.
> This was mainly due to avoid the risk to stay on an high frequency just
> because of the blocked utilization of a CPU not being properly decayed
> while the CPU was idle.
>
> However, since:
>
> commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>
> the FAIR blocked utilization is properly decayed also for IDLE CPUs.
>
> This allows us to use the FAIR blocked utilization as a safe mechanism
> to gracefully reduce the frequency only if no FAIR tasks show up on a
> CPU for a reasonable period of time.
>
> Moreover, we also reduce the frequency drops of CPUs running periodic
> tasks which, depending on the task periodicity and the time required
> for a frequency switch, was increasing the chances to introduce some
> undesirable performance variations.
>
> Reported-by: Vincent Guittot <[email protected]>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> ---
> kernel/sched/cpufreq_schedutil.c | 17 ++++++++---------
> 1 file changed, 8 insertions(+), 9 deletions(-)
Do we need a Fixes tag and Cc stable ?
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index d2c6083304b4..a74d05160e66 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -183,22 +183,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> {
> struct rq *rq = cpu_rq(sg_cpu->cpu);
> - unsigned long util;
>
> - if (rq->rt.rt_nr_running) {
> - util = sg_cpu->max;
> - } else {
> - util = sg_cpu->util_dl;
> - if (rq->cfs.h_nr_running)
> - util += sg_cpu->util_cfs;
> - }
> + if (rq->rt.rt_nr_running)
> + return sg_cpu->max;
>
> /*
> + * Utilization required by DEADLINE must always be granted while, for
> + * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
> + * gracefully reduce the frequency when no tasks show up for longer
> + * periods of time.
> + *
> * Ideally we would like to set util_dl as min/guaranteed freq and
> * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> * ready for such an interface. So, we only do the latter for now.
> */
> - return min(util, sg_cpu->max);
> + return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> }
>
> static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
Acked-by: Viresh Kumar <[email protected]>
--
viresh
On 11-May 11:14, Viresh Kumar wrote:
> On 10-05-18, 16:05, Patrick Bellasi wrote:
> > When a task is enqueue the estimated utilization of a CPU is updated
> > to better support the selection of the required frequency.
> > However, schedutil is (implicitly) updated by update_load_avg() which
> > always happens before util_est_{en,de}queue(), thus potentially
> > introducing a latency between estimated utilization updates and
> > frequency selections.
> >
> > Let's update util_est at the beginning of enqueue_task_fair(),
> > which will ensure that all schedutil updates will see the most
> > updated estimated utilization value for a CPU.
> >
> > Reported-by: Vincent Guittot <[email protected]>
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > kernel/sched/fair.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
>
> Fixes and Stable ?
Good idea, will add them.
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 1f6a23a5b451..01dfc47541e6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5356,6 +5356,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > struct cfs_rq *cfs_rq;
> > struct sched_entity *se = &p->se;
> >
> > + /* Estimated utilization must be updated before schedutil */
> > + util_est_enqueue(&rq->cfs, p);
> > +
> > /*
> > * If in_iowait is set, the code below may not trigger any cpufreq
> > * utilization updates, so do it here explicitly with the IOWAIT flag
> > @@ -5397,7 +5400,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > if (!se)
> > add_nr_running(rq, 1);
> >
> > - util_est_enqueue(&rq->cfs, p);
> > hrtick_update(rq);
> > }
>
> Acked-by: Viresh Kumar <[email protected]>
>
> --
> viresh
--
#include <best/regards.h>
Patrick Bellasi
On 11-May 11:13, Viresh Kumar wrote:
> On 10-05-18, 16:05, Patrick Bellasi wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
> >
> > @@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
> >
> > - cfs_rq_util_change(cfs_rq, flags);
>
> Was leaving a blank line at the end of routine intentional ?
No, good catch. Will fix this, thanks!
>
> > }
> >
> > /**
> > @@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >
> > add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
> >
> > - cfs_rq_util_change(cfs_rq, 0);
>
> Here too.
>
> > }
> --
> viresh
--
#include <best/regards.h>
Patrick Bellasi
On 11-May 11:14, Viresh Kumar wrote:
> On 10-05-18, 16:05, Patrick Bellasi wrote:
> > Since the refactoring introduced by:
> >
> > commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")
> >
> > we aggregate FAIR utilization only if this class has runnable tasks.
> > This was mainly due to avoid the risk to stay on an high frequency just
> > because of the blocked utilization of a CPU not being properly decayed
> > while the CPU was idle.
> >
> > However, since:
> >
> > commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
> >
> > the FAIR blocked utilization is properly decayed also for IDLE CPUs.
> >
> > This allows us to use the FAIR blocked utilization as a safe mechanism
> > to gracefully reduce the frequency only if no FAIR tasks show up on a
> > CPU for a reasonable period of time.
> >
> > Moreover, we also reduce the frequency drops of CPUs running periodic
> > tasks which, depending on the task periodicity and the time required
> > for a frequency switch, was increasing the chances to introduce some
> > undesirable performance variations.
> >
> > Reported-by: Vincent Guittot <[email protected]>
> > Signed-off-by: Patrick Bellasi <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Rafael J. Wysocki <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Cc: Viresh Kumar <[email protected]>
> > Cc: Joel Fernandes <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > kernel/sched/cpufreq_schedutil.c | 17 ++++++++---------
> > 1 file changed, 8 insertions(+), 9 deletions(-)
>
> Do we need a Fixes tag and Cc stable ?
Mmm... not sure, I would say that's not a fix.
As I say in the changelog above, 8f111bc357aa was doing the correct
thing but, since Vincent's recent commit 31e77c93e432, this is an
update worth having, since now we can trust the decay of blocked
utilization.
Regarding stable, well... if Vincent's patches are not going to be
considered for stable, then we should not consider this one either,
should we?
> >
> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > index d2c6083304b4..a74d05160e66 100644
> > --- a/kernel/sched/cpufreq_schedutil.c
> > +++ b/kernel/sched/cpufreq_schedutil.c
> > @@ -183,22 +183,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
> > {
> > struct rq *rq = cpu_rq(sg_cpu->cpu);
> > - unsigned long util;
> >
> > - if (rq->rt.rt_nr_running) {
> > - util = sg_cpu->max;
> > - } else {
> > - util = sg_cpu->util_dl;
> > - if (rq->cfs.h_nr_running)
> > - util += sg_cpu->util_cfs;
> > - }
> > + if (rq->rt.rt_nr_running)
> > + return sg_cpu->max;
> >
> > /*
> > + * Utilization required by DEADLINE must always be granted while, for
> > + * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
> > + * gracefully reduce the frequency when no tasks show up for longer
> > + * periods of time.
> > + *
> > * Ideally we would like to set util_dl as min/guaranteed freq and
> > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
> > * ready for such an interface. So, we only do the latter for now.
> > */
> > - return min(util, sg_cpu->max);
> > + return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
> > }
> >
> > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
>
> Acked-by: Viresh Kumar <[email protected]>
>
> --
> viresh
--
#include <best/regards.h>
Patrick Bellasi
On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> Schedutil updates for FAIR tasks are triggered implicitly each time a
> cfs_rq's utilization is updated via cfs_rq_util_change(), currently
> called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
> changed, and {attach,detach}_entity_load_avg().
>
> This design is based on the idea that "we should callback schedutil
> frequently enough" to properly update the CPU frequency at every
> utilization change. However, such an integration strategy has also
> some downsides:
Hi Patrick,
I agree making the call explicit would make schedutil integration
easier, so that's really awesome. However, I also fear that if some
path in the fair class changes the utilization in the future but
forgets to update schedutil explicitly (because it forgot to call the
explicit public API), then the schedutil update wouldn't go through.
In this case, the previous design of doing the schedutil update in
the wrapper was kind of a nice safety net to have.
Just thinking out loud, but is there a way you could make the
implicit call anyway in case the explicit call wasn't requested for
some reason? That's probably hard to do correctly though...
Some more comments below:
>
> - schedutil updates are triggered by RQ's load updates, which makes
> sense in general but it does not allow to know exactly which other RQ
> related information have been updated.
> Recently, for example, we had issues due to schedutil dependencies on
> cfs_rq->h_nr_running and estimated utilization updates.
>
> - cfs_rq_util_change() is mainly a wrapper function for an already
> existing "public API", cpufreq_update_util(), which is required
> just to ensure we actually update schedutil only when we are updating
> a root cfs_rq.
> Thus, especially when task groups are in use, most of the calls to
> this wrapper function are not required.
>
> - the usage of a wrapper function is not completely consistent across
> fair.c, since we could still need additional explicit calls to
> cpufreq_update_util().
> For example this already happens to report the IOWAIT boost flag in
> the wakeup path.
>
> - it makes it hard to integrate new features since it could require to
> change other function prototypes just to pass in an additional flag,
> as it happened for example in commit:
>
> ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
>
> All the above considered, let's make schedutil updates more explicit in
> fair.c by removing the cfs_rq_util_change() wrapper function in favour
> of the existing cpufreq_update_util() public API.
> This can be done by calling cpufreq_update_util() explicitly in the few
> call sites where it really makes sense and when all the (potentially)
> required cfs_rq's information have been updated.
>
> This patch mainly removes code and adds explicit schedutil updates
> only when we:
> - {enqueue,dequeue}_task_fair() a task to/from the root cfs_rq
> - (un)throttle_cfs_rq() a set of tasks up to the root cfs_rq
> - task_tick_fair() to update the utilization of the root cfs_rq
>
> All the other code paths, currently _indirectly_ covered by a call to
> update_load_avg(), are still covered. Indeed, some paths already imply
> enqueue/dequeue calls:
> - switch_{to,from}_fair()
> - sched_move_task()
> while others are followed by enqueue/dequeue calls:
> - cpu_cgroup_fork() and
> post_init_entity_util_avg():
> are used at wakeup_new_task() time and thus already followed by an
> enqueue_task_fair()
> - migrate_task_rq_fair():
> updates the removed utilization but not the actual cfs_rq
> utilization, which is updated by a following sched event
>
> This new proposal allows also to better aggregate schedutil related
> flags, which are required only at enqueue_task_fair() time.
> IOWAIT and MIGRATION flags are now requested only when a task is
> actually visible at the root cfs_rq level.
>
> Signed-off-by: Patrick Bellasi <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: Viresh Kumar <[email protected]>
> Cc: Joel Fernandes <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
>
> ---
>
> NOTE: this patch changes the behavior of the IOWAIT flag: in case of a
> task waking up on a throttled RQ we do not assert the flag to schedutil
> anymore. However, this seems to make sense since the task will not be
> running anyway.
> ---
> kernel/sched/fair.c | 81 ++++++++++++++++++++++++-----------------------------
> 1 file changed, 36 insertions(+), 45 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01dfc47541e6..87f092151a6e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -772,7 +772,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> * For !fair tasks do:
> *
> update_cfs_rq_load_avg(now, cfs_rq);
> - attach_entity_load_avg(cfs_rq, se, 0);
> + attach_entity_load_avg(cfs_rq, se);
> switched_from_fair(rq, p);
> *
> * such that the next switched_to_fair() has the
> @@ -3009,29 +3009,6 @@ static inline void update_cfs_group(struct sched_entity *se)
> }
> #endif /* CONFIG_FAIR_GROUP_SCHED */
>
> -static inline void cfs_rq_util_change(struct cfs_rq *cfs_rq, int flags)
> -{
> - struct rq *rq = rq_of(cfs_rq);
> -
> - if (&rq->cfs == cfs_rq || (flags & SCHED_CPUFREQ_MIGRATION)) {
> - /*
> - * There are a few boundary cases this might miss but it should
> - * get called often enough that that should (hopefully) not be
> - * a real problem.
> - *
> - * It will not get called when we go idle, because the idle
> - * thread is a different class (!fair), nor will the utilization
> - * number include things like RT tasks.
> - *
> - * As is, the util number is not freq-invariant (we'd have to
> - * implement arch_scale_freq_capacity() for that).
> - *
> - * See cpu_util().
> - */
> - cpufreq_update_util(rq, flags);
> - }
> -}
> -
> #ifdef CONFIG_SMP
> /*
> * Approximate:
> @@ -3712,9 +3689,6 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> cfs_rq->load_last_update_time_copy = sa->last_update_time;
> #endif
>
> - if (decayed)
> - cfs_rq_util_change(cfs_rq, 0);
> -
> return decayed;
> }
>
> @@ -3726,7 +3700,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> * Must call update_cfs_rq_load_avg() before this, since we rely on
> * cfs_rq->avg.last_update_time being current.
> */
> -static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> +static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> u32 divider = LOAD_AVG_MAX - 1024 + cfs_rq->avg.period_contrib;
>
> @@ -3762,7 +3736,6 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, flags);
> }
>
> /**
> @@ -3781,7 +3754,6 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>
> add_tg_cfs_propagate(cfs_rq, -se->avg.load_sum);
>
> - cfs_rq_util_change(cfs_rq, 0);
> }
>
> /*
> @@ -3818,7 +3790,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> *
> * IOW we're enqueueing a task on a new CPU.
> */
> - attach_entity_load_avg(cfs_rq, se, SCHED_CPUFREQ_MIGRATION);
> + attach_entity_load_avg(cfs_rq, se);
> update_tg_load_avg(cfs_rq, 0);
>
> } else if (decayed && (flags & UPDATE_TG))
> @@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> {
> - cfs_rq_util_change(cfs_rq, 0);
How about killing that extra line by doing:
static inline void update_load_avg(struct cfs_rq *cfs_rq,
struct sched_entity *se, int not_used1) {}
> }
>
> static inline void remove_entity_load_avg(struct sched_entity *se) {}
>
> static inline void
> -attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags) {}
> +attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
> static inline void
> detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se) {}
>
> @@ -4762,8 +4733,11 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
> dequeue = 0;
> }
>
> - if (!se)
> + /* The tasks are no more visible from the root cfs_rq */
> + if (!se) {
> sub_nr_running(rq, task_delta);
> + cpufreq_update_util(rq, 0);
> + }
>
> cfs_rq->throttled = 1;
> cfs_rq->throttled_clock = rq_clock(rq);
> @@ -4825,8 +4799,11 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> break;
> }
>
> - if (!se)
> + /* The tasks are now visible from the root cfs_rq */
> + if (!se) {
> add_nr_running(rq, task_delta);
> + cpufreq_update_util(rq, 0);
> + }
>
> /* Determine whether we need to wake up potentially idle CPU: */
> if (rq->curr == rq->idle && rq->cfs.nr_running)
> @@ -5359,14 +5336,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> /* Estimated utilization must be updated before schedutil */
> util_est_enqueue(&rq->cfs, p);
>
> - /*
> - * If in_iowait is set, the code below may not trigger any cpufreq
> - * utilization updates, so do it here explicitly with the IOWAIT flag
> - * passed.
> - */
> - if (p->in_iowait)
> - cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
> -
> for_each_sched_entity(se) {
> if (se->on_rq)
> break;
> @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_cfs_group(se);
> }
>
> - if (!se)
> + /* The task is visible from the root cfs_rq */
> + if (!se) {
> + unsigned int flags = 0;
> +
> add_nr_running(rq, 1);
>
> + if (p->in_iowait)
> + flags |= SCHED_CPUFREQ_IOWAIT;
> +
> + /*
> + * !last_update_time means we've passed through
> + * migrate_task_rq_fair() indicating we migrated.
> + *
> + * IOW we're enqueueing a task on a new CPU.
> + */
> + if (!p->se.avg.last_update_time)
> + flags |= SCHED_CPUFREQ_MIGRATION;
> +
> + cpufreq_update_util(rq, flags);
> + }
> +
> hrtick_update(rq);
> }
>
> @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> update_cfs_group(se);
> }
>
> + /* The task is no more visible from the root cfs_rq */
> if (!se)
> sub_nr_running(rq, 1);
>
> util_est_dequeue(&rq->cfs, p, task_sleep);
> + cpufreq_update_util(rq, 0);
One question about this change. In enqueue, throttle and unthrottle you are
conditionally calling cpufreq_update_util depending on whether the task was
visible/not-visible in the hierarchy.
But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
Is this because of util_est or something? Could you add a comment here
explaining why this is so?
thanks,
- Joel
On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> > Schedutil updates for FAIR tasks are triggered implicitly each time a
> > cfs_rq's utilization is updated via cfs_rq_util_change(), currently
> > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
> > changed, and {attach,detach}_entity_load_avg().
> >
> > This design is based on the idea that "we should callback schedutil
> > frequently enough" to properly update the CPU frequency at every
> > utilization change. However, such an integration strategy has also
> > some downsides:
>
> [...]
>
> > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > update_cfs_group(se);
> > }
> >
> > + /* The task is no more visible from the root cfs_rq */
> > if (!se)
> > sub_nr_running(rq, 1);
> >
> > util_est_dequeue(&rq->cfs, p, task_sleep);
> > + cpufreq_update_util(rq, 0);
>
> One question about this change. In enqueue, throttle and unthrottle - you are
> conditionally calling cpufreq_update_util depending on whether the task was
> visible/not-visible in the hierarchy.
>
> But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
> Is this because of util_est or something? Could you add a comment here
> explaining why this is so?
The big question I have is: in case se != NULL, the task is still visible at
the root RQ level. In that case should we still call util_est_dequeue() and
cpufreq_update_util()?
Sorry if I missed something obvious.
thanks!
- Joel
On 11 May 2018 at 11:12, Patrick Bellasi <[email protected]> wrote:
>>
>> Do we need a Fixes tag and Cc stable ?
>
> Mmm... not sure, I would say that's not a fix.
>
> As I say in the changelog above, 8f111bc357aa was doing the correct
> thing but, since Vincent's recent commit 31e77c93e432, this is an
> update worth having, since now we can trust the decay of blocked
> utilization.
>
> Regarding stable, well... if Vincent patches are not going to be
> considered for stable, then we should not consider this too, do we?
commit 31e77c93e432 is not for stable so this patch should not go either
>
>> >
>> > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
>> > index d2c6083304b4..a74d05160e66 100644
>> > --- a/kernel/sched/cpufreq_schedutil.c
>> > +++ b/kernel/sched/cpufreq_schedutil.c
>> > @@ -183,22 +183,21 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu)
>> > static unsigned long sugov_aggregate_util(struct sugov_cpu *sg_cpu)
>> > {
>> > struct rq *rq = cpu_rq(sg_cpu->cpu);
>> > - unsigned long util;
>> >
>> > - if (rq->rt.rt_nr_running) {
>> > - util = sg_cpu->max;
>> > - } else {
>> > - util = sg_cpu->util_dl;
>> > - if (rq->cfs.h_nr_running)
>> > - util += sg_cpu->util_cfs;
>> > - }
>> > + if (rq->rt.rt_nr_running)
>> > + return sg_cpu->max;
>> >
>> > /*
>> > + * Utilization required by DEADLINE must always be granted while, for
>> > + * FAIR, we use blocked utilization of IDLE CPUs as a mechanism to
>> > + * gracefully reduce the frequency when no tasks show up for longer
>> > + * periods of time.
>> > + *
>> > * Ideally we would like to set util_dl as min/guaranteed freq and
>> > * util_cfs + util_dl as requested freq. However, cpufreq is not yet
>> > * ready for such an interface. So, we only do the latter for now.
>> > */
>> > - return min(util, sg_cpu->max);
>> > + return min(sg_cpu->max, (sg_cpu->util_dl + sg_cpu->util_cfs));
>> > }
>> >
>> > static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time, unsigned int flags)
>>
>> Acked-by: Viresh Kumar <[email protected]>
>>
>> --
>> viresh
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 12-May 23:25, Joel Fernandes wrote:
> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> > > Schedutil updates for FAIR tasks are triggered implicitly each time a
> > > cfs_rq's utilization is updated via cfs_rq_util_change(), currently
> > > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
> > > changed, and {attach,detach}_entity_load_avg().
> > >
> > > This design is based on the idea that "we should callback schedutil
> > > frequently enough" to properly update the CPU frequency at every
> > > utilization change. However, such an integration strategy has also
> > > some downsides:
> >
> > Hi Patrick,
Hi Joel,
> > I agree making the call explicit would make schedutil integration easier so
> > that's really awesome. However I also fear that if some path in the fair
> > class in the future changes the utilization but forgets to update schedutil
> > explicitly (because they forgot to call the explicit public API) then the
> > schedutil update wouldn't go through. In this case the previous design of
> > doing the schedutil update in the wrapper was a nice safety net.
Right now I cannot see other possible future paths where we would
actually change the utilization signal without eventually calling an
existing API to update schedutil where it makes sense.
What I can see as more likely instead, also because it has already
happened a couple of times, is that code changes in fair.c make us
end up (implicitly) calling schedutil with a wrong utilization value.
Noticing that kind of broken dependency has already proven more
difficult than noticing a utilization update without a corresponding
explicit call of the public API.
> > Just thinking out loud, but is there a way you could make the implicit call
> > anyway in case the explicit call wasn't requested for some reason? That's
> > probably hard to do correctly though..
> >
> > Some more comments below:
> >
[...]
> > >
> > > - it makes it hard to integrate new features since it could require to
> > > change other function prototypes just to pass in an additional flag,
> > > as it happened for example in commit:
> > >
> > > ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
IMHO, the point above is also a good example of how convoluted it is
to add support for one simple new feature because of the current
implicit updates.
[...]
> > > @@ -4028,13 +4000,12 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > >
> > > static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int not_used1)
> > > {
> > > - cfs_rq_util_change(cfs_rq, 0);
> >
> > How about kill that extra line by doing:
> >
> > static inline void update_load_avg(struct cfs_rq *cfs_rq,
> > struct sched_entity *se, int not_used1) {}
> >
> > > }
Right, that could make sense, thanks!
[...]
> > > @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > update_cfs_group(se);
> > > }
> > >
> > > - if (!se)
> > > + /* The task is visible from the root cfs_rq */
> > > + if (!se) {
> > > + unsigned int flags = 0;
> > > +
> > > add_nr_running(rq, 1);
> > >
> > > + if (p->in_iowait)
> > > + flags |= SCHED_CPUFREQ_IOWAIT;
> > > +
> > > + /*
> > > + * !last_update_time means we've passed through
> > > + * migrate_task_rq_fair() indicating we migrated.
> > > + *
> > > + * IOW we're enqueueing a task on a new CPU.
> > > + */
> > > + if (!p->se.avg.last_update_time)
> > > + flags |= SCHED_CPUFREQ_MIGRATION;
> > > +
> > > + cpufreq_update_util(rq, flags);
> > > + }
> > > +
> > > hrtick_update(rq);
> > > }
> > >
> > > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > update_cfs_group(se);
> > > }
> > >
> > > + /* The task is no more visible from the root cfs_rq */
> > > if (!se)
> > > sub_nr_running(rq, 1);
> > >
> > > util_est_dequeue(&rq->cfs, p, task_sleep);
> > > + cpufreq_update_util(rq, 0);
> >
> > One question about this change. In enqueue, throttle and unthrottle - you are
> > conditionally calling cpufreq_update_util depending on whether the task was
> > visible/not-visible in the hierarchy.
> >
> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
> > Is this because of util_est or something? Could you add a comment here
> > explaining why this is so?
>
> The big question I have is incase se != NULL, then its still visible at the
> root RQ level.
My understanding is that you get se != NULL at dequeue time when we are
dequeuing a task from a throttled RQ. Isn't it?
Thus, this means you are dequeuing a throttled task, I guess for
example because of a migration.
However, the point is that a task dequeued from a throttled RQ _is
already_ not visible from the root RQ, because of the sub_nr_running()
done by throttle_cfs_rq().
> In that case should we still call the util_est_dequeue and the
> cpufreq_update_util?
I had a better look at the different code paths and I've possibly come
up with some interesting observations. Let me try to summarize them here.
First of all, we need to distinguish from estimated utilization
updates and schedutil updates, since they respond to two very
different goals.
.:: Estimated utilization updates
=================================
Goal: account for the amount of utilization we are going to
expect on a CPU
At {en,de}queue time, util_est_{en,de}queue() is always
unconditionally called because it tracks the utilization which is
estimated to be generated by all the RUNNABLE tasks.
We do not care about throttled/un-throttled RQ here because the effect
of throttling is already folded into the estimated utilization.
For example, a 100% task placed into a 50% bandwidth-limited TG
will generate a 50% (estimated) utilization. Thus, when the task is
enqueued we can account immediately for that utilization although the
RQ may currently be throttled.
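The 50% bandwidth example above can be put into numbers with a small sketch (hypothetical helper, not kernel code; 1024 is the usual SCHED_CAPACITY_SCALE):

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE	1024

/*
 * A task's long-term utilization is bounded by its group's bandwidth:
 * a task wanting task_duty_pct of a CPU inside a TG capped at
 * tg_bandwidth_pct settles at the smaller of the two, scaled to
 * SCHED_CAPACITY_SCALE. util_est then accounts exactly that value at
 * enqueue time, throttled RQ or not.
 */
static unsigned int settled_util(unsigned int task_duty_pct,
				 unsigned int tg_bandwidth_pct)
{
	unsigned int pct = task_duty_pct < tg_bandwidth_pct ?
			   task_duty_pct : tg_bandwidth_pct;

	return SCHED_CAPACITY_SCALE * pct / 100;
}
```

So the 100% task in the 50% TG contributes about 512 out of 1024, and that is what util_est folds into the root cfs_rq.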
.:: Schedutil updates
=====================
Goal: select a better frequency, if and _when_ required
At enqueue time, if the task is visible at the root RQ then it's
expected to run within a scheduler latency period. Thus, it makes
sense to call schedutil immediately to account for its estimated
utilization and possibly increase the OPP.
If instead the task is enqueued into a throttled RQ, then I'm
skipping the update since the task will not run until the RQ is
actually un-throttled.
HOWEVER, I would say that in general we could skip this last
optimization and always unconditionally update schedutil at enqueue
time considering the fact that the effects of a throttled RQ are
always reflected into the (estimated) utilization of a task.
At dequeue time instead, since we certainly removed some estimated
utilization, I unconditionally updated schedutil.
HOWEVER, I was not considering these two things:
1. for a task going to sleep, we still have its blocked utilization
accounted in the cfs_rq utilization.
2. for a task being migrated, at dequeue time we still have not
removed the task's utilization from the cfs_rq's utilization.
This usually happens later, for example we can have:
move_queued_task()
dequeue_task() --> CFS task dequeued
set_task_cpu() --> schedutil updated
migrate_task_rq_fair()
detach_entity_cfs_rq()
detach_entity_load_avg() --> CFS util removal
enqueue_task()
Moreover, the "CFS util removal" actually affects the cfs_rq only if
we hold the RQ lock, otherwise we know that it's just back-annotated
as "removed" utilization and the actual cfs_rq utilization is fixed up
at the next chance we have the RQ lock.
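The lazy fix-up described above can be sketched roughly like this (hypothetical struct and function names; the kernel's actual mechanism lives around cfs_rq removed-load accounting in update_cfs_rq_load_avg()):

```c
#include <assert.h>

struct cfs_rq_sketch {
	unsigned long util_avg;		/* current cfs_rq utilization */
	unsigned long removed_util;	/* back-annotated, not applied yet */
};

/* Without the RQ lock: only record the departing task's utilization. */
static void remove_util_sketch(struct cfs_rq_sketch *cfs_rq,
			       unsigned long task_util)
{
	cfs_rq->removed_util += task_util;
}

/* Later, with the RQ lock held: fix up the cfs_rq utilization. */
static void flush_removed_sketch(struct cfs_rq_sketch *cfs_rq)
{
	unsigned long r = cfs_rq->removed_util;

	cfs_rq->removed_util = 0;
	cfs_rq->util_avg = cfs_rq->util_avg > r ? cfs_rq->util_avg - r : 0;
}
```

Hence, between the dequeue and the flush, any schedutil update would still see the migrating task's utilization in the cfs_rq, which is why updating at dequeue time cannot reduce the frequency anyway.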
Thus, I would say that in both cases, at dequeue time it does not make
sense to update schedutil since we always see the task's utilization
in the cfs_rq and thus we will not reduce the frequency.
NOTE, this is true independently from the refactoring I'm proposing.
At dequeue time, although we call update_load_avg() on the root RQ,
it does not make sense to update schedutil since we still see either
the blocked utilization of a sleeping task or the not yet removed
utilization of a migrating task. In both cases the risk is to ask for
a higher OPP right when a CPU is about to go idle.
Moreover, it seems that in general we prefer a "conservative" approach
in frequency reduction.
For example it could be harmful to trigger a frequency reduction when
a task is migrating off a CPU, if right after another task should be
instead migrated into the same CPU.
.:: Conclusions
===============
All that considered, I think I've convinced myself that we really need
to notify schedutil only in these cases:
1. enqueue time
because of the changes in estimated utilization and the
possibility to jump straight to a better OPP
2. task tick time
because of the possible ramp-up of the utilization
Another case is related to remote CPUs blocked utilization update,
after the recent Vincent's patches. Currently indeed:
update_blocked_averages()
update_load_avg()
--> update schedutil
and thus, potentially we wake up an IDLE cluster just to reduce its
OPP. If the cluster is in a deep idle state, I'm not entirely sure
this is good from an energy saving standpoint.
However, with the patch I'm proposing we are missing that support,
meaning that an IDLE cluster will get its utilization decayed but we
don't wake it up just to drop its frequency.
Perhaps we had better pass this information to schedutil via a
flag (e.g. SCHED_FREQ_REMOTE_UPDATE) and implement there a policy to
decide if and when it makes sense to drop the OPP. Or otherwise find a
way for the special DL tasks to always run on the lower capacity_orig
CPUs.
> Sorry if I missed something obvious.
Thanks for the question; it has actually triggered a better analysis of
what we have and what we need.
Looking forward to some feedback about the above before posting a new
version of this last patch.
> thanks!
>
> - Joel
--
#include <best/regards.h>
Patrick Bellasi
On 14-May 11:18, Vincent Guittot wrote:
> On 11 May 2018 at 11:12, Patrick Bellasi <[email protected]> wrote:
>
> >>
> >> Do we need a Fixes tag and Cc stable ?
> >
> > Mmm... no sure, I would say that's not a fix.
> >
> > As I say in the changelog above, 8f111bc357aa was doing the correct
> > thing but, since the recent Vincent's commit 31e77c93e432, this is an
> > update worth to have, since now we can trust the decay of blocked
> > utilization.
> >
> > Regarding stable, well... if Vincent patches are not going to be
> > considered for stable, then we should not consider this too, do we?
>
> commit 31e77c93e432 is not for stable so this patch should not go too
Right, will not add stable in cc.
Thanks
--
#include <best/regards.h>
Patrick Bellasi
On 14 May 2018 at 18:32, Patrick Bellasi <[email protected]> wrote:
> On 12-May 23:25, Joel Fernandes wrote:
>> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
>> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
>> > > Schedutil updates for FAIR tasks are triggered implicitly each time a
>> > > cfs_rq's utilization is updated via cfs_rq_util_change(), currently
>> > > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
>> > > changed, and {attach,detach}_entity_load_avg().
>> > >
>> > > This design is based on the idea that "we should callback schedutil
>> > > frequently enough" to properly update the CPU frequency at every
>> > > utilization change. However, such an integration strategy has also
>> > > some downsides:
>> >
>> > Hi Patrick,
>
> Hi Joel,
>
[...]
>> > > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > > update_cfs_group(se);
>> > > }
>> > >
>> > > + /* The task is no more visible from the root cfs_rq */
>> > > if (!se)
>> > > sub_nr_running(rq, 1);
>> > >
>> > > util_est_dequeue(&rq->cfs, p, task_sleep);
>> > > + cpufreq_update_util(rq, 0);
>> >
>> > One question about this change. In enqueue, throttle and unthrottle - you are
>> > conditionally calling cpufreq_update_util incase the task was
>> > visible/not-visible in the hierarchy.
>> >
>> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
>> > Is this because of util_est or something? Could you add a comment here
>> > explaining why this is so?
>>
>> The big question I have is incase se != NULL, then its still visible at the
>> root RQ level.
>
> My understanding it that you get !se at dequeue time when we are
> dequeuing a task from a throttled RQ. Isn't it?
Yes se becomes NULL only when you reach root domain
>
> Thus, this means you are dequeuing a throttled task, I guess for
> example because of a migration.
> However, the point is that a task dequeue from a throttled RQ _is
> already_ not visible from the root RQ, because of the sub_nr_running()
> done by throttle_cfs_rq().
>
>> In that case should we still call the util_est_dequeue and the
>> cpufreq_update_util?
>
> I had a better look at the different code paths and I've possibly come
> up with some interesting observations. Lemme try to resume theme here.
>
> First of all, we need to distinguish from estimated utilization
> updates and schedutil updates, since they respond to two very
> different goals.
>
>
> .:: Estimated utilization updates
> =================================
>
> Goal: account for the amount of utilization we are going to
> expect on a CPU
>
> At {en,de}queue time, util_est_{en,de}queue() is always
> unconditionally called because it tracks the utilization which is
> estimated to be generated by all the RUNNABLE tasks.
>
> We do not care about throttled/un-throttled RQ here because the effect
> of throttling is already folded into the estimated utilization.
>
> For example, a 100% tasks which is placed into a 50% bandwidth
> limited TG will generate a 50% (estimated) utilization. Thus, when the
> task is enqueued we can account immediately for that utilization
> although the RQ can be currently throttled.
>
>
> .:: Schedutil updates
> =====================
>
> Goal: select a better frequency, if and _when_ required
>
> At enqueue time, if the task is visible at the root RQ the it's
> expected to run within a scheduler latency period. Thus, it makes
> sense to call immediately schedutil to account for its estimated
> utilization to possibly increase the OPP.
>
> If instead the task is enqueued into a throttled RQ, then I'm
> skipping the update since the task will not run until the RQ is
> actually un-throttled.
>
> HOWEVER, I would say that in general we could skip this last
> optimization and always unconditionally update schedutil at enqueue
> time considering the fact that the effects of a throttled RQ are
> always reflected into the (estimated) utilization of a task.
I think so too
>
> At dequeue time instead, since we certainly removed some estimated
> utilization, then I unconditionally updated schedutil.
>
> HOWEVER, I was not considering these two things:
>
> 1. for a task going to sleep, we still have its blocked utilization
> accounted in the cfs_rq utilization.
It might still be interesting to reduce the frequency because the
blocked utilization can be lower than its estimated utilization.
>
> 2. for a task being migrated, at dequeue time we still have not
> removed the task's utilization from the cfs_rq's utilization.
> This usually happens later, for example we can have:
>
> move_queued_task()
> dequeue_task() --> CFS task dequeued
> set_task_cpu() --> schedutil updated
> migrate_task_rq_fair()
> detach_entity_cfs_rq()
> detach_entity_load_avg() --> CFS util removal
> enqueue_task()
>
> Moreover, the "CFS util removal" actually affects the cfs_rq only if
> we hold the RQ lock, otherwise we know that it's just back annotated
> as "removed" utilization and the actual cfs_rq utilization is fixed up
> at the next chance we have the RQ lock.
>
> Thus, I would say that in both cases, at dequeue time it does not make
> sense to update schedutil since we always see the task's utilization
> in the cfs_rq and thus we will not reduce the frequency.
Yes, only attach/detach make sense from a utilization PoV, and that's
where we should check for a utilization-driven frequency update.
>
> NOTE, this is true independently from the refactoring I'm proposing.
> At dequeue time, although we call update_load_avg() on the root RQ,
> it does not make sense to update schedutil since we still see either
> the blocked utilization of a sleeping task or the not yet removed
> utilization of a migrating task. In both cases the risk is to ask for
> an higher OPP right when a CPU is going to be IDLE.
We have to take care not to mix the opportunity to update the
frequency, whenever we update the utilization, with the policy that we
want to apply regarding (what we think is) the best time to update the
frequency. Like saying that we should wait a bit longer to make sure
that the current utilization is sustainable because a frequency change
is expensive on the platform (or not).
A task being dequeued is not by itself a reason to skip an update and
a frequency increase; nor a reason to skip a decrease just because we
have only now taken into account some utilization removed by a
previous migration.
The same happens when a task migrates: we don't know whether the
utilization that is about to be migrated will be higher or lower than
the normal update of the utilization (since the last one), nor whether
it will trigger a frequency change.
I see your explanation above like a kind of policy where you want to
balance the cost of a frequency change with the probability that we
will not have to re-update the frequency soon.
I agree that some scheduling events give higher chances of a
sustainable utilization level and we should favor these events when
the frequency change is costly but I'm not sure that we should remove
every other opportunity to adjust the frequency to the current
utilization level when the cost is low or negligible.
Can't we classify the utilization events into some kind of major and
minor changes?
>
> Moreover, it seems that in general we prefer a "conservative" approach
> in frequency reduction.
> For example it could be harmful to trigger a frequency reduction when
> a task is migrating off a CPU, if right after another task should be
> instead migrated into the same CPU.
>
>
> .:: Conclusions
> ===============
>
> All that considered, I think I've convinced myself that we really need
> to notify schedutil only in these cases:
>
> 1. enqueue time
> because of the changes in estimated utilization and the
> possibility to just straight to a better OPP
>
> 2. task tick time
> because of the possible ramp-up of the utilization
>
> Another case is related to remote CPUs blocked utilization update,
> after the recent Vincent's patches. Currently indeed:
>
> update_blocked_averages()
> update_load_avg()
> --> update schedutil
>
> and thus, potentially we wake up an IDLE cluster just to reduce its
> OPP. If the cluster is in a deep idle state, I'm not entirely sure
> this is good from an energy saving standpoint.
> However, with the patch I'm proposing we are missing that support,
> meaning that an IDLE cluster will get its utilization decayed but we
> don't wake it up just to drop its frequency.
So rather than deciding in the scheduler whether we should wake it up
or not, we should give cpufreq a chance to decide whether it wants to
update the frequency, as this decision is somewhat platform specific:
cost of a frequency change, clock topology and shared clocks, voltage
topology ...
Regards,
Vincent
>
> Perhaps we should better pass in this information to schedutil via a
> flag (e.g. SCHED_FREQ_REMOTE_UPDATE) and implement there a policy to
> decide if and when it makes sense to drop the OPP. Or otherwise find a
> way for the special DL tasks to always run on the lower capacity_orig
> CPUs.
>
>> Sorry if I missed something obvious.
>
> Thanks for the question it has actually triggered a better analysis of
> what we have and what we need.
>
> Looking forward to some feedbacks about the above before posting a new
> version of this last patch.
>
>> thanks!
>>
>> - Joel
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 15-May 12:19, Vincent Guittot wrote:
> On 14 May 2018 at 18:32, Patrick Bellasi <[email protected]> wrote:
> > On 12-May 23:25, Joel Fernandes wrote:
> >> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> >> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
[...]
> >> > One question about this change. In enqueue, throttle and unthrottle - you are
> >> > conditionally calling cpufreq_update_util incase the task was
> >> > visible/not-visible in the hierarchy.
> >> >
> >> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
> >> > Is this because of util_est or something? Could you add a comment here
> >> > explaining why this is so?
> >>
> >> The big question I have is incase se != NULL, then its still visible at the
> >> root RQ level.
> >
> > My understanding it that you get !se at dequeue time when we are
> > dequeuing a task from a throttled RQ. Isn't it?
>
> Yes se becomes NULL only when you reach root domain
Right, my point was mainly what I'm saying below: a task removed from a
"throttled" cfs_rq is _already_ not visible from the root cfs_rq since
it has been de-accounted at throttle_cfs_rq() time.
[...]
> > At dequeue time instead, since we certainly removed some estimated
> > utilization, then I unconditionally updated schedutil.
> >
> > HOWEVER, I was not considering these two things:
> >
> > 1. for a task going to sleep, we still have its blocked utilization
> > accounted in the cfs_rq utilization.
>
> It might be still interesting to reduce the frequency because the
> blocked utilization can be lower than its estimated utilization.
Good point, this is the case of a task which, in its last activation,
executed for less time than its previous estimated utilization.
However, it could also very well be the opposite: a task which
executed more than in its past activation. In that case a schedutil
update could trigger a frequency increase.
Thus, the scheduler knows that we are going to sleep: does it really
make sense to send a notification in this case?
To me that's not a policy to choose when it makes sense to change the
frequency, but just the proper definition of when it makes sense to
send a notification.
IMHO we should consider not only (and blindly) the utilization
changes but also what the scheduler knows about the status of a task.
Thus: if the utilization changes while a task is running, it's worth
sending a notification; whereas when a task is done on a CPU and that
CPU is likely going to be idle, maybe we should skip the
notification.
> > 2. for a task being migrated, at dequeue time we still have not
> > removed the task's utilization from the cfs_rq's utilization.
> > This usually happens later, for example we can have:
> >
> > move_queued_task()
> > dequeue_task() --> CFS task dequeued
> > set_task_cpu() --> schedutil updated
> > migrate_task_rq_fair()
> > detach_entity_cfs_rq()
> > detach_entity_load_avg() --> CFS util removal
> > enqueue_task()
> >
> > Moreover, the "CFS util removal" actually affects the cfs_rq only if
> > we hold the RQ lock, otherwise we know that it's just back annotated
> > as "removed" utilization and the actual cfs_rq utilization is fixed up
> > at the next chance we have the RQ lock.
> >
> > Thus, I would say that in both cases, at dequeue time it does not make
> > sense to update schedutil since we always see the task's utilization
> > in the cfs_rq and thus we will not reduce the frequency.
>
> Yes only attach/detach make sense from an utilization pov and that's
> where we should check for a frequency update for utilization
Indeed, I was considering the idea of moving the callbacks there,
which are the only places in the code where we know for sure that some
utilization joined or departed from a CPU.
I still have to check further, however, because these functions can
also be called for non-root cfs_rqs... thus we will again need the
filtering condition we have now in the wrapper function.
> > NOTE, this is true independently from the refactoring I'm proposing.
> > At dequeue time, although we call update_load_avg() on the root RQ,
> > it does not make sense to update schedutil since we still see either
> > the blocked utilization of a sleeping task or the not yet removed
> > utilization of a migrating task. In both cases the risk is to ask for
> > an higher OPP right when a CPU is going to be IDLE.
>
> We have to take care of not mixing the opportunity to update the
> frequency when we are updating the utilization with the policy that we
> want to apply regarding (what we think that is) the best time to
> update the frequency. Like saying that we should wait a bit more to
> make sure that the current utilization is sustainable because a
> frequency change is expensive on the platform (or not)
I completely agree on keeping utilization update notification separated
from schedutil decisions...
> It's not because a task is dequeued that we should not update and
> increase the frequency; Or even that we should not decrease it because
> we have just taken into account some removed utilization of a previous
> migration.
> The same happen when a task migrates, we don't know if the utilization
> that is about to be migrated, will be higher or lower than the normal
> update of the utilization (since the last update) and can not generate
> a frequency change
>
> I see your explanation above like a kind of policy where you want to
> balance the cost of a frequency change with the probability that we
> will not have to re-update the frequency soon.
That was not my thinking. What I wanted to say is just that we should
send notifications when it really makes sense, because we have the most
valuable information to pass.
Thus, notifying schedutil when we update the RQ utilization is a bit
of a greedy approach with respect to the information the scheduler has.
In the migration example above:
- first we update the RQ utilization
- then we actually remove from the RQ the utilization of the migrated
task
If we notify schedutil at the first step we are more likely to pass
already outdated information, since from the scheduler standpoint we
know that we are going to reduce the CPU utilization quite soon.
Thus, would it not be better to defer the notification to detach time?
After all, that's the original goal of this patch.
> I agree that some scheduling events give higher chances of a
> sustainable utilization level and we should favor these events when
> the frequency change is costly but I'm not sure that we should remove
> all other opportunity to udjust the frequency to the current
> utilization level when the cost is low or negligible.
Maybe we can try to run hackbench to quantify the overhead we add with
useless schedutil updates. However, my main concern is that if we
want a proper decoupling between the scheduler and schedutil, then we
also have to ensure that we call back for updates only when it really
makes sense.
Otherwise, the risk is that the schedutil policy will take decisions
based on wrong assumptions like: ok, let's increase the OPP (since I
can now and it's cheap) without knowing that the CPU is instead going
to be almost empty or even IDLE.
> Can't we classify the utilization events into some kind of major and
> minor changes ?
Doesn't a classification itself look more like a policy?
Maybe we can consider it, but still I think we should be able to find
when the scheduler has the most accurate and updated information about
the tasks actually RUNNABLE on a CPU and at that point send a
notification to schedutil.
IMO there are only a few events where the utilization can see big
changes: wakeups (because of util_est) and migrations.
For all the rest, the tick should be a good enough update rate,
considering also that, even at 250 Hz, in 4 ms PELT never builds up
more than ~8%.
[...]
> > .:: Conclusions
> > ===============
> >
> > All that considered, I think I've convinced myself that we really need
> > to notify schedutil only in these cases:
> >
> > 1. enqueue time
> > because of the changes in estimated utilization and the
> > possibility to just straight to a better OPP
> >
> > 2. task tick time
> > because of the possible ramp-up of the utilization
> >
> > Another case is related to remote CPUs blocked utilization update,
> > after the recent Vincent's patches. Currently indeed:
> >
> > update_blocked_averages()
> > update_load_avg()
> > --> update schedutil
> >
> > and thus, potentially we wake up an IDLE cluster just to reduce its
> > OPP. If the cluster is in a deep idle state, I'm not entirely sure
> > this is good from an energy saving standpoint.
> > However, with the patch I'm proposing we are missing that support,
> > meaning that an IDLE cluster will get its utilization decayed but we
> > don't wake it up just to drop its frequency.
>
> So more than deciding in the scheduler if we should wake it up or not,
> we should give a chance to cpufreq to decide if it wants to update the
> frequency or not as this decision is somehow platform specific: cost
> of frequency change, clock topology and shared clock, voltage topology
> ...
Fair enough, then we should keep updating schedutil from that code path.
What about adding a new explicit callback at the end of:
update_blocked_averages() ?
Something like:
---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cb77407ba485..6eb0f31c656d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7740,6 +7740,9 @@ static void update_blocked_averages(int cpu)
if (done)
rq->has_blocked_load = 0;
#endif
+
+ cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
+
rq_unlock_irqrestore(rq, &rf);
}
---8<---
Where we can also pass in a new SCHED_CPUFREQ_IDLE flag just to notify
schedutil that the CPU is currently IDLE?
Could that work?
--
#include <best/regards.h>
Patrick Bellasi
On Tue, May 15, 2018 at 03:53:43PM +0100, Patrick Bellasi wrote:
> On 15-May 12:19, Vincent Guittot wrote:
> > On 14 May 2018 at 18:32, Patrick Bellasi <[email protected]> wrote:
> > Yes se becomes NULL only when you reach root domain
root group; domains are something else again ;-)
> Thus, the scheduler knows that we are going to sleep: does is really
> makes sense to send a notification in this case?
It might; esp. on these very slow changing machines.
> What about adding a new explicit callback at the end of:
> update_blocked_averages() ?
>
> Something like:
>
> ---8<---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cb77407ba485..6eb0f31c656d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7740,6 +7740,9 @@ static void update_blocked_averages(int cpu)
> if (done)
> rq->has_blocked_load = 0;
> #endif
> +
> + cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
> +
> rq_unlock_irqrestore(rq, &rf);
> }
> ---8<---
>
> Where we can also pass in a new SCHED_CPUFREQ_IDLE flag just to notify
> schedutil that the CPU is currently IDLE?
>
> Could that work?
Similarly you could add ENQUEUE/DEQUEUE flags I suppose. But let's do all
that later in separate patches and evaluate the impact separately, OK?
On 15-May 18:53, Peter Zijlstra wrote:
> On Tue, May 15, 2018 at 03:53:43PM +0100, Patrick Bellasi wrote:
> > On 15-May 12:19, Vincent Guittot wrote:
> > > On 14 May 2018 at 18:32, Patrick Bellasi <[email protected]> wrote:
> > What about adding a new explicit callback at the end of:
> > update_blocked_averages() ?
> >
> > Something like:
> >
> > ---8<---
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index cb77407ba485..6eb0f31c656d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7740,6 +7740,9 @@ static void update_blocked_averages(int cpu)
> > if (done)
> > rq->has_blocked_load = 0;
> > #endif
> > +
> > + cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
> > +
> > rq_unlock_irqrestore(rq, &rf);
> > }
> > ---8<---
> >
> > Where we can also pass in a new SCHED_CPUFREQ_IDLE flag just to notify
> > schedutil that the CPU is currently IDLE?
> >
> > Could that work?
>
> Simlarly you could add ENQUEUE/DEQUEUE flags I suppose. But let's do all
> that later in separate patches and evaluate the impact separately, OK?
Sure, that's better kept in a separate patch.
For the time being I'll just update this last patch with the above
snippet just to have the same behavior we have right now.
Then maybe we can also split this last patch into a separate series to
better evaluate the ideas to have "explicit" updates.
--
#include <best/regards.h>
Patrick Bellasi
On 15 May 2018 at 18:53, Peter Zijlstra <[email protected]> wrote:
> On Tue, May 15, 2018 at 03:53:43PM +0100, Patrick Bellasi wrote:
>> On 15-May 12:19, Vincent Guittot wrote:
>> > On 14 May 2018 at 18:32, Patrick Bellasi <[email protected]> wrote:
>
>> > Yes se becomes NULL only when you reach root domain
>
> root group; domains are something else again ;-)
yes good point :-)
>
>> Thus, the scheduler knows that we are going to sleep: does is really
>> makes sense to send a notification in this case?
>
> It might; esp. on these very slow changing machines.
>
>> What about adding a new explicit callback at the end of:
>> update_blocked_averages() ?
>>
>> Something like:
>>
>> ---8<---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index cb77407ba485..6eb0f31c656d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7740,6 +7740,9 @@ static void update_blocked_averages(int cpu)
>> if (done)
>> rq->has_blocked_load = 0;
>> #endif
>> +
>> + cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
>> +
>> rq_unlock_irqrestore(rq, &rf);
>> }
>> ---8<---
>>
>> Where we can also pass in a new SCHED_CPUFREQ_IDLE flag just to notify
>> schedutil that the CPU is currently IDLE?
>>
>> Could that work?
>
> Simlarly you could add ENQUEUE/DEQUEUE flags I suppose. But let's do all
> that later in separate patches and evaluate the impact separately, OK?
On 15 May 2018 at 16:53, Patrick Bellasi <[email protected]> wrote:
> On 15-May 12:19, Vincent Guittot wrote:
>> On 14 May 2018 at 18:32, Patrick Bellasi <[email protected]> wrote:
>> > On 12-May 23:25, Joel Fernandes wrote:
>> >> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
>> >> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
>
> [...]
>
>> >> > One question about this change. In enqueue, throttle and unthrottle - you are
>> >> > conditionally calling cpufreq_update_util incase the task was
>> >> > visible/not-visible in the hierarchy.
>> >> >
>> >> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
>> >> > Is this because of util_est or something? Could you add a comment here
>> >> > explaining why this is so?
>> >>
>> >> The big question I have is incase se != NULL, then its still visible at the
>> >> root RQ level.
>> >
>> > My understanding it that you get !se at dequeue time when we are
>> > dequeuing a task from a throttled RQ. Isn't it?
>>
>> Yes se becomes NULL only when you reach root domain
>
> Right, my point was mainly what I'm saing below: a task removed from a
> "throttled" cfs_rq is _already_ not visible from the root cfs_rq since
> it has been de-accounted at throttle_cfs_rq() time.
>
> [...]
>
>> > At dequeue time instead, since we certainly removed some estimated
>> > utilization, then I unconditionally updated schedutil.
>> >
>> > HOWEVER, I was not considering these two things:
>> >
>> > 1. for a task going to sleep, we still have its blocked utilization
>> > accounted in the cfs_rq utilization.
>>
>> It might be still interesting to reduce the frequency because the
>> blocked utilization can be lower than its estimated utilization.
>
> Good point, this is the case of a task which, in its last activation,
> executed for less time then its previous estimated utilization.
>
> However, it could also very well be the opposite, a task which
> executed more then its past activation. In this case a schedutil
> update could trigger a frequency increase.
> Thus, the scheduler knows that we are going to sleep: does is really
> makes sense to send a notification in this case?
Why do you say that we are going to sleep? A task going to sleep
doesn't mean that the CPU is going to sleep as well.
>
> To me that's not a policy to choose when it makes sense to change the
> frequency, but just the proper definition of when it makes sense to
> send a notification.
>
> IMHO we should better consider not only (and blindly) the utilization
> changes but also what the scheduler knows about the status of a task.
> Thus: if the utilization change while a task is running, it's worth to
> send a notification. While, when a task is done on a CPU and that CPU
> is likely going to be idle, then maybe we should skip the
> notification.
I don't think so. Depending on the C-state, the power consumption can
be impacted, and in addition we will have to do the frequency change
at wakeup.
>
>> > 2. for a task being migrated, at dequeue time we still have not
>> > removed the task's utilization from the cfs_rq's utilization.
>> > This usually happens later, for example we can have:
>> >
>> > move_queued_task()
>> > dequeue_task() --> CFS task dequeued
>> > set_task_cpu() --> schedutil updated
>> > migrate_task_rq_fair()
>> > detach_entity_cfs_rq()
>> > detach_entity_load_avg() --> CFS util removal
>> > enqueue_task()
>> >
>> > Moreover, the "CFS util removal" actually affects the cfs_rq only if
>> > we hold the RQ lock, otherwise we know that it's just back annotated
>> > as "removed" utilization and the actual cfs_rq utilization is fixed up
>> > at the next chance we have the RQ lock.
>> >
>> > Thus, I would say that in both cases, at dequeue time it does not make
>> > sense to update schedutil since we always see the task's utilization
>> > in the cfs_rq and thus we will not reduce the frequency.
>>
>> Yes only attach/detach make sense from an utilization pov and that's
>> where we should check for a frequency update for utilization
>
> Indeed, I was considering the idea to move the callbacks there, which
> are the only code places where we know for sure that some utilization
> joined or departed from a CPU.
>
> Still have to check better however, because these functions can be
> called also for non root cfs_rqs... thus we will need again the
> filtering condition we have now in the wrapper function.
yes probably
>
>> > NOTE, this is true independently from the refactoring I'm proposing.
>> > At dequeue time, although we call update_load_avg() on the root RQ,
>> > it does not make sense to update schedutil since we still see either
>> > the blocked utilization of a sleeping task or the not yet removed
>> > utilization of a migrating task. In both cases the risk is to ask for
>> > an higher OPP right when a CPU is going to be IDLE.
>>
>> We have to take care of not mixing the opportunity to update the
>> frequency when we are updating the utilization with the policy that we
>> want to apply regarding (what we think that is) the best time to
>> update the frequency. Like saying that we should wait a bit more to
>> make sure that the current utilization is sustainable because a
>> frequency change is expensive on the platform (or not)
>
> I completely agree on keeping utilization update notification separated
> from schedutil decisions...
>
>> It's not because a task is dequeued that we should not update and
>> increase the frequency; Or even that we should not decrease it because
>> we have just taken into account some removed utilization of a previous
>> migration.
>> The same happen when a task migrates, we don't know if the utilization
>> that is about to be migrated, will be higher or lower than the normal
>> update of the utilization (since the last update) and can not generate
>> a frequency change
>>
>> I see your explanation above like a kind of policy where you want to
>> balance the cost of a frequency change with the probability that we
>> will not have to re-update the frequency soon.
>
> That was not my thinking. What I wanted to say is just that we should
> send notification when it makes really sense, because we have the most
> valuable information to pass.
>
> Thus, notifying schedutil when we update the RQ utilization is a bit
> of a greedy approach with respect to the information the scheduler has.
>
> In the migration example above:
> - first we update the RQ utilization
> - then we actually remove from the RQ the utilization of the migrated
> task
> If we notify schedutil at the first step we are more likely to pass an
> already outdated information, since from the scheduler standpoint we
> know that we are going to reduce the CPU utilization quite soon.
> Thus, would it not be better to defer the notification at detach time?
Better or not, I would say that this depends on the platform: the cost
of changing the frequency, how many OPPs there are, and the gap between
these OPPs ...
>
> After all that's the original goal of this patch
>
>> I agree that some scheduling events give higher chances of a
>> sustainable utilization level and we should favor these events when
>> the frequency change is costly, but I'm not sure that we should
>> remove every other opportunity to adjust the frequency to the
>> current utilization level when the cost is low or negligible.
>
> Maybe we can try to run hackbench to quantify the overhead we add with
> useless schedutil updates. However, my main concern is that if we
> want a proper decoupling between the scheduler and schedutil, then we
> also have to ensure that we call back for updates only when it really
> makes sense.
>
> Otherwise, the risk is that the schedutil policy will take decisions
> based on wrong assumptions like: ok, let's increase the OPP (since I
> can now and it's cheap) without knowing that the CPU is instead going
> to be almost empty or even IDLE.
>
>> Can't we classify the utilization events into some kind of major and
>> minor changes?
>
> Doesn't a classification itself look more like a policy?
>
> Maybe we can consider it, but still I think we should be able to find
> when the scheduler has the most accurate and updated information about
> the tasks actually RUNNABLE on a CPU and at that point send a
> notification to schedutil.
>
> IMO there are a few small events in which the utilization could see big
> changes: and these are wakeups (because of util_est) and migrations.
These 2 events seem very few
> For all the rest, the tick should be a good enough update rate,
> considering also that, even at 250Hz, in 4ms PELT never builds up more
> than ~8%.
And at 100Hz, which is the default for arm32, it's almost 20%
>
> [...]
>
>> > .:: Conclusions
>> > ===============
>> >
>> > All that considered, I think I've convinced myself that we really need
>> > to notify schedutil only in these cases:
>> >
>> > 1. enqueue time
>> > because of the changes in estimated utilization and the
>> > possibility to jump straight to a better OPP
>> >
>> > 2. task tick time
>> > because of the possible ramp-up of the utilization
>> >
>> > Another case is related to remote CPUs blocked utilization update,
>> > after the recent Vincent's patches. Currently indeed:
>> >
>> > update_blocked_averages()
>> > update_load_avg()
>> > --> update schedutil
>> >
>> > and thus, potentially we wake up an IDLE cluster just to reduce its
>> > OPP. If the cluster is in a deep idle state, I'm not entirely sure
>> > this is good from an energy saving standpoint.
>> > However, with the patch I'm proposing we are missing that support,
>> > meaning that an IDLE cluster will get its utilization decayed but we
>> > don't wake it up just to drop its frequency.
>>
>> So more than deciding in the scheduler if we should wake it up or not,
>> we should give a chance to cpufreq to decide if it wants to update the
>> frequency or not as this decision is somehow platform specific: cost
>> of frequency change, clock topology and shared clock, voltage topology
>> ...
>
> Fair enough, then we should keep updating schedutil from that code path.
>
> What about adding a new explicit callback at the end of:
> update_blocked_averages() ?
>
> Something like:
>
> ---8<---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index cb77407ba485..6eb0f31c656d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7740,6 +7740,9 @@ static void update_blocked_averages(int cpu)
> if (done)
> rq->has_blocked_load = 0;
> #endif
> +
> + cpufreq_update_util(rq, SCHED_CPUFREQ_IDLE);
> +
> rq_unlock_irqrestore(rq, &rf);
> }
> ---8<---
>
> Where we can also pass in a new SCHED_CPUFREQ_IDLE flag just to notify
> schedutil that the CPU is currently IDLE?
>
> Could that work?
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
On 16-May 09:12, Vincent Guittot wrote:
> On 15 May 2018 at 16:53, Patrick Bellasi <[email protected]> wrote:
> > On 15-May 12:19, Vincent Guittot wrote:
> >> On 14 May 2018 at 18:32, Patrick Bellasi <[email protected]> wrote:
> >> > On 12-May 23:25, Joel Fernandes wrote:
> >> >> On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> >> >> > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> >
> > [...]
> >
> >> >> > One question about this change. In enqueue, throttle and unthrottle - you are
> >> >> > conditionally calling cpufreq_update_util in case the task was
> >> >> > visible/not-visible in the hierarchy.
> >> >> >
> >> >> > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
> >> >> > Is this because of util_est or something? Could you add a comment here
> >> >> > explaining why this is so?
> >> >>
> >> >> The big question I have is: in case se != NULL, it's still visible at the
> >> >> root RQ level.
> >> >
> >> > My understanding is that you get !se at dequeue time when we are
> >> > dequeuing a task from a throttled RQ. Isn't it?
> >>
> >> Yes se becomes NULL only when you reach root domain
> >
> > Right, my point was mainly what I'm saying below: a task removed from a
> > "throttled" cfs_rq is _already_ not visible from the root cfs_rq since
> > it has been de-accounted at throttle_cfs_rq() time.
> >
> > [...]
> >
> >> > At dequeue time instead, since we certainly removed some estimated
> >> > utilization, then I unconditionally updated schedutil.
> >> >
> >> > HOWEVER, I was not considering these two things:
> >> >
> >> > 1. for a task going to sleep, we still have its blocked utilization
> >> > accounted in the cfs_rq utilization.
> >>
> >> It might be still interesting to reduce the frequency because the
> >> blocked utilization can be lower than its estimated utilization.
> >
> > Good point, this is the case of a task which, in its last activation,
> > executed for less time than its previous estimated utilization.
> >
> > However, it could also very well be the opposite: a task which
> > executed more than in its past activation. In this case a schedutil
> > update could trigger a frequency increase.
> > Thus, the scheduler knows that we are going to sleep: does it really
> > make sense to send a notification in this case?
>
> Why do you say that we are going to sleep? A task going to sleep
> doesn't mean that the CPU is going to sleep as well
Yes, sure... that was just an example of a single task running on that
CPU. If this is not the last task, we will still have a tick in the
next (e.g.) 4ms... which will give us a better reading for all the
classes in any case.
> > To me that's not a policy to choose when it makes sense to change the
> > frequency, but just the proper definition of when it makes sense to
> > send a notification.
> >
> > IMHO we should better consider not only (and blindly) the utilization
> > changes but also what the scheduler knows about the status of a task.
> > Thus: if the utilization changes while a task is running, it's worth
> > sending a notification. While, when a task is done on a CPU and that
> > CPU is likely going to be idle, then maybe we should skip the
> > notification.
>
> I don't think so. Depending on the C-state, the power consumption can
> be impacted, and in addition we will have to do the frequency change at
> wake up
Idle energy impacts more on shallow idle states, but that means that
(likely) we are also going to sleep for a short amount of time... and
then we will have a wakeup->enqueue where the frequency will be updated
(eventually).
It seems that a frequency drop at IDLE entry could be really
beneficial only if the blocked utilization is way smaller
than the estimated one. Thus, the idea to send a schedutil
notification with an explicit SCHED_CPUFREQ_IDLE flag could help in
supporting a proper decision policy from the schedutil side.
[...]
> > That was not my thinking. What I wanted to say is just that we should
> > send notifications when it really makes sense, because we have the most
> > valuable information to pass.
> >
> > Thus, notifying schedutil when we update the RQ utilization is a bit
> > of a greedy approach with respect to the information the scheduler has.
> >
> > In the migration example above:
> > - first we update the RQ utilization
> > - then we actually remove from the RQ the utilization of the migrated
> > task
> > If we notify schedutil at the first step we are more likely to pass
> > already outdated information, since from the scheduler's standpoint we
> > know that we are going to reduce the CPU utilization quite soon.
> > Thus, would it not be better to defer the notification to detach time?
>
> better or not, I would say that this depends on the platform: the cost
> of changing the frequency, how many OPPs there are, and the gap between
> these OPPs ...
Yes, that decision can be supported by the additional hint provided by
the new flag.
> > After all that's the original goal of this patch
> >
> >> I agree that some scheduling events give higher chances of a
> >> sustainable utilization level and we should favor these events when
> >> the frequency change is costly, but I'm not sure that we should
> >> remove every other opportunity to adjust the frequency to the
> >> current utilization level when the cost is low or negligible.
> >
> > Maybe we can try to run hackbench to quantify the overhead we add with
> > useless schedutil updates. However, my main concern is that if we
> > want a proper decoupling between the scheduler and schedutil, then we
> > also have to ensure that we call back for updates only when it really
> > makes sense.
> >
> > Otherwise, the risk is that the schedutil policy will take decisions
> > based on wrong assumptions like: ok, let's increase the OPP (since I
> > can now and it's cheap) without knowing that the CPU is instead going
> > to be almost empty or even IDLE.
> >
> >> Can't we classify the utilization events into some kind of major and
> >> minor changes?
> >
> > Doesn't a classification itself look more like a policy?
> >
> > Maybe we can consider it, but still I think we should be able to find
> > when the scheduler has the most accurate and updated information about
> > the tasks actually RUNNABLE on a CPU and at that point send a
> > notification to schedutil.
> >
> > IMO there are a few small events in which the utilization could see big
> > changes: and these are wakeups (because of util_est) and migrations.
>
> These 2 events seem very few
>
> > For all the rest, the tick should be a good enough update rate,
> > considering also that, even at 250Hz, in 4ms PELT never builds up more
> > than ~8%.
>
> And at 100Hz, which is the default for arm32, it's almost 20%
Sure, 250Hz is the default for arm64, but still... an 8-20% change is
the worst case, where you have to ramp up from 0: a task which toggles
between being small and big, which also sounds more like a
theoretical case. Is there any other case?
Otherwise, for pretty much periodic tasks, util_est will give you the
right utilization for each activation. While new tasks always start
with a fraction of the spare utilization, don't they?
Thus the tick seems to be "mainly" a fallback mechanism to follow the
utilization increase for tasks changing their behaviors.
Moreover, if your task is running for 10ms, on an Android system,
where usually you have 16ms periods, then it's a pretty big one (~60%)
and thus its utilization increases even more slowly: in 10ms you can
build up only 8% utilization.
All that to say that, excluding a few corner cases, a tick-based
update is likely frequent enough... and I'm not factoring in here all
the approximations we have in PELT regarding the proper evaluation of
how big a task is... having more frequent updates seems to me like
over-engineering the solution, introducing overheads without
clear benefits.
--
#include <best/regards.h>
Patrick Bellasi
Hi Patrick,
On Mon, May 14, 2018 at 05:32:06PM +0100, Patrick Bellasi wrote:
> On 12-May 23:25, Joel Fernandes wrote:
> > On Sat, May 12, 2018 at 11:04:43PM -0700, Joel Fernandes wrote:
> > > On Thu, May 10, 2018 at 04:05:53PM +0100, Patrick Bellasi wrote:
> > > > Schedutil updates for FAIR tasks are triggered implicitly each time a
> > > > cfs_rq's utilization is updated via cfs_rq_util_change(), currently
> > > > called by update_cfs_rq_load_avg(), when the utilization of a cfs_rq has
> > > > changed, and {attach,detach}_entity_load_avg().
> > > >
> > > > This design is based on the idea that "we should callback schedutil
> > > > frequently enough" to properly update the CPU frequency at every
> > > > utilization change. However, such an integration strategy has also
> > > > some downsides:
> > >
> > > I agree making the call explicit would make schedutil integration easier so
> > > that's really awesome. However I also fear that if some path in the fair
> > > class in the future changes the utilization but forgets to update schedutil
> > > explicitly (because they forgot to call the explicit public API) then the
> > > schedutil update wouldn't go through. In this case the previous design of
> > > doing the schedutil update in the wrapper kind of was a nice to have
>
> Right now I cannot see other possible future paths where we could
> actually change the utilization signal without considering that,
> eventually, we should call an existing API to update schedutil if it
> makes sense.
>
> What I can see more likely instead, also because it has already happened
> a couple of times, is that because of code changes in fair.c we end up
> calling (implicitly) schedutil with a wrong utilization value.
>
> Noticing this kind of broken dependency has already proven more
> difficult than it would be to notice an update of the utilization
> without a corresponding explicit call of the public API.
Ok, we are in agreement this is a good thing to do :)
> > > > @@ -5397,9 +5366,27 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > update_cfs_group(se);
> > > > }
> > > >
> > > > - if (!se)
> > > > + /* The task is visible from the root cfs_rq */
> > > > + if (!se) {
> > > > + unsigned int flags = 0;
> > > > +
> > > > add_nr_running(rq, 1);
> > > >
> > > > + if (p->in_iowait)
> > > > + flags |= SCHED_CPUFREQ_IOWAIT;
> > > > +
> > > > + /*
> > > > + * !last_update_time means we've passed through
> > > > + * migrate_task_rq_fair() indicating we migrated.
> > > > + *
> > > > + * IOW we're enqueueing a task on a new CPU.
> > > > + */
> > > > + if (!p->se.avg.last_update_time)
> > > > + flags |= SCHED_CPUFREQ_MIGRATION;
> > > > +
> > > > + cpufreq_update_util(rq, flags);
> > > > + }
> > > > +
> > > > hrtick_update(rq);
> > > > }
> > > >
> > > > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > update_cfs_group(se);
> > > > }
> > > >
> > > > + /* The task is no more visible from the root cfs_rq */
> > > > if (!se)
> > > > sub_nr_running(rq, 1);
> > > >
> > > > util_est_dequeue(&rq->cfs, p, task_sleep);
> > > > + cpufreq_update_util(rq, 0);
> > >
> > > One question about this change. In enqueue, throttle and unthrottle - you are
> > > conditionally calling cpufreq_update_util in case the task was
> > > visible/not-visible in the hierarchy.
> > >
> > > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
> > > Is this because of util_est or something? Could you add a comment here
> > > explaining why this is so?
> >
> > The big question I have is: in case se != NULL, it's still visible at the
> > root RQ level.
>
> My understanding is that you get !se at dequeue time when we are
> dequeuing a task from a throttled RQ. Isn't it?
I don't think so? !se means the RQ is not throttled.
> Thus, this means you are dequeuing a throttled task, I guess for
> example because of a migration.
> However, the point is that a task dequeued from a throttled RQ _is
> already_ not visible from the root RQ, because of the sub_nr_running()
> done by throttle_cfs_rq().
Yes that's what I was wondering, so my point was: if it's already not
visible, then why call schedutil? I felt we should call schedutil only
if it's visible, like you were doing for the other paths.
>
> > In that case should we still call the util_est_dequeue and the
> > cpufreq_update_util?
>
> I had a better look at the different code paths and I've possibly come
> up with some interesting observations. Let me try to summarize them here.
>
> First of all, we need to distinguish between estimated utilization
> updates and schedutil updates, since they respond to two very
> different goals.
I agree with your assessments below and about not calling cpufreq when
the CPU is about to idle.
thanks!
- Joel
Hi Joel,
sorry for the late reply; this thread is a bit confusing because we
keep discussing while a v2 was already posted on the list.
However, here are few comments below...
[...]
> > > > > @@ -5456,10 +5443,12 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > > update_cfs_group(se);
> > > > > }
> > > > >
> > > > > + /* The task is no more visible from the root cfs_rq */
> > > > > if (!se)
> > > > > sub_nr_running(rq, 1);
> > > > >
> > > > > util_est_dequeue(&rq->cfs, p, task_sleep);
> > > > > + cpufreq_update_util(rq, 0);
> > > >
> > > > One question about this change. In enqueue, throttle and unthrottle - you are
> > > > conditionally calling cpufreq_update_util in case the task was
> > > > visible/not-visible in the hierarchy.
> > > >
> > > > But in dequeue you're unconditionally calling it. Seems a bit inconsistent.
> > > > Is this because of util_est or something? Could you add a comment here
> > > > explaining why this is so?
> > >
> > > The big question I have is: in case se != NULL, it's still visible at the
> > > root RQ level.
> >
> > My understanding is that you get !se at dequeue time when we are
> > dequeuing a task from a throttled RQ. Isn't it?
>
> I don't think so? !se means the RQ is not throttled.
Yes, I agree, I "just" forgot a "not" in the sentence above... my bad!
However, we are on the same page here.
> > Thus, this means you are dequeuing a throttled task, I guess for
> > example because of a migration.
> > However, the point is that a task dequeued from a throttled RQ _is
> > already_ not visible from the root RQ, because of the sub_nr_running()
> > done by throttle_cfs_rq().
>
> Yes that's what I was wondering, so my point was: if it's already not
> visible, then why call schedutil? I felt we should call schedutil only
> if it's visible, like you were doing for the other paths.
Agree; as discussed with Vincent in v2, we should likely move these
schedutil updates to attach/detach time. This is exactly when we know
that the utilization has changed for a CPU.
... and that's what I'll propose in the upcoming v3 of this patch.
[...]
> I agree with your assessments below and about not calling cpufreq
> when the CPU is about to idle.
Cool ;)
--
#include <best/regards.h>
Patrick Bellasi